Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++: Fix some FPs in cpp/missing-check-scanf #16009

Closed
wants to merge 10 commits into from

Conversation

MathiasVP
Copy link
Contributor

This PR fixes a simple class of FPs in cpp/missing-check-scanf by improving the guards library. Previously, the guards library understood things such as:

if(a == b) {
  // a == b is inferred here
}

but we failed to infer this equality in:

bool r = a == b;
if(r) {
  // a == b was NOT inferred here
}

because the guards library hardcoded that the comparison should be the actual thing being branched on. This PR fixes this by using value numbers in the internal guard logic predicates. And since r and a == b has the same value number these two cases are now identical 🎉

Annoyingly, some FPs still appear because of the awkward conversions between IRBlocks and BasicBlocks done. This FPs should ideally be fixed by completely getting rid of BasicBlocks, but that's a job for another PR.

For now, I'm happy that a simple improvement to the guards library fixes the simple cases automagically 🎉

Commit-by-commit review recommended.

of the implementation predicates in 'IRGuards' from an 'Instruction' to
a value number. This means it doesn't matter whether the expression being
branched on is a comparison, or a use of a variable whose value is the
result of a comparison.

This first commit adds the various subclasses of value number that we
need the rest purely mechanical changes that replaces 'Instruction's with
'ValueNumber's in the relevant places.
subclasses of value numbers added in the previous commit.
@MathiasVP MathiasVP requested a review from a team as a code owner March 21, 2024 10:40
@github-actions github-actions bot added the C++ label Mar 21, 2024
@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Mar 21, 2024
@@ -872,15 +872,15 @@ private predicate simple_comparison_eq(
}

private predicate complex_eq(
CompareInstruction cmp, Operand left, Operand right, int k, boolean areEqual, AbstractValue value
ValueNumber cmp, Operand left, Operand right, int k, boolean areEqual, AbstractValue value
Copy link
Contributor

@jketema jketema Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for not using the newly introduced CompareValueNumber here? And also in some other places in the commit that introduces this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually does matter because of the LogicalNotInstruction case:

exists(AbstractValue dual, LogicalNotInstruction logicalNot |
  value = dual.getDualValue() and
  logicalNot = test.getAnInstruction() and
  compares_eq(valueNumber(logicalNot.getUnary()), left, right, k, areEqual, dual)
)

here we don't require the value number to be a CompareValueNumber, but instead we require it to be something of the form !x, where the x is then used recursively.

I'll see if I can come up with a testcase where this is important 🤔

@MathiasVP
Copy link
Contributor Author

I see some unexpected library test changes. I'll investigate them!

@aschackmull
Copy link
Contributor

We really should get a shared Guards library. Anyway, just wanted to note that the Java Guards library already supports guards derived from assignments to boolean variables, but the approach is somewhat different, and, I think, more extensible. The Java version has the GuardsLogic library, which extends guards by making use of all sorts of implications that can be derived between conditions. Ideally all of this would be shared code...

@MathiasVP
Copy link
Contributor Author

We really should get a shared Guards library. Anyway, just wanted to note that the Java Guards library already supports guards derived from assignments to boolean variables, but the approach is somewhat different, and, I think, more extensible. The Java version has the GuardsLogic library, which extends guards by making use of all sorts of implications that can be derived between conditions. Ideally all of this would be shared code...

Yeah, I agree. The C/C++ library has a very naive and ad-hoc approach to some of this.

@MathiasVP
Copy link
Contributor Author

MathiasVP commented Mar 21, 2024

I see some unexpected library test changes. I'll investigate them!

These are now fixed by d67dae8 and 2252936. I'll rerun DCA.

Even though Schack described how it would be much better to have a shared guards library I still think this PR is worth it since it fixes some actual FPs in queries, and a shared guards library is still a (Glorious) Future item that no one is currently working on right now.

@MathiasVP
Copy link
Contributor Author

Hm. I want to check Kamailio's 9% analysis speed regression before we move forward with this 🤔

@MathiasVP
Copy link
Contributor Author

Hm, the use of value numbering here is biting us, unfortunately. On main, when running ensuresEq on Kamailio we get:

Evaluated relational algebra for predicate IRGuards::GuardCondition.ensuresEq/5#dispred#42506afe@628ffbju with tuple counts:
  10627332   ~0%    {6} r1 = JOIN `_IRGuards::GuardCondition.controls/2#dispred#b7ca92ee_IRGuards::GuardConditionFromBinaryLogicalOpera__#shared` WITH `IRGuards::GuardCondition.comparesEq/5#dispred#f67bf973_051234#join_rhs` ON FIRST 2 OUTPUT Lhs.0, Rhs.2, Rhs.3, Rhs.4, Lhs.2, Rhs.5
                
  8659784   ~0%    {6} r2 = JOIN `_IRGuards::GuardCondition.controls/2#dispred#b7ca92ee_IRGuards::GuardConditionFromIR#2c83f350#shared` WITH `IRGuards::IRGuardCondition.comparesEq/5#dispred#8751d3be_051234#join_rhs` ON FIRST 2 OUTPUT Rhs.2, Lhs.2, Lhs.3, Rhs.3, Rhs.4, Rhs.5
  8659784   ~2%    {6}    | JOIN WITH `Instruction::Instruction.getAUse/0#dispred#3072f0dc_10#join_rhs` ON FIRST 1 OUTPUT Rhs.1, Lhs.1, Lhs.2, Lhs.3, Lhs.4, Lhs.5
  8657912   ~0%    {6}    | JOIN WITH `Instruction::Instruction.getUnconvertedResultExpression/0#dispred#d46d1df3` ON FIRST 1 OUTPUT Lhs.3, Lhs.1, Lhs.2, Lhs.4, Lhs.5, Rhs.1
  8657912   ~0%    {6}    | JOIN WITH `Instruction::Instruction.getAUse/0#dispred#3072f0dc_10#join_rhs` ON FIRST 1 OUTPUT Rhs.1, Lhs.1, Lhs.2, Lhs.3, Lhs.4, Lhs.5
  8656040   ~3%    {6}    | JOIN WITH `Instruction::Instruction.getUnconvertedResultExpression/0#dispred#d46d1df3` ON FIRST 1 OUTPUT Lhs.1, Lhs.5, Rhs.1, Lhs.3, Lhs.2, Lhs.4
                
  19283372   ~5%    {6} r3 = r1 UNION r2
                    return r3

whereas on this PR we get:

[2024-03-21 15:02:50] Evaluated non-recursive predicate IRGuards::GuardCondition.ensuresEq/5#dispred#42506afe@1cfa5anp in 228413ms (size: 264371532).
Evaluated relational algebra for predicate IRGuards::GuardCondition.ensuresEq/5#dispred#42506afe@1cfa5anp with tuple counts:
  129856584   ~4%    {6} r1 = JOIN `_IRGuards::GuardCondition.controls/2#dispred#b7ca92ee_IRGuards::GuardConditionFromBinaryLogicalOpera__#shared` WITH `IRGuards::GuardCondition.comparesEq/5#dispred#f67bf973_051234#join_rhs` ON FIRST 2 OUTPUT Lhs.0, Rhs.2, Rhs.3, Rhs.4, Lhs.2, Rhs.5
                  
  134518866   ~0%    {6} r2 = JOIN `_IRGuards::GuardCondition.controls/2#dispred#b7ca92ee_IRGuards::GuardConditionFromIR#2c83f350#shared` WITH `IRGuards::IRGuardCondition.comparesEq/5#dispred#8751d3be_051234#join_rhs` ON FIRST 2 OUTPUT Rhs.2, Lhs.2, Lhs.3, Rhs.3, Rhs.4, Rhs.5
  134518866   ~0%    {6}    | JOIN WITH `Instruction::Instruction.getAUse/0#dispred#3072f0dc_10#join_rhs` ON FIRST 1 OUTPUT Rhs.1, Lhs.1, Lhs.2, Lhs.3, Lhs.4, Lhs.5
  134516907   ~0%    {6}    | JOIN WITH `Instruction::Instruction.getUnconvertedResultExpression/0#dispred#d46d1df3` ON FIRST 1 OUTPUT Lhs.3, Lhs.1, Lhs.2, Lhs.4, Lhs.5, Rhs.1
  134516907   ~5%    {6}    | JOIN WITH `Instruction::Instruction.getAUse/0#dispred#3072f0dc_10#join_rhs` ON FIRST 1 OUTPUT Rhs.1, Lhs.1, Lhs.2, Lhs.3, Lhs.4, Lhs.5
  134514948   ~0%    {6}    | JOIN WITH `Instruction::Instruction.getUnconvertedResultExpression/0#dispred#d46d1df3` ON FIRST 1 OUTPUT Lhs.1, Lhs.5, Rhs.1, Lhs.3, Lhs.2, Lhs.4
                  
  264371532   ~0%    {6} r3 = r1 UNION r2
                      return r3

i.e., the pipeline grew by a factor of ~13 😱

To see where these come from, I looked at the DIL for computing r1 which corresponds roughly to (after removing irrelevant columns):

int histogram(GuardConditionFromBinaryLogicalOperator gc) {
  result =
    count(BasicBlock block, Expr e |
      exists(AbstractValue value |
        gc.valueControls(block, value) and
        gc.comparesEq(e, _, _, value)
      )
    )
}

the largest rows on main is 2598, whereas on this PR there are many rows with >25000 results. In particular, the guard here is now (incorrectly) inferred to provide an equality for any use of ket->ptypes[0] that is guarded by an value-number equivalent guard condition. So to fix this we'd need to put some sort of dominance criteria on the e above, and at this point I think this is getting more complex than I was aiming for 😭 So we should probably just leave this as-is until we start working on a properly shared guards library.

Oh well

@MathiasVP MathiasVP closed this Mar 21, 2024
@bdrodes
Copy link
Contributor

bdrodes commented Nov 4, 2024

I investigated this a bit. I decided to try a different approach. Rather than using GVN for every step in the conditional parsing, I specifically only get GVN if the 'next' instruction differs. The intuition is this would/should only give me the current instruction or any defining instruction (since the definition should be the only place where the GVN instruction differs). I.e., I basically hacked around GVN to give me the SSA def/use. SSA seems like the preferred choice honestly, so I'm open to suggestions there, but I'll make a separate PR for that discussion.

Using your histogram @MathiasVP I'm seeing 81 rows with counts > 2000, and the highest is 16000. There is only one result > 6400.

This is still higher than main according to @MathiasVP's original findings, but I have a suspicion it will provide much greater performance.

I'll work up a PR for this separately for further testing/comments.

@bdrodes
Copy link
Contributor

bdrodes commented Nov 4, 2024

@MathiasVP , I've created a draft PR here: https://github.com/github/codeql/pull/17907/files

One thing to note, I did not see any difference in the range analysis tests, like you had in this PR. Is that an error? We can take further discussion over to this PR.

@MathiasVP
Copy link
Contributor Author

Yeah, it likely means something isn't working over in #17907. Let's continue over there 🏃‍♂️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ no-change-note-required This PR does not need a change note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants