Skip to content

C++: Rewrite cpp/path-injection to not use DefaultTaintTracking #11435

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Dec 6, 2022

Conversation

jketema
Copy link
Contributor

@jketema jketema commented Nov 25, 2022

This is the most straightforward rewrite that made both the test here and the ones in the coding standards repo pass. This might not be sufficient to get parity with (or improvement over the old query). Will look at DCA and MRVA results (in that order).

Only annoying thing is that there's now some duplication in the results. I'm not sure if that's easily removed? There's both duplication on the source side and on the sink side, and this is the cause for the bulk of the increase in DCA results.

@jketema jketema requested a review from a team as a code owner November 25, 2022 16:12
@github-actions github-actions bot added the C++ label Nov 25, 2022
@jketema jketema force-pushed the rewrite-tainted-path branch 2 times, most recently from 0159cd4 to 7933172 Compare November 25, 2022 16:13
@jketema jketema force-pushed the rewrite-tainted-path branch from d735b35 to 165c882 Compare November 26, 2022 10:42
Copy link
Contributor

@geoffw0 geoffw0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, though I definitely agree about the need to look at differences on MRVA.

On the topic of result duplication, two approaches spring to mind. If duplicates are caused by sources / sinks of the form:

result = a or
result = b

try changing them to something like:

result = a
or
(
  not exists(a) and
  result = b
)

The other thought is that if there are two sinks at slightly different DataFlow::Nodes, we might try making sinks a SanitizerOut so that we only get flow to the first of them.

Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! We need to validate that the new results we're getting are mostly sensible, but once that's done I think this LGTM!

@jketema
Copy link
Contributor Author

jketema commented Nov 28, 2022

Regarding duplication: neither of those solutions work, as the source nodes and the sink nodes of the duplicated results might have no other solution than that the underlying expression is the same. There's no flow between the two alternatives. The only thing that is giving me something close to the old result is dropping in something like:

  predicate hasFilteredFlowPath(DataFlow::PathNode source, DataFlow::PathNode sink) {
    this.hasFlowPath(source, sink) and
    not exists(DataFlow::PathNode source2, DataFlow::PathNode sink2 |
      this.hasFlowPath(source2, sink2) and
      asSourceExpr(source.getNode()) = asSourceExpr(source2.getNode()) and
      asSinkExpr(sink.getNode()) = asSinkExpr(sink2.getNode())
    |
      not exists(source.getNode().asConvertedExpr()) and exists(source2.getNode().asConvertedExpr())
      or
      not exists(sink.getNode().asConvertedExpr()) and exists(sink2.getNode().asConvertedExpr())
    )
  }

this above is the defined configuration.

Also it turns out the DCA results are very deceptive here although DCA only reports 19 more results for vim, the result count (even with the above predicate dropped in) goes from slightly over 200 to slightly over 600(!).

@geoffw0
Copy link
Contributor

geoffw0 commented Nov 28, 2022

Regarding duplication: neither of those solutions work, as the source nodes and the sink nodes of the duplicated results might have no other solution than that the underlying expression is the same. There's no flow between the two alternatives.

That's unfortunate. Where there are two alternatives with no flow between them, can we identify a common parent perhaps?

@jketema
Copy link
Contributor Author

jketema commented Nov 28, 2022

can we identify a common parent perhaps?

What do you mean by parent?

@geoffw0
Copy link
Contributor

geoffw0 commented Nov 28, 2022

What do you mean by parent?

I was thinking on the AST, but any kind of parent would do. If the duplicates, where they exist, always have a common parent (that would not be common to a non-duplicate) we can pair them against each other and choose one of the duplicates to keep. I'm not sure what kinds of things we actually have at the duplicate sources and sinks, however. Might have to dig deeper.

@jketema
Copy link
Contributor Author

jketema commented Nov 28, 2022

I was thinking on the AST

Well, that's the problem right: it's the same expression twice in each case. Hence the code in #11435 (comment), where is just pick one at "random".

@geoffw0
Copy link
Contributor

geoffw0 commented Nov 28, 2022

It looks like you're way ahead of me in the discussion thread with @MathiasVP , so I'll leave this to the two of you now. Let me know if you get stuck / would like my opinion on anything.

Without this we get confusing results:
```
    char *userAndFile = argv[2];
    char *fileName = argv[1];
    fopen(fileName, "wb+"); // Both argv[1] and argv[2] marked as source without
                            // this change.
```

While here add some more test cases.
These are not likely to give the user much control over what can be accessed.
@jketema jketema force-pushed the rewrite-tainted-path branch from 165c882 to d3cccca Compare November 29, 2022 10:18
@jketema
Copy link
Contributor Author

jketema commented Nov 30, 2022

As far as I can currently tell, most new results I'm seeing in DCA are coming from the rewritten version not having a sanitizer mimicking the following one from default taint tracking

exists(Variable checkedVar, Operand access |
/*
* This node is guarded by a condition that forces the accessed variable
* to equal something else. For example:
* ```
* x = taintsource()
* if (x == 10) {
* taintsink(x); // not considered tainted
* }
* ```
*/
nodeIsBarrierEqualityCandidate(node, access, checkedVar) and
readsVariable(access.getDef(), checkedVar)
)
Is this something we want to have?

@MathiasVP
Copy link
Contributor

As far as I can currently tell, most new results I'm seeing in DCA are coming from the rewritten version not having a sanitizer mimicking the following one from default taint tracking

In general, I really don't like that sanitizer in DefaultTaintTracking (since a node may be marked as a sanitizer when the equality comparison is totally unrelated to the dataflow path). It's a good hammer for when a query has an unreasonable number of FPs, but my guess is that it's a lot more useful in query that does taint through integral expressions (where I'd expect a lot more equality comparisons).

@jketema jketema force-pushed the rewrite-tainted-path branch from bb7faf0 to 3dfe18b Compare December 1, 2022 08:14
@jketema
Copy link
Contributor Author

jketema commented Dec 1, 2022

Turns out the real difference in number of results was coming from hasUpperBoundsCheck. Having that is not ideal either, but introducing it in this rewrite gives as very similar before and after results in DCA.

I briefly discussed with @geoffw0 and I agree with him that we should aim here for a very similar set of results when comparing with the version we had before the rewrite. Improving the query is a separate task, which we can do at a later moment, and which should now be easier, as we no longer depend on DefaultTaintTracking.

@jketema
Copy link
Contributor Author

jketema commented Dec 2, 2022

MRVA results on 996 projects (one workflow timed out): https://gist.github.com/jketema/b2da9ff4d0939ca3a63e15bbcd05a556 There are 188 result changes in total (both lost and gained). This seems acceptable to me.

@jketema
Copy link
Contributor Author

jketema commented Dec 5, 2022

As briefly discussed last week. Could you have a look at this @rdmarsh2? Especially the hasFilteredFlowPath predicate.

@rdmarsh2
Copy link
Contributor

rdmarsh2 commented Dec 5, 2022

Could we use RemoteFlowSource/LocalFlowSource rather than isUserInput to get the "right thing" for sources? For sinks, I'm surprised we can't just use the operand of ReadSideEffect. I think that would get rid of the need for the asSourceExpr/asSinkExpr wrappers and for hasFilteredFlowPath.

@jketema
Copy link
Contributor Author

jketema commented Dec 5, 2022

Could we use RemoteFlowSource/LocalFlowSource rather than isUserInput to get the "right thing" for sources?

If I replace isSink by:

  override predicate isSource(DataFlow::Node node) {
    node instanceof LocalFlowSource
    or
    node instanceof RemoteFlowSource
  }

I'm starting to lose results. In particular scanf related ones like:

    char fileName[20];
    scanf("%s", fileName);
    fopen(fileName, "wb+"); // BAD

For sinks, I'm surprised we can't just use the operand of ReadSideEffect.

That seems to work for the test cases we have. I frankly never tried this.

@MathiasVP
Copy link
Contributor

I'm starting to lose results. In particular scanf related ones like:

I hope we can do such a rewrite as a follow-up. We need to check that the FlowSource class covers all the cases that isUserInput covers, though.

@jketema
Copy link
Contributor Author

jketema commented Dec 5, 2022

We need to check that the FlowSource class covers all the cases that isUserInput covers, though.

Yeah, it currently doesn't, but let's not pull that into this PR.

@jketema jketema force-pushed the rewrite-tainted-path branch from d13b22a to 5637d57 Compare December 6, 2022 07:31
@jketema
Copy link
Contributor Author

jketema commented Dec 6, 2022

With the sink changes suggested by @rdmarsh2 there are 65 additional results (all losses in this case) in MRVA. Given the number of results this query gives, that's probably acceptable. So, if the query looks fine in its current form, I'd like to move on.

I'll file an internal issue for the scanf/LocalFlowSource issue identified above.

@jketema
Copy link
Contributor Author

jketema commented Dec 6, 2022

Does this need a change note?

@MathiasVP
Copy link
Contributor

Does this need a change note?

I'd say no.

@jketema jketema added the no-change-note-required This PR does not need a change note label Dec 6, 2022
@geoffw0
Copy link
Contributor

geoffw0 commented Dec 6, 2022

I'll file an internal issue for the scanf/LocalFlowSource issue identified above.

And fscanf, I think in CPP we consider file reads to be local flow sources as well. These sources will benefit other queries as well.

Copy link
Contributor

@MathiasVP MathiasVP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ no-change-note-required This PR does not need a change note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants