-
Notifications
You must be signed in to change notification settings - Fork 1.7k
C++: Rewrite cpp/path-injection
to not use DefaultTaintTracking
#11435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0159cd4
to
7933172
Compare
d735b35
to
165c882
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, though I definitely agree about the need to look at differences on MRVA.
On the topic of result duplication, two approaches spring to mind. If duplicates are caused by sources / sinks of the form:
result = a or
result = b
try changing them to something like:
result = a
or
(
not exists(a) and
result = b
)
The other thought is that if there are two sinks at slightly different DataFlow::Node
s, we might try making sinks a SanitizerOut
so that we only get flow to the first of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! We need to validate that the new results we're getting are mostly sensible, but once that's done I think this LGTM!
Regarding duplication: neither of those solutions work, as the source nodes and the sink nodes of the duplicated results might have no other solution than that the underlying expression is the same. There's no flow between the two alternatives. The only thing that is giving me something close to the old result is dropping in something like: predicate hasFilteredFlowPath(DataFlow::PathNode source, DataFlow::PathNode sink) {
this.hasFlowPath(source, sink) and
not exists(DataFlow::PathNode source2, DataFlow::PathNode sink2 |
this.hasFlowPath(source2, sink2) and
asSourceExpr(source.getNode()) = asSourceExpr(source2.getNode()) and
asSinkExpr(sink.getNode()) = asSinkExpr(sink2.getNode())
|
not exists(source.getNode().asConvertedExpr()) and exists(source2.getNode().asConvertedExpr())
or
not exists(sink.getNode().asConvertedExpr()) and exists(sink2.getNode().asConvertedExpr())
)
}
Also it turns out the DCA results are very deceptive here although DCA only reports 19 more results for vim, the result count (even with the above predicate dropped in) goes from slightly over 200 to slightly over 600(!). |
That's unfortunate. Where there are two alternatives with no flow between them, can we identify a common parent perhaps? |
What do you mean by parent? |
I was thinking on the AST, but any kind of parent would do. If the duplicates, where they exist, always have a common parent (that would not be common to a non-duplicate) we can pair them against each other and choose one of the duplicates to keep. I'm not sure what kinds of things we actually have at the duplicate sources and sinks, however. Might have to dig deeper. |
Well, that's the problem right: it's the same expression twice in each case. Hence the code in #11435 (comment), where is just pick one at "random". |
It looks like you're way ahead of me in the discussion thread with @MathiasVP , so I'll leave this to the two of you now. Let me know if you get stuck / would like my opinion on anything. |
Without this we get confusing results: ``` char *userAndFile = argv[2]; char *fileName = argv[1]; fopen(fileName, "wb+"); // Both argv[1] and argv[2] marked as source without // this change. ``` While here add some more test cases.
These are not likely to give the user much control over what can be accessed.
165c882
to
d3cccca
Compare
As far as I can currently tell, most new results I'm seeing in DCA are coming from the rewritten version not having a sanitizer mimicking the following one from default taint tracking codeql/cpp/ql/lib/semmle/code/cpp/ir/dataflow/DefaultTaintTracking.qll Lines 182 to 196 in b8c11de
|
In general, I really don't like that sanitizer in DefaultTaintTracking (since a node may be marked as a sanitizer when the equality comparison is totally unrelated to the dataflow path). It's a good hammer for when a query has an unreasonable number of FPs, but my guess is that it's a lot more useful in query that does taint through integral expressions (where I'd expect a lot more equality comparisons). |
bb7faf0
to
3dfe18b
Compare
Turns out the real difference in number of results was coming from I briefly discussed with @geoffw0 and I agree with him that we should aim here for a very similar set of results when comparing with the version we had before the rewrite. Improving the query is a separate task, which we can do at a later moment, and which should now be easier, as we no longer depend on |
MRVA results on 996 projects (one workflow timed out): https://gist.github.com/jketema/b2da9ff4d0939ca3a63e15bbcd05a556 There are 188 result changes in total (both lost and gained). This seems acceptable to me. |
As briefly discussed last week. Could you have a look at this @rdmarsh2? Especially the |
Could we use |
If I replace override predicate isSource(DataFlow::Node node) {
node instanceof LocalFlowSource
or
node instanceof RemoteFlowSource
} I'm starting to lose results. In particular char fileName[20];
scanf("%s", fileName);
fopen(fileName, "wb+"); // BAD
That seems to work for the test cases we have. I frankly never tried this. |
I hope we can do such a rewrite as a follow-up. We need to check that the FlowSource class covers all the cases that isUserInput covers, though. |
Yeah, it currently doesn't, but let's not pull that into this PR. |
d13b22a
to
5637d57
Compare
With the sink changes suggested by @rdmarsh2 there are 65 additional results (all losses in this case) in MRVA. Given the number of results this query gives, that's probably acceptable. So, if the query looks fine in its current form, I'd like to move on. I'll file an internal issue for the |
Does this need a change note? |
I'd say no. |
And |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
This is the most straightforward rewrite that made both the test here and the ones in the coding standards repo pass. This might not be sufficient to get parity with (or improvement over the old query). Will look at DCA and MRVA results (in that order).
Only annoying thing is that there's now some duplication in the results. I'm not sure if that's easily removed? There's both duplication on the source side and on the sink side, and this is the cause for the bulk of the increase in DCA results.