Skip to content

Add a standalone reducer and serialization support #1252

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

jinhuix
Copy link
Contributor

@jinhuix jinhuix commented Jun 30, 2025

Resolves #1241

This PR introduces a standalone SQL reducer mechanism to SQLancer. It enables automated simplification of bug-inducing SQL queries in an offline manner via serialized context. The PR is split into 5 commits for clarity:

  1. Serializable context: Add SerializableReducerContext class to store reproducible context (SQL queries, expected errors, and oracle type, etc).
  2. SimpleReducer: Implement a lightweight SimpleReducer engine that uses delta-debugging-like reduction to minimize failing SQL inputs. It supports both exception-based and oracle-based reduction (NoRECOracle, TLPWhereOracle).
  3. Integration Hooks: Add minimal changes to the SQLancer core to support optional serialization of reducer context data (via CLI flags).
  4. Testing: Add a TestSimpleReducer.java class for validation, covering Exception, NoREC, and TLPWhere bugs.
  5. Documentation: Add usage guide with examples and more details in README.md.

@mrigger @KabilanMA Happy to hear any thoughts or suggestions!

@mrigger
Copy link
Contributor

mrigger commented Jun 30, 2025

Thanks, this looks like a great start, and it's great to see that this also works!

The information you save in the reducer context is quite low level (i.e., strings). Do you think we can save more high-level information? This would make the approach more robust, as the current parsing seems quite complex. We could also do this for the expected errors. Rather than checking for every statement whether an error matches any expected error of the system, we could store a detailed statement-to-expected-error mapping. When reading back the reduced statements, this would require doing some kind of best-effort alignment. For now, with the line granularity, this could be a 1:1 match per statement.

@jinhuix
Copy link
Contributor Author

jinhuix commented Jul 1, 2025

Thanks a lot for the feedback! I'll revise the implementation accordingly:

  1. Structured serialization: I'll split the reducer context into general base fields and oracle-specific ones. For example, NoRECReproducerData will contain optimizedQueryString, unoptimizedQueryString, and shouldUseAggregate.
  2. Direct reproducer info: The serializer will retrieve information directly from the Reproducer, rather than parsing query strings.
  3. Error mapping: For expected errors, I plan to switch to a structure like Map<String, String> statementToExpectedError. Since this would change how expectedErrors are checked, I’m considering re-implementing the relevant parts of SQLQueryAdapter inside the new reducer to keep SQLancer core untouched.

Before I proceed with the refactor, do you think this direction makes sense? @mrigger @KabilanMA

@mrigger
Copy link
Contributor

mrigger commented Jul 2, 2025

Structured serialization: I'll split the reducer context into general base fields and oracle-specific ones. For example, NoRECReproducerData will contain optimizedQueryString, unoptimizedQueryString, and shouldUseAggregate.

I am a bit unclear about that one. If we serialize the reproducer, I assume we will not need this information?

Direct reproducer info: The serializer will retrieve information directly from the Reproducer, rather than parsing query strings.

Sounds good.

Error mapping: For expected errors, I plan to switch to a structure like Map<String, String> statementToExpectedError.

I think this should probably be a Map<SQLQuery, ExpectedErrors> mapping.

@jinhuix
Copy link
Contributor Author

jinhuix commented Jul 4, 2025

If we serialize the reproducer, I assume we will not need this information?

My original concern was that Reproducer is an interface and may have different implementations for each oracle, which could make direct serialization difficult. So instead, we can create oracle-specific ReproducerData classes (e.g., NoRECReproducerData) that store only the necessary fields for reproduction.

I think this should probably be a Map<SQLQuery, ExpectedErrors> mapping.

Agreed — I'll change the structure accordingly.

Copy link

@KabilanMA KabilanMA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the PR changes to some extend and left a few comments, please make the changes accordingly, if needed.

}
}

private static boolean hasAggregateFunction(String query) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not scalable

if (statement.contains("the counts mismatch")) {
parseNoRECOracle(context, statement);
return;
} else if (statement.contains("result sets mismatch")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type of string comparison can make it difficult to maintain the source code when there are rapid changes; at the very least, perform a case-insensitive check. I think parsing the context based on a string is not a good idea. Is it possible to determine which Oracle parser to use from somewhere else and use it here?

Comment on lines 628 to 629
if (line.startsWith("-- Query: \"") && line.endsWith("\"")) {
String query = line.substring(11, line.length() - 1);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but it seems like, this cannot produce consistent results with changes to the other codebase. Still, I am not sure whether it is the right way to extract the string query.

@@ -0,0 +1,210 @@
package sqlancer;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the use of strings throughout the workflow is hanging by a thin thread. But, the string information or any other information can be converted to something consistent in this file to serialize. Then, when you parse, you don't have to depend on the string comparison again; instead, use the consistent values (e.g., enum) you used here. Also, when there is any change or it isn't working after some changes to other source files, you know this is the only file you have to make the modification.

@jinhuix
Copy link
Contributor Author

jinhuix commented Jul 17, 2025

Updated the PR:

  1. Made the data in ReducerContext.java more high-level (e.g., using enums).
  2. Removed the manual parsing logic from Main.java.
  3. Added one-to-one serialization of <SQLQuery, ExpectedErrors> mappings.

@@ -279,14 +279,14 @@ public SQLConnection createDatabase(PostgresGlobalState globalState) throws SQLE
}
Connection con = DriverManager.getConnection("jdbc:" + entryURL, username, password);
globalState.getState().logStatement(String.format("\\c %s;", entryDatabaseName));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't push these kinds of format changes into the PR.

Comment on lines +116 to +121
public boolean checkException(Exception e) throws AssertionError {
Throwable ex = e;

while (ex != null) {
if (expectedErrors.errorIsExpected(ex.getMessage())) {
return;
return true;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function's logic seems to always return inside the while loop, or run infinitely inside the while loop (theoretically), or throw an AssertionError. I don't understand the logic behind changing it to return boolean. After the modification, it returns true or throws an exception.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integrate C-Reduce for Automatic Test Case Reduction
3 participants