Skip to content

Expand parse without semicolons #1949

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

aharpervc
Copy link
Contributor

This PR is a followup (ref) to recent work on parsing without requiring semicolon statement delimiters. It's the same content as my former PR which was stale after #1937.

This PR is rebased on latest master and brings forward the increased test coverage and additional parsing fixes from the previous branch.

Particularly, this PR expands support as follows:

  1. Added a supports_statements_without_semicolon_delimiter dialect trait function (default false, but true for SQL Server). Now we can set the default parser option based on the dialect
  2. Tightened up parsing logic, such as for when something is an alias or a keyword and comma delimited lists. When you write T-SQL without semicolons, you're going to have ambiguities, but being more selective here in the parser seems reasonable
  3. More test coverage & expanded test helper functions: statements_without_semicolons_parse_to which deletes semicolons from the string, then parses it (simplifies testing parsing with & without), all_dialects_requiring_semicolon_statement_delimiter & all_dialects_not_requiring_semicolon_statement_delimiter to find configured dialects, and assert_err_parse_statements to simplify asserting "end of statement" vs "an SQL statement" errors. A lot of test assertions that fail when requiring semicolons also fail when not requiring semicolons, but what precisely the parser will complain about could be different (again, if you write SQL this way, you accept that possibility). The helper functions enable running existing tests on dialects that don't require semicolons with minimal changes. The main usage there is in the "common" tests.

@aharpervc aharpervc force-pushed the expand-parse-without-semicolons branch 2 times, most recently from 069ce0e to 9be2dad Compare July 16, 2025 19:15
@aharpervc
Copy link
Contributor Author

@alamb fyi, as previously discussed

@aharpervc aharpervc marked this pull request as ready for review July 16, 2025 19:17
- a corresponding `supports_statements_without_semicolon_delimiter` Dialect trait function
- this is optional for SQL Server, so it's set to `true` for that dialect
- for the implementation, `RETURN` parsing needs to be tightened up to avoid ambiguity & tests that formerly asserted "end of statement" now maybe need to assert "an SQL statement"
- a new `assert_err_parse_statements` splits the dialects based on semicolon requirements & asserts the expected error message accordingly
- at least all of select/insert/update/delete (plus exec) can be added
@aharpervc aharpervc force-pushed the expand-parse-without-semicolons branch from 9be2dad to 53d7930 Compare July 21, 2025 16:52
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aharpervc -- sorry for the delay in reviewing. I left some comments. Let me know what you think

@@ -123,6 +123,10 @@ impl Dialect for MsSqlDialect {
true
}

fn supports_statements_without_semicolon_delimiter(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 MS SQL!!!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷‍♂️🤣

@@ -1136,6 +1142,11 @@ pub trait Dialect: Debug + Any {
fn supports_notnull_operator(&self) -> bool {
false
}

/// Returns true if the dialect supports parsing statements without a semicolon delimiter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to give an example here?

Something like

Suggested change
/// Returns true if the dialect supports parsing statements without a semicolon delimiter.
/// Returns true if the dialect supports parsing statements without a semicolon delimiter.
///
/// If returns true, the following SQL will not parse. If returns `false` the SQL will parse
///
/// ```sql
/// SELECT 1
/// SELECT 2
/// ```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍

@@ -266,6 +266,22 @@ impl ParserOptions {
self.unescape = unescape;
self
}

/// Set if semicolon statement delimiters are required.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the usecase to have the same configuration information on the ParserOptions and the Dialect?

It seems there are a few places in the Parser that inconsistently check the options and/or the dialect flag. If we had only one way to specify the option, there would not be any room for inconsistencies

Or maybe I misunderstand what is going on here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the usecase to have the same configuration information on the ParserOptions and the Dialect?

It's similar to the existing with_trailing_commas logic, and has several practical use cases:

  1. can be set to true for SQL Server & left false for all others (until if/when someone discovers this is supported in other dialects)
  2. integration with existing testing patterns. eg, we can have all_dialects_requiring_semicolon_statement_delimiter & all_dialects_not_requiring_semicolon_statement_delimiter
  3. used to set the corresponding default parser option

It seems like testing parsing without semicolons would get a lot worse if it wasn't a dialect option

@@ -4541,6 +4561,18 @@ impl<'a> Parser<'a> {
return Ok(vec![]);
}

if end_token == Token::SemiColon
&& self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we also have to check the parser options here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be that we should only be checking the parser options. I'll investigate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to parser options 👍

| Expr::Cast { .. }
| Expr::Convert { .. }
| Expr::Subquery(_) => Ok(expr),
// todo: how to retstrict to variables?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems more like a semantic check -- this crate is focused on syntax parsing.

Read more hre:

https://github.com/apache/datafusion-sqlparser-rs?tab=readme-ov-file#syntax-vs-semantics

I think the error is inappropriate in this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the error is inappropriate in this case

What do you suggest?

This seems more like a semantic check -- this crate is focused on syntax parsing.

The difficulty here is that without semicolon tokens the return statement is ambiguous. Consider the following perfectly valid T-SQL statement (which is also a test case example) which has several variations of return:

CREATE OR ALTER PROCEDURE example_sp
AS
    IF USER_NAME() = 'X'
        RETURN
    
    IF 1 = 2
        RETURN (SELECT 1)

    RETURN CONVERT(INT, 123)

If you revert this change and run that test, you get this error:

thread 'test_supports_statements_without_semicolon_delimiter' panicked at tests/sqlparser_mssql.rs:2533:14:
called `Result::unwrap()` on an `Err` value: ParserError("Expected: an SQL statement, found: 1 at Line: 1, Column: 72")

The reason is that for the first return, the self.maybe_parse(|p| p.parse_expr())? "successfully" parses/consumes the IF keyword. However, this is contrary to the keyword documentation for SQL Server (ref) which requires an "integer expression".

I think I had said when introducing parse_return in a previous PR that we'd have to come back and tighten it up eventually, but I can't find that discussion 😐.

.replace(" ;", " ")
.replace(";\n", "\n")
.replace("\n;", "\n")
.replace(";", " ");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't this last statement (replace ";" with " " handle all the above? Or is the idea that newlines are important ?

Can you please add comments explaining the rationale for these substitutions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm preserving the newline/whitespace around the statement delimiter because I'm trying to avoid being opinionated on whether they're meaningful or not in this helper. i.e., the helper should only strip semicolons and should avoid other transformations. I can add a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. At some point I was probably testing this in conjunction with #1809 and might have gotten it from there. Regardless, I'll simplify it here per your suggestion.

@@ -272,20 +275,39 @@ fn parse_insert_default_values() {
"INSERT INTO test_table DEFAULT VALUES (some_column)";
assert_eq!(
ParserError::ParserError("Expected: end of statement, found: (".to_string()),
parse_sql_statements(insert_with_default_values_and_hive_after_columns).unwrap_err()
all_dialects_requiring_semicolon_statement_delimiter()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this query doesn't seem to have multiple statements. Why does the test need to be changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turning on the option to make semicolons optional results in this test failures:

thread 'parse_insert_default_values' panicked at src/test_utils.rs:101:21:
assertion `left == right` failed: Parse results with PostgreSqlDialect are different from MsSqlDialect
  left: Err(ParserError("Expected: end of statement, found: ("))
 right: Err(ParserError("Expected: SELECT, VALUES, or a subquery in the query body, found: some_column"))

Since that's a parser error in both cases (but different message) I split the assertion out into two -- one for semicolons required that asserts the former error, and one for semicolons optional that asserts the other error.

Comment on lines 488 to 511
// end of statement
Token::Word(word) => {
if expecting_statement_delimiter && word.keyword == Keyword::END {
break;
// don't expect a semicolon statement delimiter after a newline when not otherwise required
Token::Whitespace(Whitespace::Newline) => {
if !self.options.require_semicolon_stmt_delimiter {
expecting_statement_delimiter = false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a merge mistake. I will investigate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't recall why I removed that code & tests all pass when I restore it, so it is now restored

- the dialect's original support informs the parser option, but the parser behavior itself should just check it's own options
@aharpervc aharpervc force-pushed the expand-parse-without-semicolons branch from 9a52518 to 4bde5b4 Compare July 21, 2025 22:00
@aharpervc aharpervc requested a review from alamb July 22, 2025 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants