Skip to content

gh-135661: Fix CDATA section parsing in HTMLParser #135665

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions Doc/library/html.parser.rst
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,17 @@ The output will then be:
attributes can be preserved, etc.).


.. method:: HTMLParser.support_cdata(flag)

Sets how the parser will parse CDATA declarations.
If *flag* is true, then the :meth:`unknown_decl` method will be called
for the CDATA section ``<![CDATA[...]]>``.
If *flag* is false, then the :meth:`handle_comment` method will be called
for ``<![CDATA[...>``.
Comment on lines +129 to +130
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should mention the default behaviour.

Suggested change
If *flag* is false, then the :meth:`handle_comment` method will be called
for ``<![CDATA[...>``.
If *flag* is false, or if :meth:`!support_cdata` has not been called yet,
then the :meth:`handle_comment` method will be called for ``<![CDATA[...>``.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a weak point of such approach. It should be true by default to be able to parse valid HTML (when <![CDATA[...]]> is only used in foreign content) by default. But secure parsing needs to set it to false at the beginning and the set to true or false after every open or close tag, depending on the complex algorithm.

So we should set it to true by default if we keep this approach.


.. versionadded:: 3.13.6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should mention the previous behaviour.

Suggested change
.. versionadded:: 3.13.6
.. versionadded:: 3.13.6
Previously, :meth:`unknown_decl` was called for ``<![CDATA[...>``.



The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
Expand Down
18 changes: 16 additions & 2 deletions Lib/html/parser.py
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,7 @@ def reset(self):
self.lasttag = '???'
self.interesting = interesting_normal
self.cdata_elem = None
self._support_cdata = True
self._escapable = True
super().reset()

Expand Down Expand Up @@ -183,6 +184,9 @@ def clear_cdata_mode(self):
self.cdata_elem = None
self._escapable = True

def support_cdata(self, flag=True):
self._support_cdata = flag
Comment on lines +187 to +188
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this is the best way to handle the issue.
Since this solves a security issue, it also needs to be added to a bug fix release and backported to several other releases. Making the method private should be enough to solve the security issue in the short term without adding to the public API. This will also give us time to think about alternative (and possibly better) solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Private means that users shouldn't call it. Given that nothing in html calls it, making it private is the same as not adding it at all.

It looks like the issue can't be solved in CPython alone -- users need to adapt their code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be solved in CPython, and I hope we will, but it looks like the solution will be too complex to risk backporting it to security-only branches. Providing a method to alter behavior we pass the ball to user side. This is not good, but otherwise the problem will left unsolved. And without closing this hole, all other security fixes for HTMLParser are worthless.

I will try other approach, but can't guarantee anything.


# Internal -- handle data as far as reasonable. May leave state
# and data to be processed by a subsequent call. If 'end' is
# true, force handling all data as if followed by EOF marker.
Expand Down Expand Up @@ -258,7 +262,10 @@ def goahead(self, end):
break
self.handle_comment(rawdata[i+4:j])
elif startswith("<![CDATA[", i):
self.unknown_decl(rawdata[i+3:])
if self._support_cdata:
self.unknown_decl(rawdata[i+3:])
else:
self.handle_comment(rawdata[i+1:])
elif rawdata[i:i+9].lower() == '<!doctype':
self.handle_decl(rawdata[i+2:])
elif startswith("<!", i):
Expand Down Expand Up @@ -334,7 +341,14 @@ def parse_html_declaration(self, i):
# this case is actually already handled in goahead()
return self.parse_comment(i)
elif rawdata[i:i+9] == '<![CDATA[':
return self.parse_marked_section(i)
if self._support_cdata:
j = rawdata.find(']]>', i+9)
if j < 0:
return -1
self.unknown_decl(rawdata[i+3: j])
return j + 3
else:
return self.parse_bogus_comment(i)
elif rawdata[i:i+9].lower() == '<!doctype':
# find the closing >
gtpos = rawdata.find('>', i+9)
Expand Down
86 changes: 63 additions & 23 deletions Lib/test/test_htmlparser.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,10 +10,13 @@

class EventCollector(html.parser.HTMLParser):

def __init__(self, *args, **kw):
def __init__(self, *args, autocdata=False, **kw):
self.autocdata = autocdata
self.events = []
self.append = self.events.append
html.parser.HTMLParser.__init__(self, *args, **kw)
if autocdata:
self.support_cdata(False)

def get_events(self):
# Normalize the list of events so that buffer artefacts don't
Expand All @@ -34,12 +37,16 @@ def get_events(self):

def handle_starttag(self, tag, attrs):
self.append(("starttag", tag, attrs))
if self.autocdata and tag == 'svg':
self.support_cdata(True)

def handle_startendtag(self, tag, attrs):
self.append(("startendtag", tag, attrs))

def handle_endtag(self, tag):
self.append(("endtag", tag))
if self.autocdata and tag == 'svg':
self.support_cdata(False)

# all other markup

Expand Down Expand Up @@ -767,10 +774,6 @@ def test_eof_in_declarations(self):
('<!', [('comment', '')]),
('<!-', [('comment', '-')]),
('<![', [('comment', '[')]),
('<![CDATA[', [('unknown decl', 'CDATA[')]),
('<![CDATA[x', [('unknown decl', 'CDATA[x')]),
('<![CDATA[x]', [('unknown decl', 'CDATA[x]')]),
('<![CDATA[x]]', [('unknown decl', 'CDATA[x]]')]),
('<!DOCTYPE', [('decl', 'DOCTYPE')]),
('<!DOCTYPE ', [('decl', 'DOCTYPE ')]),
('<!DOCTYPE html', [('decl', 'DOCTYPE html')]),
Expand All @@ -783,6 +786,18 @@ def test_eof_in_declarations(self):
for html, expected in data:
self._run_check(html, expected)

@support.subTests('content', ['', 'x', 'x]', 'x]]'])
def test_eof_in_cdata(self, content):
self._run_check('<![CDATA[' + content,
[('unknown decl', 'CDATA[' + content)])
self._run_check('<![CDATA[' + content,
[('comment', '![CDATA[' + content)],
collector=EventCollector(autocdata=True))
self._run_check('<svg><text y="100"><![CDATA[' + content,
[('starttag', 'svg', []),
('starttag', 'text', [('y', '100')]),
('unknown decl', 'CDATA[' + content)])

def test_bogus_comments(self):
html = ('<!ELEMENT br EMPTY>'
'<! not really a comment >'
Expand Down Expand Up @@ -845,28 +860,53 @@ def test_broken_condcoms(self):
]
self._run_check(html, expected)

def test_cdata_declarations(self):
# More tests should be added. See also "8.2.4.42. Markup
# declaration open state", "8.2.4.69. CDATA section state",
# and issue 32876
html = ('<![CDATA[just some plain text]]>')
expected = [('unknown decl', 'CDATA[just some plain text')]
@support.subTests('content', [
'just some plain text',
'<!-- not a comment -->',
'&not-an-entity-ref;',
"<not a='start tag'>",
'',
'[[I have many brackets]]',
'I have a > in the middle',
'I have a ]] in the middle',
'] ]>',
']] >',
('\n'
' if (a < b && a > b) {\n'
' printf("[<marquee>How?</marquee>]");\n'
' }\n'),
])
def test_cdata_section_content(self, content):
# See "13.2.5.42 Markup declaration open state",
# "13.2.5.69 CDATA section state", and issue bpo-32876.
html = f'<svg><text y="100"><![CDATA[{content}]]></text></svg>'
expected = [
('starttag', 'svg', []),
('starttag', 'text', [('y', '100')]),
('unknown decl', 'CDATA[' + content),
('endtag', 'text'),
('endtag', 'svg'),
]
self._run_check(html, expected)
self._run_check(html, expected, collector=EventCollector(autocdata=True))

def test_cdata_declarations_multiline(self):
html = ('<code><![CDATA['
' if (a < b && a > b) {'
' printf("[<marquee>How?</marquee>]");'
' }'
']]></code>')
def test_cdata_section(self):
# See "13.2.5.42 Markup declaration open state".
html = ('<![CDATA[foo<br>bar]]>'
'<svg><text y="100"><![CDATA[foo<br>bar]]></text></svg>'
'<![CDATA[foo<br>bar]]>')
expected = [
('starttag', 'code', []),
('unknown decl',
'CDATA[ if (a < b && a > b) { '
'printf("[<marquee>How?</marquee>]"); }'),
('endtag', 'code')
('comment', '[CDATA[foo<br'),
('data', 'bar]]>'),
('starttag', 'svg', []),
('starttag', 'text', [('y', '100')]),
('unknown decl', 'CDATA[foo<br>bar'),
('endtag', 'text'),
('endtag', 'svg'),
('comment', '[CDATA[foo<br'),
('data', 'bar]]>'),
]
self._run_check(html, expected)
self._run_check(html, expected, collector=EventCollector(autocdata=True))

def test_convert_charrefs_dropped_text(self):
# #23144: make sure that all the events are triggered when
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
Comment on lines +1 to +2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
Fix CDATA section parsing in :class:`html.parser.HTMLParser` according to
the HTML5 standard: ``] ]>`` and ``]] >`` no longer end the CDATA section.
By default, :meth:`~HTMLParser.handle_comment` is called for CDATA.
The old behavior (calling :meth:`~HTMLParser.unknown_decl`) can be restored
using a new method, :meth:`~HTMLParser.support_cdata`.

Loading