refactor: replace bs4 and perf optimizations #927

JuroOravec · 2025-01-23T13:00:12Z

Part of #14

Changes:

Removed dependency on BeautifulSoup4.
Added custom HTML parser implementation.
Allow writing self-closing HTML elements within component templates (e.g. <div />)
- Plus added a documentation section on that
A couple of optimization refactors (see individual comments)

The plan of action is:

Add the pure Python implementation in this PR
Create the repo for the Rust impl and make a PR for that.
Lastly, add the Rust impl as a Python dependency for django-components, plus add a setting that lets users fall back onto the pure Python implementation if, for whatever reason, they cannot use the Rust implementation. (as suggested here)

JuroOravec · 2025-01-23T14:44:31Z

src/django_components/component.py

@@ -1208,6 +1247,14 @@ def _validate_outputs(self, data: Any) -> None:
        validate_typed_dict(data, data_type, f"Component '{self.name}'", "data")


+# Perf


The section below relates to the second point in #910 (comment):

Avoid creating duplicate ComponentNode subclasses, as described in #910 (comment)

JuroOravec · 2025-01-23T14:49:39Z

src/django_components/component.py

        # After rendering is done, remove the current state from the stack, which means
        # properties like `self.context` will no longer return the current state.
        self._render_stack.pop()
        context.render_context.pop()

-        return output
+        # Internal component HTML post-processing:


This is part of the change to make it possible to parse the component HTML only once. It's achieved by deferring the postprocessing of the component output until its parent has been post-processed.

That way, by the time we get to post-processing the child component, we know whether any extra HTML attributes, like data-djc-id-a1b2c3, spilled over from the parent onto the child.

For context see "Idea 2 - Render components top-down instead of bottom-up"
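To illustrate the deferral described above, here is a minimal sketch. It is not the actual django-components code: `render_child`, `render_parent`, and the naive string-based attribute insertion are hypothetical stand-ins. The only point is that a child returns a callback instead of final HTML, and the parent invokes that callback once it knows which attributes spill over.

```python
from typing import Callable, List, Optional

# Hypothetical type: a deferred post-processor receives the attributes
# inherited from the parent and returns the child's final HTML.
PostProcessor = Callable[[Optional[List[str]]], str]


def render_child(component_id: str, html: str) -> PostProcessor:
    """Render a child, but defer adding attributes until the parent calls back."""

    def post_processor(root_attributes: Optional[List[str]] = None) -> str:
        attrs = [f"data-djc-id-{component_id}"] + list(root_attributes or [])
        # Naive insertion into the first opening tag. This is an assumption
        # made for brevity; a real implementation would use an HTML parser.
        head, sep, tail = html.partition(">")
        return f"{head} {' '.join(attrs)}{sep}{tail}"

    return post_processor


def render_parent(component_id: str, child: PostProcessor) -> str:
    """Post-process the parent first, then the child with the spilled-over attribute."""
    parent_attr = f"data-djc-id-{component_id}"
    # In this sketch the child is the parent's root element, so the
    # parent's attribute spills over onto it.
    return child([parent_attr])


deferred = render_child("c2", "<span>hello</span>")
final_html = render_parent("a1b2c3", deferred)
print(final_html)
```

Because the child's HTML is only touched inside the callback, it is modified once, with all attributes known up front.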

JuroOravec · 2025-01-23T14:52:22Z

src/django_components/component.py

+        def post_processor(root_attributes: Optional[List[str]] = None) -> Tuple[str, Dict[str, List[str]]]:
+            nonlocal html_content
+
+            updated_html, child_components = set_component_attrs_for_js_and_css(


Previously set_component_attrs_for_js_and_css was named _link_dependencies_with_component_html and it was inside postprocess_component_html. I've split the two for simplicity.

JuroOravec · 2025-01-23T15:05:55Z

src/django_components/util/html_parser.py

@@ -0,0 +1,1147 @@
+"""


This is the pure Python implementation of the HTML parser. This is the same implementation as what I've shared at the end of this comment.

This file has 3 important parts:

_parse_html() is the low-level parser that is given the HTML.

To make it easy to make modifications to the HTML, and in a single parse, _parse_html accepts a callback that's called for each HTML element it encounters.

When the callback is called, it receives an instance of HTMLTag. HTMLTag defines an API for modifying the HTML element at hand, e.g. by adding / removing attributes, adding / removing content, etc.

Example

# Add attribute `data-djc-id-123` to ALL tags
def on_tag(tag: HTMLTag, stack: List[HTMLTag]):
    # Modify the tag passed to the callback, not the callback itself
    tag.add_attr("data-djc-id-123", None, False)

updated_html = _parse_html(html, on_tag)

Lastly, at the very end, there's set_html_attributes(). This is the interface that's shared by this and the Rust implementation.
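For readers who want to try the single-pass, callback-per-tag idea without the PR's parser, here is a rough sketch built on the standard library's `html.parser.HTMLParser`. It is only an approximation of the approach, not the `_parse_html` / `HTMLTag` API from this PR: the class name `AttrAddingParser` and the callback signature are assumptions, and the output is not guaranteed to be byte-identical to the input (for example, tag names are lowercased and attribute values are re-quoted).

```python
from html import escape
from html.parser import HTMLParser
from typing import Callable, List, Optional, Tuple

Attrs = List[Tuple[str, Optional[str]]]


class AttrAddingParser(HTMLParser):
    """Re-emit HTML token by token, letting a callback edit each start tag."""

    def __init__(self, on_tag: Callable[[str, Attrs], None]) -> None:
        # Keep character references as-is so they can be re-emitted unchanged
        super().__init__(convert_charrefs=False)
        self.on_tag = on_tag
        self.tokens: List[str] = []

    def _emit_tag(self, tag: str, attrs: Attrs, closing: str) -> None:
        attrs = list(attrs)
        self.on_tag(tag, attrs)  # The callback may mutate `attrs` in place
        parts = [tag]
        for key, value in attrs:
            parts.append(key if value is None else f'{key}="{escape(value)}"')
        self.tokens.append(f"<{' '.join(parts)}{closing}")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self.tokens.append(data)

    def handle_entityref(self, name):
        self.tokens.append(f"&{name};")

    def handle_charref(self, name):
        self.tokens.append(f"&#{name};")

    def handle_comment(self, data):
        self.tokens.append(f"<!--{data}-->")


def add_id(tag: str, attrs: Attrs) -> None:
    # Add a valueless attribute to ALL tags
    attrs.append(("data-djc-id-123", None))


parser = AttrAddingParser(add_id)
parser.feed('<div class="a"><input /> hi &amp; bye</div>')
parser.close()
updated_html = "".join(parser.tokens)  # Join all tokens at the end
print(updated_html)
```

As with the parser in this PR, no tree is built: each token is appended to a list as it is seen, and the list is joined once at the end.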

JuroOravec · 2025-01-23T15:17:03Z

src/django_components/util/html_parser.py

+    text: str,
+    on_tag: Callable[[HTMLTag, List[HTMLTag]], None],
+    *,
+    expand_shorthand_tags: bool,


Ah, one more thing on the HTML parser. Since I initially implemented it for the use case of converting Vue syntax to Django, I also added support for expanding self-closing HTML elements.

In other words, in standard HTML, this is invalid:

<div />

Because <div> is NOT a void element like <input />, <img />, etc.

However, what Vue does is that when it sees a "self-closing" HTML element like that, which is NOT a void element, it expands it into

<div></div>

For Vue and React I believe it's about normalizing the behaviour of components embedded inside the template, and the rest of HTML elements.

E.g. if one can do this in Vue / React:

<MyComp />

Then for convenience, they allow the same syntax also for standard HTML elements:

<div />

So now we'd also allow this syntax for component templates.
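As a rough illustration of the expansion, here is a regex-based sketch. It is not how the parser in this PR does it: the function name `expand_self_closing` is hypothetical, and the pattern assumes attribute values never contain `<` or `>`.

```python
import re

# Void elements per the HTML standard; these never take a closing tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}

# Matches e.g. `<div />` or `<div class="a"/>`.
# Simplification: assumes attribute values contain no `<` or `>`.
SELF_CLOSING = re.compile(r"<([A-Za-z][\w-]*)([^<>]*?)\s*/>")


def expand_self_closing(html: str) -> str:
    """Expand `<div />` into `<div></div>`, leaving void elements untouched."""

    def replace(match: "re.Match[str]") -> str:
        tag, attrs = match.group(1), match.group(2)
        if tag.lower() in VOID_ELEMENTS:
            return match.group(0)
        return f"<{tag}{attrs}></{tag}>"

    return SELF_CLOSING.sub(replace, html)


expanded = expand_self_closing('<div /><input /><span class="x"/>')
print(expanded)
```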

This is strange to me, but I don't quite see the harm other than code complexity. If this is common in Vue I'm fine with it.

JuroOravec · 2025-01-23T15:19:56Z

src/django_components/util/html.py

@@ -1,111 +0,0 @@
-from abc import ABC, abstractmethod


Removed the html.py file that defined the API through which we used BeautifulSoup4.

The difference is that the new html_parser.py processes HTML in a stream-like fashion. On the other hand BS4 first built the whole DOM / tree, and then allowed one to freely modify any parts of it.

JuroOravec · 2025-01-23T15:24:38Z

tests/test_html_parser.py

+
+# This same set of tests is also found in djc_html_parser, to ensure that
+# this implementation can be replaced with the djc_html_parser's Rust-based implementation
+class TestHTMLParser(TestCase):


This file contains 2 kinds of tests

This TestHTMLParser is the contract that we expect from the HTML parser. So even if we switch to the Rust implementation, we will expect these tests to pass.

On the other hand, the tests defined in TestHTMLParserInternal relate only to the pure-python implementation of the HTML parser.

Once I get to the PR that adds the Rust implementation, I might split this test file, so the pure python impl is separate from the expected "HTMLParser" API.

Going one step further, it might even make sense for the pure-python implementation to live as a separate package too, so one wouldn't have to load that if not needing it. But also there's no harm in the pure-python impl being here.

If we don't expect any problems with being able to install the rust version, I'm not sure why we should maintain two versions. Are we expecting problems?

I don't expect problems, supposedly both ruff and pydantic use maturin, so it sounds solid.

Keeping two versions was suggested / requested here.

CC @dalito What would be the reason for keeping the Python version around?

What I was thinking of:

By looking at the python code one could understand what the hidden binary does.

You could plugin yet another solution easily.

If you don't want to have compiled code in your code base you can still use djc.

But keeping the code base small and less complex is a very strong counter argument. So doing that is absolutely fine by me. My comment was purely based on discussions that I observed in the past.

Good arguments, dalito. These are my thoughts on those three points:

I'm hoping that by keeping the other library close, you can easily go there and look at the code to understand the Rust code if you wanted.

Since the above code defines an interface and has a good set of tests, I think that might be enough to understand enough about the project to build another plugin. I don't think we need two implementations to reach that goal.

I think this is valid, but I don't understand how wide-spread this need is. A huge part of the Python ecosystem is based on python wrappers of C code, so I'm not sure this is even viable for a Django project. You are going to use a database for your website, and those python database drivers are likely C backed?

My hunch is that the maintenance overhead for this is larger than it's worth, but I'm happy to hear your thoughts.

This doesn't mean this PR can't be merged, but maybe that we'll remove this code when we add the Rust-based parser?

Yeah, this discussion is not blocking this PR, I've already merged it.

I definitely wanted to have first merged the pure-python implementation, and then replace it with the Rust implementation in a separate commit, so we could come back to the commit with Python impl if ever necessary.

My argument for keeping Python impl was that, theoretically, there could be some niche OS platforms which might not be supported by the maturin build tool. But as I saw that both ruff and pydantic are using maturin, then I think it's reasonable to assume that if they are ok with the spectrum of platforms supported by maturin, then so can we be.

So actually I'd go with removing the Python impl totally (once I'm adding the Rust impl).

I don't think there will be a need to plug in a different solution. Or at least not here: there will be support for a proper plugin system, but plugins will be separate from this HTML parser. That way, this HTML parser can remain an implementation detail.

Also, I don't expect that any more tweaks will need to be made to the implementation any time soon.

I imagine we might want to revisit it if / when we decide that component's rendered HTML MUST be a valid HTML fragment (as far as I'm aware, we don't enforce that yet). At that point we could use this parser to report on errors with the HTML syntax (e.g. missing closing tag). And we would maybe need to modify the parser to track line and index to report the position of the error.

Sounds like a plan!

JuroOravec · 2025-01-23T15:33:04Z

src/django_components/dependencies.py

@@ -734,6 +761,10 @@ def get_component_media(comp_cls_hash: str) -> Media:
    return (content, final_script_tags.encode("utf-8"), final_css_tags.encode("utf-8"))


+href_pattern = re.compile(r'href="([^"]+)"')


The rest of the changes in this file is about removing dependency on BS4. For example here, to extract the URL from <script> or <link>, we'll use a regex instead of relying on BS4.
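A small sketch of how such a regex can pull the URL out of a rendered tag. Only `href_pattern` comes from the diff above; `src_pattern` and the `extract_url` helper are assumptions added for illustration, since `<script>` tags carry their URL in `src`.

```python
import re
from typing import Optional

href_pattern = re.compile(r'href="([^"]+)"')  # From the diff above
src_pattern = re.compile(r'src="([^"]+)"')  # Assumed counterpart for <script>


def extract_url(tag_html: str) -> Optional[str]:
    """Return the URL of a <link> or <script> tag, or None if there is none."""
    match = href_pattern.search(tag_html) or src_pattern.search(tag_html)
    return match.group(1) if match else None


css_url = extract_url('<link href="/static/a.css" rel="stylesheet">')
js_url = extract_url('<script src="/static/a.js"></script>')
print(css_url, js_url)
```

Note that this only handles double-quoted attribute values, which is a reasonable restriction when the tags are generated by the library itself.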

JuroOravec · 2025-01-23T15:37:35Z

src/django_components/dependencies.py


-    if not elems:
+    We find these tags by looking for the first `</head>` and last `</body>` tags.


And this section is about inserting JS and CSS into their default locations.

Now, to avoid having to parse the HTML to find the end of body and head, we use the following approach:

We find the end of <body> by searching for last occurrence of </body>

And for the end of <head> we search for the first occurrence of </head>
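The two lookups above can be sketched with plain string searches. This is a simplified illustration, not the code in dependencies.py; the function name `insert_dependencies` and its parameters are hypothetical, and the search here is case-sensitive.

```python
def insert_dependencies(html: str, css_tags: str, js_tags: str) -> str:
    """Insert CSS before the first `</head>` and JS before the last `</body>`."""
    # Insert JS first. It goes later in the document, so doing it first
    # does not shift the position of `</head>`.
    body_end = html.rfind("</body>")  # Last occurrence
    if body_end != -1:
        html = html[:body_end] + js_tags + html[body_end:]

    head_end = html.find("</head>")  # First occurrence
    if head_end != -1:
        html = html[:head_end] + css_tags + html[head_end:]

    return html


page = "<html><head><title>t</title></head><body><p>hi</p></body></html>"
result = insert_dependencies(page, '<link href="a.css">', '<script src="a.js"></script>')
print(result)
```

If either closing tag is missing, the sketch leaves that part of the document unchanged.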

JuroOravec · 2025-01-23T15:38:30Z

src/django_components/dependencies.py

-def _link_dependencies_with_component_html(
-    component_id: str,
-    html_content: str,
+def set_component_attrs_for_js_and_css(


Renamed _link_dependencies_with_component_html. And instead of using BS4 to insert the HTML attributes, it now uses our custom HTML parser to do so.

JuroOravec · 2025-01-23T15:42:51Z

src/django_components/perfutil/component.py

@@ -0,0 +1,169 @@
+import re


In this file I implemented the top-down approach to rendering nested components, as described in Idea 2 - Render components top-down instead of bottom-up.

I've put this into a separate file so I could properly document and explain what's going on, and to keep the component.py file relatively navigable, since it's already quite big.

Please see the comments throughout this file for details.

EmilStenstrom

A couple of small nitpicks you can decide to ignore if you want, but I think this looks really good. Happy that you could solve this super-tricky problem so quickly.

EmilStenstrom · 2025-01-23T22:10:49Z

pyproject.toml

@@ -29,7 +29,6 @@ classifiers = [
 ]
 dependencies = [
    'Django>=4.2',
-    'beautifulsoup4>=4.12',


Back to no dependencies! 🥳

EmilStenstrom · 2025-01-23T22:24:46Z

src/django_components/util/html_parser.py

+    return "".join(tokens)  # Join all tokens at the end
+
+
+def set_html_attributes(


This is a super-custom HTML parser, so I see what you mean now. Looks good!

JuroOravec added 3 commits January 23, 2025 13:59

refactor: replace bs4 and perf optimizations

3a32359

refactor: fix linter errors

41b7615

refactor: fix tests

e753130

docs: add section on extra HTML behaviour + changelog

7080c5e

JuroOravec marked this pull request as ready for review January 23, 2025 17:54

JuroOravec requested a review from EmilStenstrom January 23, 2025 18:02

EmilStenstrom approved these changes Jan 23, 2025

JuroOravec added 2 commits January 24, 2025 10:21

refactor: revert expanding non-void self-closing HTML tags

e73afe0

refactor: update whitespace in expression tests

e756713

JuroOravec merged commit 0b65761 into django-components:master Jan 24, 2025
14 checks passed

JuroOravec deleted the jo-perf-and-remove-bs4 branch January 24, 2025 09:30
