Skip to content

refactor: replace bs4 and perf optimizations #927

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

JuroOravec
Copy link
Collaborator

@JuroOravec JuroOravec commented Jan 23, 2025

Part of #14

Changes:

  • Removed dependency on BeautifulSoup4.
  • Added custom HTML parser implementation.
  • Allow to write self-closing HTML elements within component template (e.g. <div />
    • Plus added documentation section on that
  • Couple of optimization refactors (see individual comments)

The plan of action is:

  1. Add the pure Python implementation in this PR
  2. Create the repo for the Rust impl and make a PR for that.
  3. Lastly, add the Rust impl as a Python dependency for django-components, plus add setting to allow users to fall back onto the pure Python implementation, if, for whatever reason, they cannot use the Rust implementation. (as suggested here)

@@ -1208,6 +1247,14 @@ def _validate_outputs(self, data: Any) -> None:
validate_typed_dict(data, data_type, f"Component '{self.name}'", "data")


# Perf
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section below relates to second point in #910 (comment):

Avoid duplicitly creating ComponentNode subclasses, as described in #910 (comment)

# After rendering is done, remove the current state from the stack, which means
# properties like `self.context` will no longer return the current state.
self._render_stack.pop()
context.render_context.pop()

return output
# Internal component HTML post-processing:
Copy link
Collaborator Author

@JuroOravec JuroOravec Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the change to make it possible to parse the component HTML only once. It's achieved by deferring the postprocessing of the component output until its parent has been post-processed.

That way, by the time we get to post-processing the child component, we know if there are any extra HTML attributes like data-djc-id-a1b2c3, that spilled over from the parent onto the child.

For context see "Idea 2 - Render components top-down instead of bottom-up"

def post_processor(root_attributes: Optional[List[str]] = None) -> Tuple[str, Dict[str, List[str]]]:
nonlocal html_content

updated_html, child_components = set_component_attrs_for_js_and_css(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously set_component_attrs_for_js_and_css was named _link_dependencies_with_component_html and it was inside postprocess_component_html. I've split the two for simplicity.

@@ -0,0 +1,1147 @@
"""
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pure Python implementation of the of the HTML parser. This is the same implementation as what I've shared at the end of this comment.

This file has 3 important parts:

  1. _parse_html() is the low-level parser that is given the HTML.

  2. To make it easy to make modifications to the HTML, and in a single parse, _parse_html accepts a callback that's called for each HTML element it encounters.

    When the callback is called, the callback receives an instance of HTMLTag. HTMLTag defines API for modifying the HTML element at hand - e.g. by adding / removing attributes, adding / removing content, etc.

    Example

    # Add attribute `data-djc-id-123` to ALL tags
    def on_tag(tag: HTMLTag, stack: List[HTMLTag]):
        on_tag.add_attr("data-djc-id-123", None, False)
    
    updated_html = _parse_html(html, on_tag)
  3. Lastly at the very end, there's set_html_attributes(). This is the interface that's shared by this and Rust implementation.

text: str,
on_tag: Callable[[HTMLTag, List[HTMLTag]], None],
*,
expand_shorthand_tags: bool,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, one more on the HTML parser - So I initially implemented it for the use case of converting Vue syntax to Django, I also added the support for expanding self-closing HTML attributes.

In other words, in standard HTML, this is invalid:

<div />

Because <div> is NOT a void element like <input />, <img />, etc.

However, what Vue does is that when it sees a "self-closing" HTML element like that, which is NOT a void element, it expands it into

<div></div>

For Vue and React I believe it's about normalizing the behaviour of components embedded inside the template, and the rest of HTML elements.

E.g. if one can do this in Vue / React:

<MyComp />

Then for convenence, they allow the same syntax also for standard HTML elements:

<div />

So now we'd also allow this syntax for component templates.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is strange to me, but I don't quite see the harm other than code complexity. If this is common in Vue I'm fine with it.

@@ -1,111 +0,0 @@
from abc import ABC, abstractmethod
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the html.py file that defined the API through which we used BeautifulSoup4.

The difference is that the new html_parser.py processes HTML in a stream-like fashion. On the other hand BS4 first built the whole DOM / tree, and then allowed one to freely modify any parts of it.


# This same set of tests is also found in djc_html_parser, to ensure that
# this implementation can be replaced with the djc_html_parser's Rust-based implementation
class TestHTMLParser(TestCase):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file contains 2 kinds of files

This TestHTMLParser is the contract that we expect from the HTML parser. So even if we switch to the Rust implementation, we will expect these tests to pass.

On the other hand, the tests defined in TestHTMLParserInternal relate only to the pure-python implementation of the HTML parser.

Once I get to the PR that adds the Rust implementation, I might split this test file, so the pure python impl is separate from the expected "HTMLParser" API.


Going one step further, it might even make sense for the pure-python implementation to live as a separate package too, so one wouldn't have to load that if not needing it. But also there's no harm in the pure-python impl being here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't expect any problems with being able to install the rust version, I'm not sure why we should maintain two versions. Are we expecting problems?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't expect problems, supposedly both ruff and pydantic use maturin, so it sounds solid.

Keeping two versions was suggested / requested here.

CC @dalito What would be the reason for keeping the Python version around?

Copy link
Contributor

@dalito dalito Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I was thinking of:

  • By looking at the python code one could understand what the hidden binary does.
  • You could plugin yet another solution easily.
  • If you don't want to have compiled code in your code base you can still use djc.

But keeping the code base small and less complex is a very strong counter argument. So doing that is absolutely fine by me. My comment was purely based on discussions that I observed in the past.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good arguments dalito. This is my thoughts around those three points:

  • I'm hoping that by keeping the other library close, you can easily go there and look at the code to understand the Rust code if you wanted.
  • Since the above code defines an interface and has a good set of tests, I think that might be enough to understand enough about the project to build another plugin. I don't think we need two implementations to reach that goal.
  • I think this is valid, but I don't understand how wide-spread this need is. A huge part of the Python ecosystem is based on python wrappers of C code, so I'm not sure this is even viable for a Django project. You are going to use a database for your website, and those python database drivers are likely C backed?

My hunch is that the maintenance overhead for this is larger that what it's worth, but I'm happy to hear your thoughts.

This doesn't mean this PR can't be merged, but maybe that we'll remove this code when we add the Rust-based parser?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this discussion is not blocking this PR, I've already merged it.

I definitely wanted to have first merged the pure-python implementation, and then replace it with the Rust implementation in a separate commit, so we could come back to the commit with Python impl if ever necessary.

My argument for keeping Python impl was that, theoretically, there could be some niche OS platforms which might not be supported by the maturin build tool. But as I saw that both ruff and pydantic are using maturin, then I think it's reasonable to assume that if they are ok with the spectrum of platforms supported by maturin, then so can we be.

So actually I'd go with removing the Python impl totally (once I'm adding the Rust impl).


I don't think there will be a need to plug in a different solution. Or at least not here - there will be a support for proper plugin system, but those will be separate from this HTML parser. That way, this HTML parser can remain an implementation detail.


Also, I don't expect that there will need to any more tweaks done on the implementation any time soon.

I imagine we might want to revisit it if / when we decide that component's rendered HTML MUST be a valid HTML fragment (as far as I'm aware, we don't enforce that yet). At that point we could use this parser to report on errors with the HTML syntax (e.g. missing closing tag). And we would maybe need to modify the parser to track line and index to report the position of the error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a plan!

@@ -734,6 +761,10 @@ def get_component_media(comp_cls_hash: str) -> Media:
return (content, final_script_tags.encode("utf-8"), final_css_tags.encode("utf-8"))


href_pattern = re.compile(r'href="([^"]+)"')
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of the changes in this file is about removing dependency on BS4. For example here, to extract the URL from <script> or <link>, we'll use a regex instead of relying on BS4.


if not elems:
We find these tags by looking for the first `</head>` and last `</body>` tags.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this section is about inserting JS and CSS into their default locations.

Now, to avoid having to parse the HTML to find the end of body and head, we use following approach:

  • We find the end of <body> by searching for last occurrence of </body>
  • And for the end of <head> we search for the first occurrence of </head>

def _link_dependencies_with_component_html(
component_id: str,
html_content: str,
def set_component_attrs_for_js_and_css(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed _link_dependencies_with_component_html. And instead of using BS4 to insert the HTML attributes, it now uses our custom HTML parser to do so.

@@ -0,0 +1,169 @@
import re
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this file I implemented the top-down approach to rendering nested components, as desribed in Idea 2 - Render components top-down instead of bottom-up.

I've put this into a separate file, so I could proprly document and explain what's going on, and to keep the component.py file relatively navigatable, since it's already quite big.

Please see the comments throughout this file for details.

@JuroOravec JuroOravec marked this pull request as ready for review January 23, 2025 17:54
Copy link
Collaborator

@EmilStenstrom EmilStenstrom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of small nitpicks you can decide to ignore if you want, but I think this looks really good. Happy that you could solve this super-tricky problem so quickly.

@@ -29,7 +29,6 @@ classifiers = [
]
dependencies = [
'Django>=4.2',
'beautifulsoup4>=4.12',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to no dependencies! 🥳

return "".join(tokens) # Join all tokens at the end


def set_html_attributes(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a super-custom HTML parser, so i see what you mean now. Looks good!

@JuroOravec JuroOravec merged commit 0b65761 into django-components:master Jan 24, 2025
14 checks passed
@JuroOravec JuroOravec deleted the jo-perf-and-remove-bs4 branch January 24, 2025 09:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants