-
-
Notifications
You must be signed in to change notification settings - Fork 96
refactor: replace bs4 and perf optimizations #927
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: replace bs4 and perf optimizations #927
Conversation
@@ -1208,6 +1247,14 @@ def _validate_outputs(self, data: Any) -> None: | |||
validate_typed_dict(data, data_type, f"Component '{self.name}'", "data") | |||
|
|||
|
|||
# Perf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section below relates to second point in #910 (comment):
Avoid duplicitly creating ComponentNode subclasses, as described in #910 (comment)
# After rendering is done, remove the current state from the stack, which means | ||
# properties like `self.context` will no longer return the current state. | ||
self._render_stack.pop() | ||
context.render_context.pop() | ||
|
||
return output | ||
# Internal component HTML post-processing: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of the change to make it possible to parse the component HTML only once. It's achieved by deferring the postprocessing of the component output until its parent has been post-processed.
That way, by the time we get to post-processing the child component, we know if there are any extra HTML attributes like data-djc-id-a1b2c3
, that spilled over from the parent onto the child.
For context see "Idea 2 - Render components top-down instead of bottom-up"
def post_processor(root_attributes: Optional[List[str]] = None) -> Tuple[str, Dict[str, List[str]]]: | ||
nonlocal html_content | ||
|
||
updated_html, child_components = set_component_attrs_for_js_and_css( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously set_component_attrs_for_js_and_css
was named _link_dependencies_with_component_html
and it was inside postprocess_component_html
. I've split the two for simplicity.
@@ -0,0 +1,1147 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the pure Python implementation of the of the HTML parser. This is the same implementation as what I've shared at the end of this comment.
This file has 3 important parts:
-
_parse_html()
is the low-level parser that is given the HTML. -
To make it easy to make modifications to the HTML, and in a single parse,
_parse_html
accepts a callback that's called for each HTML element it encounters.When the callback is called, the callback receives an instance of
HTMLTag
.HTMLTag
defines API for modifying the HTML element at hand - e.g. by adding / removing attributes, adding / removing content, etc.Example
# Add attribute `data-djc-id-123` to ALL tags def on_tag(tag: HTMLTag, stack: List[HTMLTag]): on_tag.add_attr("data-djc-id-123", None, False) updated_html = _parse_html(html, on_tag)
-
Lastly at the very end, there's
set_html_attributes()
. This is the interface that's shared by this and Rust implementation.
text: str, | ||
on_tag: Callable[[HTMLTag, List[HTMLTag]], None], | ||
*, | ||
expand_shorthand_tags: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, one more on the HTML parser - So I initially implemented it for the use case of converting Vue syntax to Django, I also added the support for expanding self-closing HTML attributes.
In other words, in standard HTML, this is invalid:
<div />
Because <div>
is NOT a void element like <input />
, <img />
, etc.
However, what Vue does is that when it sees a "self-closing" HTML element like that, which is NOT a void element, it expands it into
<div></div>
For Vue and React I believe it's about normalizing the behaviour of components embedded inside the template, and the rest of HTML elements.
E.g. if one can do this in Vue / React:
<MyComp />
Then for convenence, they allow the same syntax also for standard HTML elements:
<div />
So now we'd also allow this syntax for component templates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is strange to me, but I don't quite see the harm other than code complexity. If this is common in Vue I'm fine with it.
@@ -1,111 +0,0 @@ | |||
from abc import ABC, abstractmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the html.py
file that defined the API through which we used BeautifulSoup4.
The difference is that the new html_parser.py processes HTML in a stream-like fashion. On the other hand BS4 first built the whole DOM / tree, and then allowed one to freely modify any parts of it.
|
||
# This same set of tests is also found in djc_html_parser, to ensure that | ||
# this implementation can be replaced with the djc_html_parser's Rust-based implementation | ||
class TestHTMLParser(TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file contains 2 kinds of files
This TestHTMLParser
is the contract that we expect from the HTML parser. So even if we switch to the Rust implementation, we will expect these tests to pass.
On the other hand, the tests defined in TestHTMLParserInternal relate only to the pure-python implementation of the HTML parser.
Once I get to the PR that adds the Rust implementation, I might split this test file, so the pure python impl is separate from the expected "HTMLParser" API.
Going one step further, it might even make sense for the pure-python implementation to live as a separate package too, so one wouldn't have to load that if not needing it. But also there's no harm in the pure-python impl being here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't expect any problems with being able to install the rust version, I'm not sure why we should maintain two versions. Are we expecting problems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I was thinking of:
- By looking at the python code one could understand what the hidden binary does.
- You could plugin yet another solution easily.
- If you don't want to have compiled code in your code base you can still use djc.
But keeping the code base small and less complex is a very strong counter argument. So doing that is absolutely fine by me. My comment was purely based on discussions that I observed in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good arguments dalito. This is my thoughts around those three points:
- I'm hoping that by keeping the other library close, you can easily go there and look at the code to understand the Rust code if you wanted.
- Since the above code defines an interface and has a good set of tests, I think that might be enough to understand enough about the project to build another plugin. I don't think we need two implementations to reach that goal.
- I think this is valid, but I don't understand how wide-spread this need is. A huge part of the Python ecosystem is based on python wrappers of C code, so I'm not sure this is even viable for a Django project. You are going to use a database for your website, and those python database drivers are likely C backed?
My hunch is that the maintenance overhead for this is larger that what it's worth, but I'm happy to hear your thoughts.
This doesn't mean this PR can't be merged, but maybe that we'll remove this code when we add the Rust-based parser?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this discussion is not blocking this PR, I've already merged it.
I definitely wanted to have first merged the pure-python implementation, and then replace it with the Rust implementation in a separate commit, so we could come back to the commit with Python impl if ever necessary.
My argument for keeping Python impl was that, theoretically, there could be some niche OS platforms which might not be supported by the maturin
build tool. But as I saw that both ruff
and pydantic
are using maturin
, then I think it's reasonable to assume that if they are ok with the spectrum of platforms supported by maturin
, then so can we be.
So actually I'd go with removing the Python impl totally (once I'm adding the Rust impl).
I don't think there will be a need to plug in a different solution. Or at least not here - there will be a support for proper plugin system, but those will be separate from this HTML parser. That way, this HTML parser can remain an implementation detail.
Also, I don't expect that there will need to any more tweaks done on the implementation any time soon.
I imagine we might want to revisit it if / when we decide that component's rendered HTML MUST be a valid HTML fragment (as far as I'm aware, we don't enforce that yet). At that point we could use this parser to report on errors with the HTML syntax (e.g. missing closing tag). And we would maybe need to modify the parser to track line and index to report the position of the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a plan!
@@ -734,6 +761,10 @@ def get_component_media(comp_cls_hash: str) -> Media: | |||
return (content, final_script_tags.encode("utf-8"), final_css_tags.encode("utf-8")) | |||
|
|||
|
|||
href_pattern = re.compile(r'href="([^"]+)"') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of the changes in this file is about removing dependency on BS4. For example here, to extract the URL from <script>
or <link>
, we'll use a regex instead of relying on BS4.
|
||
if not elems: | ||
We find these tags by looking for the first `</head>` and last `</body>` tags. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this section is about inserting JS and CSS into their default locations.
Now, to avoid having to parse the HTML to find the end of body and head, we use following approach:
- We find the end of
<body>
by searching for last occurrence of</body>
- And for the end of
<head>
we search for the first occurrence of</head>
def _link_dependencies_with_component_html( | ||
component_id: str, | ||
html_content: str, | ||
def set_component_attrs_for_js_and_css( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed _link_dependencies_with_component_html
. And instead of using BS4 to insert the HTML attributes, it now uses our custom HTML parser to do so.
@@ -0,0 +1,169 @@ | |||
import re |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this file I implemented the top-down approach to rendering nested components, as desribed in Idea 2 - Render components top-down instead of bottom-up.
I've put this into a separate file, so I could proprly document and explain what's going on, and to keep the component.py
file relatively navigatable, since it's already quite big.
Please see the comments throughout this file for details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of small nitpicks you can decide to ignore if you want, but I think this looks really good. Happy that you could solve this super-tricky problem so quickly.
@@ -29,7 +29,6 @@ classifiers = [ | |||
] | |||
dependencies = [ | |||
'Django>=4.2', | |||
'beautifulsoup4>=4.12', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back to no dependencies! 🥳
return "".join(tokens) # Join all tokens at the end | ||
|
||
|
||
def set_html_attributes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a super-custom HTML parser, so i see what you mean now. Looks good!
Part of #14
Changes:
<div />
The plan of action is: