
Adding query performance optimizations #57


Open · wants to merge 1 commit into master from improve-get-category-messages

Conversation


@tee8z tee8z commented Feb 25, 2025

We have found that adding these few indexes, and adding the most common query paths for the get_category_messages function as static queries, greatly improves the performance of our application using this database model. In our own experience, these changes reduced the query time for ~1000 rows via get_category_messages from ~200ms to ~6ms. We lean on get_category_messages heavily throughout our application logic to get the next batch of messages for a given stream to process, so this has had massively positive benefits. The additional indexes also contributed to that speedup.

One thing to note: since we are adding 3 indexes, writes may slow down somewhat, so how often the indexes are hit should be watched closely for a given use case.
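The actual index definitions live in the PR diff, which isn't reproduced in this thread, so the following is only a sketch of the kind of index described above, with hypothetical names. Message DB derives a message's category from stream_name via its category() function, so an expression index over that expression plus global_position is the natural shape for speeding up get_category_messages:

```sql
-- Hypothetical example only; the real DDL is in the PR's commit.
-- An expression index matching the get_category_messages predicate:
--   category(stream_name) = _category AND global_position >= _position
CREATE INDEX CONCURRENTLY example_messages_category_idx
  ON messages (category(stream_name), global_position);
```

As the description notes, each additional index adds write amplification on INSERT, so it is worth checking pg_stat_user_indexes.idx_scan periodically to confirm the index is actually being used for your workload.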

@tee8z tee8z force-pushed the improve-get-category-messages branch from bd5bfd4 to 6f0593a Compare February 26, 2025 15:00
@tee8z tee8z marked this pull request as ready for review February 26, 2025 15:00
@sbellware (Contributor) commented Mar 3, 2025

@tee8z This is good stuff. Thanks.

It's not clear yet that it's pertinent to most users. The point about write performance is top-of-mind, but it's not entirely definitive, either. It depends on the workloads.

I'm really curious about your workloads. Having some more insight would help characterize the needs of the change, and the tradeoff.

Typically, the query performance of retrieving a batch of messages from a category in order to process those messages is insignificant compared to the amount of time it takes to process those messages.

Optimizing the batch retrieval query improves the performance of an infinitesimal fraction of the total time between the execution of batch retrieval queries.

The typical workload looks like:

retrieve 1000 messages -> process each message of the 1000 retrieved -> retrieve another 1000 messages -> process each message of the 1000 retrieved -> and repeat

The slow part is the processing of the messages. Making the batch retrieval go faster doesn't make much of a noticeable impact on the system performance because for each message in the batch, there's at least one message write, or some other I/O.

So, let's say that it takes 1000 units of time to retrieve and process 1000 messages. Out of those 1000 units of time, 1 unit might be taken up by the retrieval. Optimizing 1/1000th of the operation usually doesn't yield an improvement worth the extra cost of ownership of the additional implementation.
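The arithmetic above can be made concrete with an illustrative query (the numbers are the hypothetical units from this comment plus the ~200ms -> ~6ms figure from the PR description, not measurements):

```sql
-- 1000 units to retrieve and process a batch; 1 unit of that is retrieval.
-- Even a ~33x faster retrieval (200ms -> 6ms) recovers under 0.1% of the cycle.
SELECT
  1.0 / 1000.0                    AS retrieval_share_of_cycle,  -- 0.001
  1.0 - (6.0 / 200.0)             AS retrieval_units_saved,     -- 0.97
  (1.0 - (6.0 / 200.0)) / 1000.0  AS fraction_of_cycle_saved;   -- 0.00097
```

Under these assumptions the end-to-end improvement is on the order of a tenth of a percent, which is why the shape of the workload matters so much here.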

In other workloads, such as bulk loading and data analysis, the category batch retrieval might happen in a tight loop, which makes its performance far more significant.

Can you disclose the kind of workload you have, or can you characterize how much time is spent in the retrieval of a batch and the processing of a batch?

Thanks!
