Document dictionary compression #4760

Draft: wants to merge 11 commits into main from document-dictionnary-compression

Conversation

@Kerollmops Kerollmops (Member) commented Jul 2, 2024

This PR fixes #4750 by introducing document compression to Meilisearch.

I had to use the zstd library directly instead of lz4_flex because the latter doesn't provide an option to specify the compression level. According to the documentation, I used (see the sketch after this list):

  • A compression level of 19,
  • A dictionary size of 64,000 bytes,
  • The first 10k documents in the LMDB database as the sample set to generate the dictionary,
  • The dictionary is generated when we reach 10k documents and no dictionary exists yet,
  • The dictionary is deleted when all the documents are deleted (via an index clear or by hand).
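
For illustration, here is a minimal, hypothetical sketch of what training and using such a dictionary looks like with the zstd crate. The constants and the generated sample documents are placeholders, not the PR's actual code:

```rust
use zstd::bulk::{Compressor, Decompressor};
use zstd::dict::from_samples;

const COMPRESSION_LEVEL: i32 = 19;
const DICT_MAX_SIZE: usize = 64_000;

fn main() -> std::io::Result<()> {
    // Hypothetical stand-in for the first 10k documents read from LMDB.
    let samples: Vec<Vec<u8>> = (0..10_000)
        .map(|i| format!(r#"{{"id":{i},"title":"document {i}"}}"#).into_bytes())
        .collect();

    // Train a 64k dictionary from the sample documents
    // (training needs enough total sample bytes to succeed).
    let dictionary = from_samples(&samples, DICT_MAX_SIZE)?;

    // Compress a document at level 19 with the shared dictionary.
    let mut compressor = Compressor::with_dictionary(COMPRESSION_LEVEL, &dictionary)?;
    let compressed = compressor.compress(&samples[0])?;

    // Decompressing requires the same dictionary and an output capacity bound.
    let mut decompressor = Decompressor::with_dictionary(&dictionary)?;
    let decompressed = decompressor.decompress(&compressed, samples[0].len())?;
    assert_eq!(decompressed, samples[0]);
    Ok(())
}
```

Training from a sample rather than from every document keeps the dictionary small while still capturing the repeated field names and values shared across documents.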

Note that the benchmarks only represent the first couple of hours of real usage: the user uploads some documents and settings, and the number of documents reaches 10k+. This PR compresses the documents once there are 10k+ of them, using a dictionary generated from them, and then never changes that dictionary (unless no documents are left).

The first results show between 2x and 3x compression of the documents database (the data.ms goes from 25GiB to 16GiB, and the average document size drops from 305B to 126B). Still, we can see a performance regression because the compression is done on a single thread (👇).

To do

  • Resolve merge conflicts by rebasing on main
  • Make sure the tests are passing
  • Measure performance impact with Add search benchmarks #4762
  • Simplify the usage of document decoding
  • Specify a const for document sample size, compression level, and dictionary size
  • Compress the documents in parallel instead of in the write loop (see the sketch after this list)
  • See if we can use the experimental zstd feature to avoid copying 64k bytes in memory.
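
One possible shape for the parallelization item, assuming rayon (already a Meilisearch dependency) and the zstd bulk API: map_init gives each worker thread its own compressor bound to the shared dictionary, since a compressor cannot be shared across threads. This is a sketch of the idea, not the PR's implementation:

```rust
use rayon::prelude::*;
use zstd::bulk::Compressor;

const COMPRESSION_LEVEL: i32 = 19;

/// Compress every document in parallel with the shared dictionary,
/// instead of compressing them one by one inside the write loop.
fn compress_documents(
    documents: &[Vec<u8>],
    dictionary: &[u8],
) -> std::io::Result<Vec<Vec<u8>>> {
    documents
        .par_iter()
        .map_init(
            // Each rayon worker gets its own compressor for the dictionary.
            || Compressor::with_dictionary(COMPRESSION_LEVEL, dictionary)
                .expect("the trained dictionary must be valid"),
            |compressor, document| compressor.compress(document),
        )
        .collect()
}
```

The compressed results could then be written sequentially in the LMDB write loop, keeping the single-writer constraint while moving the CPU-heavy compression onto all cores.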

@Kerollmops Kerollmops added the performance (related to the performance in terms of search/indexation speed or RAM/CPU/Disk consumption) and disk space usage labels Jul 2, 2024
@Kerollmops Kerollmops added this to the v1.10.0 milestone Jul 2, 2024
@Kerollmops Kerollmops force-pushed the document-dictionnary-compression branch 2 times, most recently from b567c8b to 264baed on July 3, 2024 09:47
@Kerollmops Kerollmops (Member Author):

/bench workloads/search/*.json

@Kerollmops Kerollmops (Member Author):

/bench workloads/*.json

@Kerollmops Kerollmops (Member Author):

/bench workloads/search/*.json

@Kerollmops Kerollmops changed the title from Document dictionnary compression to Document dictionary compression Jul 3, 2024
@Kerollmops Kerollmops force-pushed the document-dictionnary-compression branch from 11e4f9f to f73d95d on July 4, 2024 09:33
@Kerollmops Kerollmops (Member Author):

/bench workloads/hackernews-ignore-first-100k.json

@meili-bot (Contributor):

☀️ Benchmark invocation completed, please find the results for your workloads below:

@Kerollmops Kerollmops force-pushed the document-dictionnary-compression branch from f73d95d to a63f202 on July 8, 2024 13:33
@Kerollmops Kerollmops force-pushed the document-dictionnary-compression branch from a63f202 to deee22b on July 10, 2024 14:42
@curquiza curquiza removed this from the v1.10.0 milestone Jul 23, 2024