Document dictionary compression #4760

Draft: wants to merge 11 commits into main from document-dictionnary-compression

Conversation

@Kerollmops Kerollmops (Member) commented Jul 2, 2024

This PR fixes #4750 by introducing document compression to Meilisearch.

I had to use the zstd library directly instead of lz4_flex because the latter doesn't provide an option to specify the compression level. According to the documentation, I used (see the sketch after this list):

  • A compression level of 19,
  • A dictionary size of 64,000 bytes,
  • The first 10k documents in the LMDB database as the sample set to generate the dictionary,
  • The dictionary is generated when we reach 10k documents and no dictionary exists yet,
  • The dictionary is deleted when all the documents are deleted (via an index clear or by hand).
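
For illustration, here is a minimal, hypothetical sketch of what training and using such a dictionary looks like with the zstd crate. The constants and the generated sample documents are placeholders, not the PR's actual code:

```rust
use zstd::bulk::{Compressor, Decompressor};
use zstd::dict::from_samples;

const COMPRESSION_LEVEL: i32 = 19;
const DICT_MAX_SIZE: usize = 64_000;

fn main() -> std::io::Result<()> {
    // Hypothetical stand-in for the first 10k documents read from LMDB.
    let samples: Vec<Vec<u8>> = (0..10_000)
        .map(|i| format!(r#"{{"id":{i},"title":"document {i}"}}"#).into_bytes())
        .collect();

    // Train a 64k dictionary from the sample documents
    // (training needs enough total sample bytes to succeed).
    let dictionary = from_samples(&samples, DICT_MAX_SIZE)?;

    // Compress a document at level 19 with the shared dictionary.
    let mut compressor = Compressor::with_dictionary(COMPRESSION_LEVEL, &dictionary)?;
    let compressed = compressor.compress(&samples[0])?;

    // Decompressing requires the same dictionary and an output capacity bound.
    let mut decompressor = Decompressor::with_dictionary(&dictionary)?;
    let decompressed = decompressor.decompress(&compressed, samples[0].len())?;
    assert_eq!(decompressed, samples[0]);
    Ok(())
}
```

Training from a sample rather than from every document keeps the dictionary small while still capturing the repeated field names and values shared across documents.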

Note that the benchmarks only represent the first couple of hours of real usage: the user uploads some documents and settings, and the number of documents reaches 10k+. This PR compresses the documents once there are 10k+ of them, using a dictionary generated from them, and then never changes that dictionary (unless no documents are left).

The first results show between 2x and 3x compression of the documents database (the data.ms goes from 25GiB to 16GiB, and the average document size drops from 305B to 126B). Still, we can see a performance regression because the compression is done on a single thread (👇).

To do

  • Resolve merge conflicts by rebasing on main
  • Make sure the tests are passing
  • Measure performance impact with Add search benchmarks #4762
  • Simplify the usage of document decoding
  • Specify a const for document sample size, compression level, and dictionary size
  • Compress the documents in parallel instead of in the write loop (see the sketch after this list)
  • See if we can use the experimental zstd feature to avoid copying 64k bytes in memory.
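
One possible shape for the parallelization item, assuming rayon (already a Meilisearch dependency) and the zstd bulk API: map_init gives each worker thread its own compressor bound to the shared dictionary, since a compressor cannot be shared across threads. This is a sketch of the idea, not the PR's implementation:

```rust
use rayon::prelude::*;
use zstd::bulk::Compressor;

const COMPRESSION_LEVEL: i32 = 19;

/// Compress every document in parallel with the shared dictionary,
/// instead of compressing them one by one inside the write loop.
fn compress_documents(
    documents: &[Vec<u8>],
    dictionary: &[u8],
) -> std::io::Result<Vec<Vec<u8>>> {
    documents
        .par_iter()
        .map_init(
            // Each rayon worker gets its own compressor for the dictionary.
            || Compressor::with_dictionary(COMPRESSION_LEVEL, dictionary)
                .expect("the trained dictionary must be valid"),
            |compressor, document| compressor.compress(document),
        )
        .collect()
}
```

The compressed results could then be written sequentially in the LMDB write loop, keeping the single-writer constraint while moving the CPU-heavy compression onto all cores.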

@Kerollmops Kerollmops added the performance (related to the performance in terms of search/indexation speed or RAM/CPU/Disk consumption) and disk space usage labels Jul 2, 2024
@Kerollmops Kerollmops added this to the v1.10.0 milestone Jul 2, 2024
@Kerollmops Kerollmops force-pushed the document-dictionnary-compression branch 2 times, most recently from b567c8b to 264baed on July 3, 2024 09:47
@Kerollmops Kerollmops (Member Author):

/bench workloads/search/*.json

@Kerollmops Kerollmops (Member Author):

/bench workloads/*.json

@Kerollmops Kerollmops (Member Author):

/bench workloads/search/*.json

@Kerollmops Kerollmops changed the title from Document dictionnary compression to Document dictionary compression Jul 3, 2024
@Kerollmops Kerollmops force-pushed the document-dictionnary-compression branch from 11e4f9f to f73d95d on July 4, 2024 09:33
@Kerollmops Kerollmops (Member Author):

/bench workloads/hackernews-ignore-first-100k.json

@meili-bot (Contributor):

☀️ Benchmark invocation completed, please find the results for your workloads below:

@Kerollmops Kerollmops force-pushed the document-dictionnary-compression branch from f73d95d to a63f202 on July 8, 2024 13:33
@Kerollmops Kerollmops force-pushed the document-dictionnary-compression branch from a63f202 to deee22b on July 10, 2024 14:42
@curquiza curquiza removed this from the v1.10.0 milestone Jul 23, 2024