Increasing Indexed Column Size (Pgbigm-hackers) - pg_bigm

On Tue, Mar 17, 2015 at 5:53 PM, Sawada Masahiko <sawad****@gmail*****> wrote:
> On Mon, Mar 16, 2015 at 4:10 PM, Beena Emerson <memis****@gmail*****> wrote:
>> Hello,
>>
>> In the functions generate_bigm and generate_wildcard_bigm, the current
>> formula used to allocate space for the generated bigrams is picked up from
>> pg_trgm:
>> bgm = (BIGM *) palloc(VARHDRSZ + sizeof(bigm) * (Size) (slen / 2 + 1) * 3);
>> This limits the current indexed column size to 43 MB.
>>
>> However, this formula allocates more memory than required.
>> Consider a string with w words of varying length (w - 1 special characters
>> between words).
>> Slen = Sum(word_length) + w - 1 ....(1)
>> Number of bigrams generated by 1 word = word_length + 1 bigrams.
>>
>> For the whole string the total number of bigrams generated
>> = (word_length1 + 1)  + (word_length2 + 1) .. .for w words.
>> = sum(word_length) + w
>> = (slen - w + 1) + w   ...[from equation 1]
>> = slen + 1
>>
>> Hence the maximum number of bigrams generated by  a string is slen + 1.

This formula is true only when LPADDING = 1 and RPADDING = 1 basically.
That is, if they are changed, the formula also needs to be updated. I think
we should add the comment about this fact.

Don't we need to check whether palloc() can allocate "slen + 1" memory or not
before actually trying to call the palloc()? You can see that pg_trgm has this
kind of safeguard, and an error "out of memory" is emit if the requested size
is too large and palloc() cannnot allocate such memory.

>>
>> The attached patches update the code and documentation
>> 01_update formula.patch - This updates the formula used in the functions
>> generate_bigm and generate_wildcard_bigm. This increases the indexed column
>> size to 64 MB.
>>
>> 02_update_datatype.patch - Updates the data type of bytelen in the struct
>> bigm from int to int8. This further increases the indexed column size to
>> 102MB.
>>
>> 03_update_documentation.patch - Updates the indexed column size values in
>> html/pg_bigm_en-1-1.html file.
>>
>
> Thank you for your patch!
>
> These are good patch for making relax the pg_bigm restriction.
> Also, 02_update_datatype.patch reduces GIN index size.

Oh, really? I was thinking that GIN key entry is stored in text (i.e., varlena)
data type, so changing the data type of the field in bigm struct doesn't
affect the GIN key entry directly at all.

> For more increasing indexed column size, when allocation memory size
> exceeds limit we can have following two ways,
>
> 1.  Separate the input string into several strings, and concatenate
> bigms after generating each bigm.
> 2.  Recalculate multi-byte word length, and re-try to allocate memory.
>      (Currently pg_bigm always use byte length, but multi-byte word
> could be more than 1 bytes.)
>
> #1 eliminates the restrictions, #2 would more increase indexed column size.
> But pg_bigm allocates memory using by traditional way at first time
> for performance reason.
>
> Thought?

What about the third one? ;) The idea is to change the bigm struct as follows.

typedef struct
{
                bool    pmatch;
                int8    bytelen;
                char    *str;
} bigm;

Currently every two characters are stored in the bigm struct directly.
But ISTM that we can change the struct so that it stores just the pointer to
two characters instead because whole characters are already stored in
palloc()'d memory area.

Which versions should we apply these changes? I'm thinking to do that only on
the master because they (i.e., alleviating the column size limitation) look new
features. OTOH, Beena's second patch (i.e., change from int to int8) is really
simple, small but very useful. So I'm tempted to apply that to even older
versions. Thought?

Regards,

-- 
Fujii Masao

pg_bigm
Fork

[Pgbigm-hackers] Increasing Indexed Column Size

pg_bigm Fork

[Pgbigm-hackers] Increasing Indexed Column Size

pg_bigm
Fork