Kouhei Sutou
null+****@clear*****
Mon Mar 16 14:59:34 JST 2015
Kouhei Sutou	2015-03-16 14:59:34 +0900 (Mon, 16 Mar 2015)

  New Revision: cb3979bc464b38efe78ad66b834076ec92930877
  https://github.com/groonga/groonga/commit/cb3979bc464b38efe78ad66b834076ec92930877

  Message:
    doc: describe about TokenBigram

  Added files:
    doc/source/example/reference/tokenizers/token-bigram-ascii-and-character-type-change-with-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-ascii-and-white-space-with-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-no-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log
    doc/source/example/reference/tokenizers/tokenize-example.log
  Modified files:
    doc/source/reference/tokenizers.rst

  Added: doc/source/example/reference/tokenizers/token-bigram-ascii-and-character-type-change-with-normalizer.log (+24 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ascii-and-character-type-change-with-normalizer.log    2015-03-16 14:59:34 +0900 (95f101c)
@@ -0,0 +1,24 @@
+Execution example::
+
+  tokenize TokenBigram "100cents!!!" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "100"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "cents"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "!!!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-ascii-and-white-space-with-normalizer.log (+20 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ascii-and-white-space-with-normalizer.log    2015-03-16 14:59:34 +0900 (66e5a97)
@@ -0,0 +1,20 @@
+Execution example::
+
+  tokenize TokenBigram "Hello World" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "hello"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "world"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-no-normalizer.log (+56 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-no-normalizer.log    2015-03-16 14:59:34 +0900 (0951932)
@@ -0,0 +1,56 @@
+Execution example::
+
+  tokenize TokenBigram "Hello World"
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "He"
+  #     },
+  #     {
+  #       "position": 1,
+  #       "value": "el"
+  #     },
+  #     {
+  #       "position": 2,
+  #       "value": "ll"
+  #     },
+  #     {
+  #       "position": 3,
+  #       "value": "lo"
+  #     },
+  #     {
+  #       "position": 4,
+  #       "value": "o "
+  #     },
+  #     {
+  #       "position": 5,
+  #       "value": " W"
+  #     },
+  #     {
+  #       "position": 6,
+  #       "value": "Wo"
+  #     },
+  #     {
+  #       "position": 7,
+  #       "value": "or"
+  #     },
+  #     {
+  #       "position": 8,
+  #       "value": "rl"
+  #     },
+  #     {
+  #       "position": 9,
+  #       "value": "ld"
+  #     },
+  #     {
+  #       "position": 10,
+  #       "value": "d"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log (+36 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log    2015-03-16 14:59:34 +0900 (055d94d)
@@ -0,0 +1,36 @@
+Execution example::
+
+  tokenize TokenBigram "日本語の勉強" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1337566253.89858,
+  #     0.000355720520019531
+  #   ],
+  #   [
+  #     {
+  #       "position": 0,
+  #       "value": "日本"
"日本" + # }, + # { + # "position": 1, + # "value": "本語" + # }, + # { + # "position": 2, + # "value": "語の" + # }, + # { + # "position": 3, + # "value": "の勉" + # }, + # { + # "position": 4, + # "value": "勉強" + # }, + # { + # "position": 5, + # "value": "強" + # } + # ] + # ] Added: doc/source/example/reference/tokenizers/tokenize-example.log (+56 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/tokenizers/tokenize-example.log 2015-03-16 14:59:34 +0900 (0951932) @@ -0,0 +1,56 @@ +Execution example:: + + tokenize TokenBigram "Hello World" + # [ + # [ + # 0, + # 1337566253.89858, + # 0.000355720520019531 + # ], + # [ + # { + # "position": 0, + # "value": "He" + # }, + # { + # "position": 1, + # "value": "el" + # }, + # { + # "position": 2, + # "value": "ll" + # }, + # { + # "position": 3, + # "value": "lo" + # }, + # { + # "position": 4, + # "value": "o " + # }, + # { + # "position": 5, + # "value": " W" + # }, + # { + # "position": 6, + # "value": "Wo" + # }, + # { + # "position": 7, + # "value": "or" + # }, + # { + # "position": 8, + # "value": "rl" + # }, + # { + # "position": 9, + # "value": "ld" + # }, + # { + # "position": 10, + # "value": "d" + # } + # ] + # ] Modified: doc/source/reference/tokenizers.rst (+66 -5) =================================================================== --- doc/source/reference/tokenizers.rst 2015-03-16 14:23:11 +0900 (9a0f851) +++ doc/source/reference/tokenizers.rst 2015-03-16 14:59:34 +0900 (f4c2b8d) @@ -39,6 +39,15 @@ Normally, :ref:`token-bigram` is a suitable tokenizer. If you don't know much about tokenizer, it's recommended that you choose :ref:`token-bigram`. +You can try a tokenizer by :doc:`/reference/commands/tokenize.rst` and +:doc:`/reference/commands/table_tokenize.rst`. Here is an example to +try :ref:`token-bigram` tokenizer by +:doc:`/reference/commands/tokenize.rst`: + +.. groonga-command +.. include:: ../example/reference/tokenizers/tokenize-example.log +.. tokenize TokenBigram "Hello World" + What is "tokenize"? ------------------- @@ -63,7 +72,7 @@ In the above example, 10 tokens are extracted from one text ``Hello World``. For example, ``Hello World`` is tokenized to the following tokens by -whitespace-separate tokenize method: +white-space-separate tokenize method: * ``Hello`` * ``World`` @@ -74,8 +83,8 @@ World``. Token is used as search key. You can find indexed documents only by tokens that are extracted by used tokenize method. For example, you can find ``Hello World`` by ``ll`` with bigram tokenize method but you -can't find ``Hello World`` by ``ll`` with whitespace-separate tokenize -method. Because whitespace-separate tokenize method doesn't extract +can't find ``Hello World`` by ``ll`` with white-space-separate tokenize +method. Because white-space-separate tokenize method doesn't extract ``ll`` token. It just extracts ``Hello`` and ``World`` tokens. In general, tokenize method that generates small tokens increases @@ -87,9 +96,9 @@ bigram tokenize method. ``Hello World`` is a noise for people who wants to search "logical and". It means that precision is decreased. But recall is increased. -We can find only ``A or B`` by ``or`` with whitespace-separate +We can find only ``A or B`` by ``or`` with white-space-separate tokenize method. Because ``World`` is tokenized to one token ``World`` -with whitespace-separate tokenize method. It means that precision is +with white-space-separate tokenize method. 
 increased for people who wants to search "logical and". But recall is
 decreased because ``Hello World`` that contains ``or`` isn't found.
 
@@ -118,6 +127,58 @@ Here is a list of built-in tokenizers:
 ``TokenBigram``
 ^^^^^^^^^^^^^^^
 
+``TokenBigram`` is a bigram-based tokenizer. It's the recommended
+tokenizer for most cases.
+
+``TokenBigram`` behaves differently depending on whether it's used
+with a normalizer (see :doc:`/reference/normalizers`).
+
+If no normalizer is used, ``TokenBigram`` uses a pure bigram tokenize
+method (every token except the last has exactly two characters):
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-no-normalizer.log
+.. tokenize TokenBigram "Hello World"
+
+If a normalizer is used, ``TokenBigram`` uses a white-space-separate-like
+tokenize method for ASCII characters and a bigram tokenize method
+for non-ASCII characters.
+
+``TokenBigram`` uses one or more white-spaces as a token delimiter
+for ASCII characters:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ascii-and-white-space-with-normalizer.log
+.. tokenize TokenBigram "Hello World" NormalizerAuto
+
+``TokenBigram`` also uses a character type change as a token delimiter
+for ASCII characters. The character type is one of the following:
+
+* Alphabet
+* Digit
+* Symbol (such as ``(``, ``)`` and ``!``)
+* Hiragana
+* Katakana
+* Kanji
+* Others
+
+The following example shows two token delimiters:
+
+* between ``100`` (digits) and ``cents`` (alphabets)
+* between ``cents`` (alphabets) and ``!!!`` (symbols)
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ascii-and-character-type-change-with-normalizer.log
+.. tokenize TokenBigram "100cents!!!" NormalizerAuto
+
+Here is an example of ``TokenBigram`` using the bigram tokenize method
+for non-ASCII characters:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log
+.. tokenize TokenBigram "日本語の勉強" NormalizerAuto
+
 .. _token-bigram-split-symbol:
 
 ``TokenBigramSplitSymbol``
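The patch documents two separate ``TokenBigram`` behaviors under ``NormalizerAuto``: white-space-separate-like tokens for ASCII text and bigrams for non-ASCII text. A string that mixes both scripts shows the two rules interacting in a single call. The following is a sketch in the same style as the added ``.log`` files: the input ``"Hello 世界"`` is an illustrative choice of mine, and the commented tokens are what the rules in the patch predict (including the trailing single-character token, as in the ``"強"`` example), not captured command output:

Execution example (expected tokens inferred from the rules above, not actual output)::

  tokenize TokenBigram "Hello 世界" NormalizerAuto
  # position 0: "hello"  -- ASCII run, normalized, delimited by the white-space
  # position 1: "世界"    -- bigram over the non-ASCII characters
  # position 2: "界"      -- trailing single-character token at the end of the text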