[Groonga-commit] groonga/groonga at cb3979b [master] doc: describe about TokenBigram


Kouhei Sutou null+****@clear*****
Mon Mar 16 14:59:34 JST 2015


Kouhei Sutou	2015-03-16 14:59:34 +0900 (Mon, 16 Mar 2015)

  New Revision: cb3979bc464b38efe78ad66b834076ec92930877
  https://github.com/groonga/groonga/commit/cb3979bc464b38efe78ad66b834076ec92930877

  Message:
    doc: describe about TokenBigram

  Added files:
    doc/source/example/reference/tokenizers/token-bigram-ascii-and-character-type-change-with-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-ascii-and-white-space-with-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-no-normalizer.log
    doc/source/example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log
    doc/source/example/reference/tokenizers/tokenize-example.log
  Modified files:
    doc/source/reference/tokenizers.rst

  Added: doc/source/example/reference/tokenizers/token-bigram-ascii-and-character-type-change-with-normalizer.log (+24 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ascii-and-character-type-change-with-normalizer.log    2015-03-16 14:59:34 +0900 (95f101c)
@@ -0,0 +1,24 @@
+Execution example::
+
+  tokenize TokenBigram "100cents!!!" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "100"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "cents"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "!!!"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-ascii-and-white-space-with-normalizer.log (+20 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-ascii-and-white-space-with-normalizer.log    2015-03-16 14:59:34 +0900 (66e5a97)
@@ -0,0 +1,20 @@
+Execution example::
+
+  tokenize TokenBigram "Hello World" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "hello"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "world"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-no-normalizer.log (+56 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-no-normalizer.log    2015-03-16 14:59:34 +0900 (0951932)
@@ -0,0 +1,56 @@
+Execution example::
+
+  tokenize TokenBigram "Hello World"
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "He"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "el"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "ll"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "lo"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "o "
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": " W"
+  #     }, 
+  #     {
+  #       "position": 6, 
+  #       "value": "Wo"
+  #     }, 
+  #     {
+  #       "position": 7, 
+  #       "value": "or"
+  #     }, 
+  #     {
+  #       "position": 8, 
+  #       "value": "rl"
+  #     }, 
+  #     {
+  #       "position": 9, 
+  #       "value": "ld"
+  #     }, 
+  #     {
+  #       "position": 10, 
+  #       "value": "d"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log (+36 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log    2015-03-16 14:59:34 +0900 (055d94d)
@@ -0,0 +1,36 @@
+Execution example::
+
+  tokenize TokenBigram "日本語の勉強" NormalizerAuto
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "日本"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "本語"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "語の"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "の勉"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "勉強"
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": "強"
+  #     }
+  #   ]
+  # ]

  Added: doc/source/example/reference/tokenizers/tokenize-example.log (+56 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/tokenize-example.log    2015-03-16 14:59:34 +0900 (0951932)
@@ -0,0 +1,56 @@
+Execution example::
+
+  tokenize TokenBigram "Hello World"
+  # [
+  #   [
+  #     0, 
+  #     1337566253.89858, 
+  #     0.000355720520019531
+  #   ], 
+  #   [
+  #     {
+  #       "position": 0, 
+  #       "value": "He"
+  #     }, 
+  #     {
+  #       "position": 1, 
+  #       "value": "el"
+  #     }, 
+  #     {
+  #       "position": 2, 
+  #       "value": "ll"
+  #     }, 
+  #     {
+  #       "position": 3, 
+  #       "value": "lo"
+  #     }, 
+  #     {
+  #       "position": 4, 
+  #       "value": "o "
+  #     }, 
+  #     {
+  #       "position": 5, 
+  #       "value": " W"
+  #     }, 
+  #     {
+  #       "position": 6, 
+  #       "value": "Wo"
+  #     }, 
+  #     {
+  #       "position": 7, 
+  #       "value": "or"
+  #     }, 
+  #     {
+  #       "position": 8, 
+  #       "value": "rl"
+  #     }, 
+  #     {
+  #       "position": 9, 
+  #       "value": "ld"
+  #     }, 
+  #     {
+  #       "position": 10, 
+  #       "value": "d"
+  #     }
+  #   ]
+  # ]

  Modified: doc/source/reference/tokenizers.rst (+66 -5)
===================================================================
--- doc/source/reference/tokenizers.rst    2015-03-16 14:23:11 +0900 (9a0f851)
+++ doc/source/reference/tokenizers.rst    2015-03-16 14:59:34 +0900 (f4c2b8d)
@@ -39,6 +39,15 @@ Normally, :ref:`token-bigram` is a suitable tokenizer. If you don't
 know much about tokenizer, it's recommended that you choose
 :ref:`token-bigram`.
 
+You can try a tokenizer with :doc:`/reference/commands/tokenize` and
+:doc:`/reference/commands/table_tokenize`. Here is an example of
+trying the :ref:`token-bigram` tokenizer with
+:doc:`/reference/commands/tokenize`:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/tokenize-example.log
+.. tokenize TokenBigram "Hello World"
+
 What is "tokenize"?
 -------------------
 
@@ -63,7 +72,7 @@ In the above example, 10 tokens are extracted from one text ``Hello
 World``.
 
 For example, ``Hello World`` is tokenized to the following tokens by
-whitespace-separate tokenize method:
+white-space-separate tokenize method:
 
   * ``Hello``
   * ``World``
@@ -74,8 +83,8 @@ World``.
 Token is used as search key. You can find indexed documents only by
 tokens that are extracted by used tokenize method. For example, you
 can find ``Hello World`` by ``ll`` with bigram tokenize method but you
-can't find ``Hello World`` by ``ll`` with whitespace-separate tokenize
-method. Because whitespace-separate tokenize method doesn't extract
+can't find ``Hello World`` by ``ll`` with white-space-separate tokenize
+method. Because white-space-separate tokenize method doesn't extract
 ``ll`` token. It just extracts ``Hello`` and ``World`` tokens.
 
 In general, tokenize method that generates small tokens increases
@@ -87,9 +96,9 @@ bigram tokenize method. ``Hello World`` is a noise for people who
 wants to search "logical and". It means that precision is
 decreased. But recall is increased.
 
-We can find only ``A or B`` by ``or`` with whitespace-separate
+We can find only ``A or B`` by ``or`` with white-space-separate
 tokenize method. Because ``World`` is tokenized to one token ``World``
-with whitespace-separate tokenize method. It means that precision is
+with white-space-separate tokenize method. It means that precision is
 increased for people who wants to search "logical and". But recall is
 decreased because ``Hello World`` that contains ``or`` isn't found.
 
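For comparison, the white-space-separate style of tokenization
discussed above can be tried with the built-in ``TokenDelimit``
tokenizer, which splits text on space characters. With it,
``Hello World`` yields only the ``Hello`` and ``World`` tokens, so a
query for ``ll`` finds nothing. A minimal sketch (output omitted)::

  tokenize TokenDelimit "Hello World"
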
@@ -118,6 +127,58 @@ Here is a list of built-in tokenizers:
 ``TokenBigram``
 ^^^^^^^^^^^^^^^
 
+``TokenBigram`` is a bigram-based tokenizer. It's recommended to use
+this tokenizer for most cases.
+
+``TokenBigram`` behaves differently depending on whether it is used
+with any of the :doc:`/reference/normalizers`.
+
+If no normalizer is used, ``TokenBigram`` uses the pure bigram
+tokenize method (all tokens except the last have two characters):
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-no-normalizer.log
+.. tokenize TokenBigram "Hello World"
+
+If a normalizer is used, ``TokenBigram`` uses a white-space-separate
+like tokenize method for ASCII characters and the bigram tokenize
+method for non-ASCII characters.
+
+``TokenBigram`` uses one or more white-spaces as a token delimiter
+for ASCII characters:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ascii-and-white-space-with-normalizer.log
+.. tokenize TokenBigram "Hello World" NormalizerAuto
+
+``TokenBigram`` also uses a character type change as a token delimiter
+for ASCII characters. Character type is one of the following:
+
+  * Alphabet
+  * Digit
+  * Symbol (such as ``(``, ``)`` and ``!``)
+  * Hiragana
+  * Katakana
+  * Kanji
+  * Others
+
+The following example shows two token delimiters:
+
+  * between ``100`` (digits) and ``cents`` (alphabets)
+  * between ``cents`` (alphabets) and ``!!!`` (symbols)
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-ascii-and-character-type-change-with-normalizer.log
+.. tokenize TokenBigram "100cents!!!" NormalizerAuto
+
+Here is an example in which ``TokenBigram`` uses the bigram tokenize
+method for non-ASCII characters:
+
+.. groonga-command
+.. include:: ../example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log
+.. tokenize TokenBigram "日本語の勉強" NormalizerAuto
+
+
 .. _token-bigram-split-symbol
 
 ``TokenBigramSplitSymbol``
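
The documentation above also points at the ``table_tokenize`` command.
A minimal sketch of trying ``TokenBigram`` through a lexicon table
(the ``Terms`` table name is only an illustrative choice, and output
is omitted)::

  table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram --normalizer NormalizerAuto
  table_tokenize Terms "Hello World"

``table_tokenize`` uses the tokenizer and normalizer registered to the
given table, so this should tokenize the text the same way as the
``tokenize TokenBigram "Hello World" NormalizerAuto`` example above.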