[Groonga-commit] groonga/groonga at 602bef5 [master] doc: describe about "tokenize"

Kouhei Sutou null+****@clear*****
Mon Mar 16 14:23:11 JST 2015


Kouhei Sutou	2015-03-16 14:23:11 +0900 (Mon, 16 Mar 2015)

  New Revision: 602bef51e4078067d0da59b223a91503d55da522
  https://github.com/groonga/groonga/commit/602bef51e4078067d0da59b223a91503d55da522

  Message:
    doc: describe about "tokenize"

  Modified files:
    doc/source/reference/tokenizers.rst

  Modified: doc/source/reference/tokenizers.rst (+75 -9)
===================================================================
--- doc/source/reference/tokenizers.rst    2015-03-16 14:22:54 +0900 (9283376)
+++ doc/source/reference/tokenizers.rst    2015-03-16 14:23:11 +0900 (9a0f851)
@@ -12,20 +12,86 @@ Summary
 -------
 
 Groonga has a tokenizer module that tokenizes text. It is used in
-indexing text and searching by query.
+the following cases:
 
-.. figure:: /images/reference/tokenizers/used-when-indexing.png
-   :align: center
-   :width: 80%
+  * Indexing text
 
-   Tokenizer is used when indexing text.
+    .. figure:: /images/reference/tokenizers/used-when-indexing.png
+       :align: center
+       :width: 80%
 
-.. figure:: /images/reference/tokenizers/used-when-searching.png
-   :align: center
-   :width: 80%
+       A tokenizer is used when indexing text.
 
-   Tokenizer is used when searching by query.
+  * Searching by query
 
+    .. figure:: /images/reference/tokenizers/used-when-searching.png
+       :align: center
+       :width: 80%
+
+       A tokenizer is used when searching by query.
+
+The tokenizer is an important module for full-text search. You can
+change the trade-off between `precision and recall
+<http://en.wikipedia.org/wiki/Precision_and_recall>`_ by changing the
+tokenizer.
+
+Normally, :ref:`token-bigram` is a suitable tokenizer. If you don't
+know much about tokenizers, it's recommended that you choose
+:ref:`token-bigram`.
+
+What is "tokenize"?
+-------------------
+
+"tokenize" is the process that extracts zero or more tokens from a
+text. There are some "tokenize" methods.
+
+For example, ``Hello World`` is tokenized to the following tokens by
+the bigram tokenize method:
+
+  * ``He``
+  * ``el``
+  * ``ll``
+  * ``lo``
+  * ``o ``
+  * `` W``
+  * ``Wo``
+  * ``or``
+  * ``rl``
+  * ``ld``
+
+In the above example, 10 tokens are extracted from the single text
+``Hello World``.
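+
+You can try this tokenize method with Groonga's ``tokenize`` command
+(a minimal sketch; it assumes a Groonga version that provides the
+``tokenize`` command and uses ``TokenBigram`` without a normalizer,
+so that pure bigram tokenization is applied)::
+
+  # Tokenize "Hello World" with the bigram tokenize method.
+  # The output is a list of tokens such as
+  # {"value": "He", "position": 0}, {"value": "el", "position": 1}, ...
+  tokenize TokenBigram "Hello World"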
+
+For example, ``Hello World`` is tokenized to the following tokens by
+the whitespace-separate tokenize method:
+
+  * ``Hello``
+  * ``World``
+
+In the above example, 2 tokens are extracted from the single text
+``Hello World``.
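+
+The whitespace-separate tokenize method corresponds to Groonga's
+``TokenDelimit`` tokenizer; here is a matching sketch under the same
+assumption as above::
+
+  # Tokenize "Hello World" by white-space.
+  # The output contains only two tokens:
+  # {"value": "Hello", "position": 0}, {"value": "World", "position": 1}
+  tokenize TokenDelimit "Hello World"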
+
+Tokens are used as search keys. You can find indexed documents only by
+tokens that are extracted by the tokenize method in use. For example,
+you can find ``Hello World`` by ``ll`` with the bigram tokenize method,
+but you can't find ``Hello World`` by ``ll`` with the whitespace-separate
+tokenize method, because the whitespace-separate tokenize method doesn't
+extract an ``ll`` token. It just extracts the ``Hello`` and ``World``
+tokens.
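+
+As a minimal sketch of this behavior with Groonga commands (the table
+and column names ``Memos``, ``content``, ``Terms``, and
+``memos_content`` are hypothetical, and it assumes a Groonga version
+that provides these commands)::
+
+  # Create a table that stores texts and a lexicon that indexes them
+  # with the bigram tokenizer.
+  table_create Memos TABLE_NO_KEY
+  column_create Memos content COLUMN_SCALAR ShortText
+  table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram
+  column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
+
+  # Store "Hello World" and search it by "ll".
+  load --table Memos
+  [{"content": "Hello World"}]
+  select Memos --match_columns content --query "ll"
+
+  # The record is found because TokenBigram extracts an "ll" token.
+  # With --default_tokenizer TokenDelimit the same query would find
+  # nothing.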
+
+In general, a tokenize method that generates small tokens increases
+recall but decreases precision. A tokenize method that generates large
+tokens increases precision but decreases recall.
+
+For example, we can find both ``Hello World`` and ``A or B`` by ``or``
+with the bigram tokenize method. ``Hello World`` is noise for people
+who want to search for the word ``or``. It means that precision is
+decreased, but recall is increased.
+
+We can find only ``A or B`` by ``or`` with the whitespace-separate
+tokenize method, because ``World`` is tokenized to the single token
+``World`` by that method. It means that precision is increased for
+people who want to search for the word ``or``, but recall is decreased
+because ``Hello World``, which contains ``or``, isn't found.
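+
+You can confirm this difference by tokenizing the same text with both
+tokenizers (same assumption as above)::
+
+  # "or" appears among the bigram tokens of "Hello World" ...
+  tokenize TokenBigram "Hello World"
+  # ... but not among its white-space-separated tokens:
+  tokenize TokenDelimit "Hello World"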
 
 Built-in tokenizers
 --------------------