[groonga-dev,02816] Re: IN NATURAL LANGUAGE MODEにおける検索スコア

Back to archive index

Kazuhiko kazuh****@fdiar*****
2014年 9月 26日 (金) 17:50:23 JST


こんにちは。

On 26/09/2014 10:15, Kouhei Sutou wrote:
>>>> でも、「*S1"..."」演算子は、boolean modeでだけ使えて、その結果のスコアは
>>>> natural language mode相当です、っていうのは分かりにくすぎる仕様だと感じ
>>>> ます。その演算子のついていない部分については、出現数=スコアで、差があり
>>>> すぎますし。
>>>
>>> うーん、どういうユースケースを想定していますか?
>>
>> 実際のところ、'*S"word1 word2" word3' みたいなクエリは想定していません
>> が、この仕様はどういうドキュメントになるんだろうな?と疑問に思ったのでした。
>>
>> * natural language modeのクエリは、'*S"word1 word2 ..."' in boolean mode
>> と等価です。
>> * その際のスコアは、通常のboolean検索とは異なり、... のように決定されます。
>>
>> という感じでしょうか? でもやっぱり、'boolean' modeなのに巨大なスコアは
>> どうよ? というのは少しひっかかりますが、仕方がないのかな。
> 
> 「boolean」ってスコアは小さめっていう意味合いもあったんでし
> たっけ。私は、ANDとかORを使って検索できる、っていう意味なの
> かなぁと思っていました。

boolean、つまり「あったらTrue、なかったらFalse」で、Trueの数を返すものだ
と思っていました。少なくともMyISAMだとそうです。

DROP TABLE IF EXISTS `diaries2`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `diaries2` (
  `id` int(10) unsigned NOT NULL,
  `content` text COLLATE utf8_unicode_ci,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `content` (`content`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
/*!40101 SET character_set_client = @saved_cs_client */;
INSERT INTO `diaries2` VALUES (1,'It\'ll be fine tomorrow as well.');
INSERT INTO `diaries2` VALUES (2,'It\'ll rain tomorrow.');
INSERT INTO `diaries2` VALUES (3,'It\'s fine today. It\'ll be fine
tomorrow as well.');
INSERT INTO `diaries2` VALUES (4,'It\'s fine today. But it\'ll rain
tomorrow.');
INSERT INTO `diaries2` VALUES (5,'Ring the bell.');
INSERT INTO `diaries2` VALUES (6,'I love dumbbells.');
INSERT INTO `diaries2` VALUES (7,'dummy');
INSERT INTO `diaries2` VALUES (8,'dummy');
INSERT INTO `diaries2` VALUES (9,'dummy');

SELECT *, MATCH (content) AGAINST ('fine tomorrow bell' in boolean mode)
AS score FROM diaries2;
+----+--------------------------------------------------+-------+
| id | content                                          | score |
+----+--------------------------------------------------+-------+
|  1 | It'll be fine tomorrow as well.                  |     2 |
|  2 | It'll rain tomorrow.                             |     1 |
|  3 | It's fine today. It'll be fine tomorrow as well. |     2 |
|  4 | It's fine today. But it'll rain tomorrow.        |     2 |
|  5 | Ring the bell.                                   |     1 |
|  6 | I love dumbbells.                                |     0 |
|  7 | dummy                                            |     0 |
|  8 | dummy                                            |     0 |
|  9 | dummy                                            |     0 |
+----+--------------------------------------------------+-------+

あれ、id=3のスコアが、Mroonga in boolean modeと違いますね。Mroongaだと、
ここは 'fine' * 2 + 'tomorrow' で 3 ですね。

というわけで、boolean modeのスコアについては、

MyISAM : 出現したキーワードの数
Mroonga : 各キーワードが出現した数の合計

という違いがすでにありました。

>> (ヘヴィユーザのみなさんはboolean modeしか使っていないとか...)
> 
> もしかしたら、なんですけど、かずひこさんが求めている挙動って
> 
>   "word1 word2 word3" IN BOOLEAN MODE
> 
> (word1 OR word2 OR word3)だったりしないかなぁと思いました。

求めている挙動は、「適当にキーワードを多めにつっこんだら、いい感じのスコ
アでソートされて返ってくる」です。

上述の仕様の違いで、Mroonga in boolean modeは、MyISAM in boolean modeよ
りはましな結果になりそうですが、やっぱりキーワードごとの重みが考慮される
ほうが「いい感じのスコアでソート」になりそうなので、(word1 OR word2 OR
word3)だとあまり嬉しくありません。以下、比較実験しました。

* MyISAM で natural language mode

SELECT *, MATCH (content) AGAINST ('fine tomorrow bell') AS score FROM
diaries2 order by score desc;
+----+--------------------------------------------------+---------------------+
| id | content                                          | score
      |
+----+--------------------------------------------------+---------------------+
|  5 | Ring the bell.                                   |
2.0326898097991943 |
|  3 | It's fine today. It'll be fine tomorrow as well. |
1.096758246421814 |
|  1 | It'll be fine tomorrow as well.                  |
0.8956899046897888 |
|  4 | It's fine today. But it'll rain tomorrow.        |
0.8759949207305908 |
|  2 | It'll rain tomorrow.                             |
0.21812663972377777 |
|  6 | I love dumbbells.                                |
    0 |
|  7 | dummy                                            |
    0 |
|  8 | dummy                                            |
    0 |
|  9 | dummy                                            |
    0 |
+----+--------------------------------------------------+---------------------+

* MyISAM で boolean mode

SELECT *, MATCH (content) AGAINST ('fine tomorrow bell' in boolean mode)
AS score FROM diaries2 order by score desc;
+----+--------------------------------------------------+-------+
| id | content                                          | score |
+----+--------------------------------------------------+-------+
|  1 | It'll be fine tomorrow as well.                  |     2 |
|  3 | It's fine today. It'll be fine tomorrow as well. |     2 |
|  4 | It's fine today. But it'll rain tomorrow.        |     2 |
|  2 | It'll rain tomorrow.                             |     1 |
|  5 | Ring the bell.                                   |     1 |
|  6 | I love dumbbells.                                |     0 |
|  7 | dummy                                            |     0 |
|  8 | dummy                                            |     0 |
|  9 | dummy                                            |     0 |
+----+--------------------------------------------------+-------+

* Mroonga で類似文書検索

SELECT *, MATCH (content) AGAINST ('*S"fine tomorrow bell"' in boolean
mode) AS score FROM diaries order by score desc;
+----+--------------------------------------------------+---------+
| id | content                                          | score   |
+----+--------------------------------------------------+---------+
|  5 | Ring the bell.                                   | 1048577 |
|  3 | It's fine today. It'll be fine tomorrow as well. |  328344 |
|  1 | It'll be fine tomorrow as well.                  |  211835 |
|  4 | It's fine today. But it'll rain tomorrow.        |  211835 |
|  2 | It'll rain tomorrow.                             |   95326 |
|  6 | I love dumbbells.                                |       0 |
+----+--------------------------------------------------+---------+

これだと、MyISAMのnatural language modeときと同じ順序です。

* Mroonga でboolean mode

SELECT *, MATCH (content) AGAINST ('fine OR tomorrow OR bell' in boolean
mode) AS score FROM diaries order by score desc;
+----+--------------------------------------------------+-------+
| id | content                                          | score |
+----+--------------------------------------------------+-------+
|  3 | It's fine today. It'll be fine tomorrow as well. |     3 |
|  1 | It'll be fine tomorrow as well.                  |     2 |
|  4 | It's fine today. But it'll rain tomorrow.        |     2 |
|  2 | It'll rain tomorrow.                             |     1 |
|  5 | Ring the bell.                                   |     1 |
|  6 | I love dumbbells.                                |     0 |
+----+--------------------------------------------------+-------+

キーワードの重みに関係なく、出た回数だけで決まるので、こういう順序になり
ます。

なお、mroongaの方のスキーマとデータは前回と同じく以下のとおり。

DROP TABLE IF EXISTS `diaries`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `diaries` (
  `id` int(10) unsigned NOT NULL,
  `content` text COLLATE utf8_unicode_ci,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `content` (`content`) COMMENT 'parser "TokenBigram"'
) ENGINE=Mroonga DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
/*!40101 SET character_set_client = @saved_cs_client */;
INSERT INTO `diaries` VALUES (1,'It\'ll be fine tomorrow as well.');
INSERT INTO `diaries` VALUES (2,'It\'ll rain tomorrow.');
INSERT INTO `diaries` VALUES (3,'It\'s fine today. It\'ll be fine
tomorrow as well.');
INSERT INTO `diaries` VALUES (4,'It\'s fine today. But it\'ll rain
tomorrow.');
INSERT INTO `diaries` VALUES (5,'Ring the bell.');
INSERT INTO `diaries` VALUES (6,'I love dumbbells.');

かずひこ




groonga-dev メーリングリストの案内
Back to archive index