Kazuhiko
kazuh****@fdiar*****
2014年 9月 26日 (金) 17:50:23 JST
こんにちは。 On 26/09/2014 10:15, Kouhei Sutou wrote: >>>> でも、「*S1"..."」演算子は、boolean modeでだけ使えて、その結果のスコアは >>>> natural language mode相当です、っていうのは分かりにくすぎる仕様だと感じ >>>> ます。その演算子のついていない部分については、出現数=スコアで、差があり >>>> すぎますし。 >>> >>> うーん、どういうユースケースを想定していますか? >> >> 実際のところ、'*S"word1 word2" word3' みたいなクエリは想定していません >> が、この仕様はどういうドキュメントになるんだろうな?と疑問に思ったのでした。 >> >> * natural language modeのクエリは、'*S"word1 word2 ..."' in boolean mode >> と等価です。 >> * その際のスコアは、通常のboolean検索とは異なり、... のように決定されます。 >> >> という感じでしょうか? でもやっぱり、'boolean' modeなのに巨大なスコアは >> どうよ? というのは少しひっかかりますが、仕方がないのかな。 > > 「boolean」ってスコアは小さめっていう意味合いもあったんでし > たっけ。私は、ANDとかORを使って検索できる、っていう意味なの > かなぁと思っていました。 boolean、つまり「あったらTrue、なかったらFalse」で、Trueの数を返すものだ と思っていました。少なくともMyISAMだとそうです。 DROP TABLE IF EXISTS `diaries2`; /*!40101 SET @saved_cs_client = @@character_set_client */; /*!40101 SET character_set_client = utf8 */; CREATE TABLE `diaries2` ( `id` int(10) unsigned NOT NULL, `content` text COLLATE utf8_unicode_ci, PRIMARY KEY (`id`), FULLTEXT KEY `content` (`content`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci; /*!40101 SET character_set_client = @saved_cs_client */; INSERT INTO `diaries2` VALUES (1,'It\'ll be fine tomorrow as well.'); INSERT INTO `diaries2` VALUES (2,'It\'ll rain tomorrow.'); INSERT INTO `diaries2` VALUES (3,'It\'s fine today. It\'ll be fine tomorrow as well.'); INSERT INTO `diaries2` VALUES (4,'It\'s fine today. But it\'ll rain tomorrow.'); INSERT INTO `diaries2` VALUES (5,'Ring the bell.'); INSERT INTO `diaries2` VALUES (6,'I love dumbbells.'); INSERT INTO `diaries2` VALUES (7,'dummy'); INSERT INTO `diaries2` VALUES (8,'dummy'); INSERT INTO `diaries2` VALUES (9,'dummy'); SELECT *, MATCH (content) AGAINST ('fine tomorrow bell' in boolean mode) AS score FROM diaries2; +----+--------------------------------------------------+-------+ | id | content | score | +----+--------------------------------------------------+-------+ | 1 | It'll be fine tomorrow as well. | 2 | | 2 | It'll rain tomorrow. | 1 | | 3 | It's fine today. It'll be fine tomorrow as well. | 2 | | 4 | It's fine today. But it'll rain tomorrow. | 2 | | 5 | Ring the bell. | 1 | | 6 | I love dumbbells. | 0 | | 7 | dummy | 0 | | 8 | dummy | 0 | | 9 | dummy | 0 | +----+--------------------------------------------------+-------+ あれ、id=3のスコアが、Mroonga in boolean modeと違いますね。Mroongaだと、 ここは 'fine' * 2 + 'tomorrow' で 3 ですね。 というわけで、boolean modeのスコアについては、 MyISAM : 出現したキーワードの数 Mroonga : 各キーワードが出現した数の合計 という違いがすでにありました。 >> (ヘヴィユーザのみなさんはboolean modeしか使っていないとか...) > > もしかしたら、なんですけど、かずひこさんが求めている挙動って > > "word1 word2 word3" IN BOOLEAN MODE > > (word1 OR word2 OR word3)だったりしないかなぁと思いました。 求めている挙動は、「適当にキーワードを多めにつっこんだら、いい感じのスコ アでソートされて返ってくる」です。 上述の仕様の違いで、Mroonga in boolean modeは、MyISAM in boolean modeよ りはましな結果になりそうですが、やっぱりキーワードごとの重みが考慮される ほうが「いい感じのスコアでソート」になりそうなので、(word1 OR word2 OR word3)だとあまり嬉しくありません。以下、比較実験しました。 * MyISAM で natural language mode SELECT *, MATCH (content) AGAINST ('fine tomorrow bell') AS score FROM diaries2 order by score desc; +----+--------------------------------------------------+---------------------+ | id | content | score | +----+--------------------------------------------------+---------------------+ | 5 | Ring the bell. | 2.0326898097991943 | | 3 | It's fine today. It'll be fine tomorrow as well. | 1.096758246421814 | | 1 | It'll be fine tomorrow as well. | 0.8956899046897888 | | 4 | It's fine today. But it'll rain tomorrow. | 0.8759949207305908 | | 2 | It'll rain tomorrow. | 0.21812663972377777 | | 6 | I love dumbbells. | 0 | | 7 | dummy | 0 | | 8 | dummy | 0 | | 9 | dummy | 0 | +----+--------------------------------------------------+---------------------+ * MyISAM で boolean mode SELECT *, MATCH (content) AGAINST ('fine tomorrow bell' in boolean mode) AS score FROM diaries2 order by score desc; +----+--------------------------------------------------+-------+ | id | content | score | +----+--------------------------------------------------+-------+ | 1 | It'll be fine tomorrow as well. | 2 | | 3 | It's fine today. It'll be fine tomorrow as well. | 2 | | 4 | It's fine today. But it'll rain tomorrow. | 2 | | 2 | It'll rain tomorrow. | 1 | | 5 | Ring the bell. | 1 | | 6 | I love dumbbells. | 0 | | 7 | dummy | 0 | | 8 | dummy | 0 | | 9 | dummy | 0 | +----+--------------------------------------------------+-------+ * Mroonga で類似文書検索 SELECT *, MATCH (content) AGAINST ('*S"fine tomorrow bell"' in boolean mode) AS score FROM diaries order by score desc; +----+--------------------------------------------------+---------+ | id | content | score | +----+--------------------------------------------------+---------+ | 5 | Ring the bell. | 1048577 | | 3 | It's fine today. It'll be fine tomorrow as well. | 328344 | | 1 | It'll be fine tomorrow as well. | 211835 | | 4 | It's fine today. But it'll rain tomorrow. | 211835 | | 2 | It'll rain tomorrow. | 95326 | | 6 | I love dumbbells. | 0 | +----+--------------------------------------------------+---------+ これだと、MyISAMのnatural language modeときと同じ順序です。 * Mroonga でboolean mode SELECT *, MATCH (content) AGAINST ('fine OR tomorrow OR bell' in boolean mode) AS score FROM diaries order by score desc; +----+--------------------------------------------------+-------+ | id | content | score | +----+--------------------------------------------------+-------+ | 3 | It's fine today. It'll be fine tomorrow as well. | 3 | | 1 | It'll be fine tomorrow as well. | 2 | | 4 | It's fine today. But it'll rain tomorrow. | 2 | | 2 | It'll rain tomorrow. | 1 | | 5 | Ring the bell. | 1 | | 6 | I love dumbbells. | 0 | +----+--------------------------------------------------+-------+ キーワードの重みに関係なく、出た回数だけで決まるので、こういう順序になり ます。 なお、mroongaの方のスキーマとデータは前回と同じく以下のとおり。 DROP TABLE IF EXISTS `diaries`; /*!40101 SET @saved_cs_client = @@character_set_client */; /*!40101 SET character_set_client = utf8 */; CREATE TABLE `diaries` ( `id` int(10) unsigned NOT NULL, `content` text COLLATE utf8_unicode_ci, PRIMARY KEY (`id`), FULLTEXT KEY `content` (`content`) COMMENT 'parser "TokenBigram"' ) ENGINE=Mroonga DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci; /*!40101 SET character_set_client = @saved_cs_client */; INSERT INTO `diaries` VALUES (1,'It\'ll be fine tomorrow as well.'); INSERT INTO `diaries` VALUES (2,'It\'ll rain tomorrow.'); INSERT INTO `diaries` VALUES (3,'It\'s fine today. It\'ll be fine tomorrow as well.'); INSERT INTO `diaries` VALUES (4,'It\'s fine today. But it\'ll rain tomorrow.'); INSERT INTO `diaries` VALUES (5,'Ring the bell.'); INSERT INTO `diaries` VALUES (6,'I love dumbbells.'); かずひこ