且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

SQL Server,ISABOUT,加权术语

更新时间:2023-02-27 08:52:02

根据我的经验,我已经获得了权重加起来为1的***结果。 code> CONTAINSTABLE(documentParts,content,
'ISABOUT(
wordA wordB wordCweight(0.5),
wordA *NEARwordB *NEARwordC *权重(0.2),
wordA *权重(0.1),
wordB *权重(0.1),
wordC *权重(0.1)
)')


I am trying to figure out exactly how weighted terms work in a ISABOUT query in SQL SERVER. Here is where I currently am:

Each query returns the following rows:

QUERY 1 (weight 1): Initial ranking

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (1) ) ') ORDER BY RANK DESC, [KEY]

KEY     RANK
306342  249
272619  156
221557  114

QUERY 2 (weight 0.8): Ranking increases, initial order is preserved

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (0.8) ) ') ORDER BY RANK DESC, [KEY]

 KEY     RANK
 306342  321
 272619  201
 221557  146

QUERY 3 (weight 0.2): Ranking increases, initial order is preserved

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (0.2) ) ') ORDER BY RANK DESC, [KEY]

 KEY    RANK
 306342 998
 272619 877
 221557 692

QUERY 4 (weight 0.17): Ranking decreases, best match is now last, inverted behavior for these terms begin at 0.17

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (0.17) ) ') ORDER BY RANK DESC, [KEY]

 KEY      RANK
 272619   960
 221557   958
 306342   802

QUERY 5 (weight 0.16): Ranking increases, best match is now second

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (0.17) ) ') ORDER BY RANK DESC, [KEY]

 KEY      RANK
 272619   978
 306342   935
 221557   841

QUERY 6 (weight 0.01): Ranking decreases, best match is last again

SELECT * FROM CONTAINSTABLE(documentParts, title, 'ISABOUT ("e" weight (0.01) ) ') ORDER BY RANK DESC, [KEY]

 KEY    RANK
 221557 105
 272619 77
 306342 50

Best match for weight 1 has a rank of 249 and while weight goes down to 0.2 ranking of best match increases to 998. From 0.2 to 0.17 ranking decreases and from 0.16 results are inverted (the weight values that reproduce this behavior depend on terms and maybe on columns searched...)

It seems there is a point where weight means the opposite, something like "do not include this term".
Do you have any explanation of this behavior?
Why ranking increases when weight decreases?
Why ranking decreases after some point until results are inverted and how can you predict this point?

I use a custom "word-breaker", when user searches for something creating the following query:

CONTAINSTABLE(documentParts, title, 
      'ISABOUT (
          "wordA wordB wordC" weight (0.8), 
          "wordA*" NEAR "wordB*" NEAR "wordC*" weight (0.6), 
          "wordA*" weight (0.1), 
          "wordB*" weight (0.1), 
          "wordC*" weight (0.1), 
       ) ')

Am I to expect big ranks for for 0.1 words?

Is the following query the same as above and am I to expect some weird behavior with the 0.1 rankings?

CONTAINSTABLE(documentParts, title, '
      ISABOUT ( "wordA wordB wordC" weight (0.8) ), 
      OR ISABOUT ( "wordA*" NEAR "wordB*" NEAR "wordC*" weight (0.6) ), 
      OR ISABOUT ( "wordA*" weight (0.1) ), 
      OR ISABOUT ( "wordB*" weight (0.1) ), 
      OR ISABOUT ( "wordC*" weight (0.1) ), 
      ')

EDIT:
I found this topic: http://msdn.microsoft.com/en-us/library/ms142524(v=sql.105).aspx which answers some of my questions, but creates some new!

I am searching in two tables, "documents" and "documentParts" and use a union all to sum ranks and get my results. According to this article there it's wrong since indexed rows are counted to compute ranking so RANK will be like adding apples and carrots...

My solution for now is to compute a percentage for each CONTAINSTABLE like this:

Log(RANK) / Log(Sum(RANK) OVER( PARTITION BY 1)) AS [PERCENT]

and sum on this...

In my experience I have had the best results where the weights add up to 1.

CONTAINSTABLE(documentParts, content, 
          'ISABOUT (
              "wordA wordB wordC" weight (0.5), 
              "wordA*" NEAR "wordB*" NEAR "wordC*" weight (0.2), 
              "wordA*" weight (0.1), 
              "wordB*" weight (0.1), 
              "wordC*" weight (0.1) 
           ) ')