Lucene query syntax [Chinese-English parallel text]

clq

Posted 2006-07-01 19:37:10

http://lucene.apache.org/java/docs/queryparsersyntax.html
--------------------------------------------------
Overview

Although Lucene provides the ability to create your own queries through its API, it also provides a rich query language through the Query Parser, a lexer which interprets a string into a Lucene Query using JavaCC.

This page provides the Query Parser syntax in Lucene 1.9. If you are using a different version of Lucene, please consult the copy of docs/queryparsersyntax.html that was distributed with the version you are using.

Before choosing to use the provided Query Parser, please consider the following:
If you are programmatically generating a query string and then parsing it with the query parser then you should seriously consider building your queries directly with the query API. In other words, the query parser is designed for human-entered text, not for program-generated text.
Untokenized fields are best added directly to queries, and not through the query parser. If a field's values are generated programmatically by the application, then so should query clauses for this field. An analyzer, which the query parser uses, is designed to convert human-entered text to terms. Program-generated values, like dates, keywords, etc., should be consistently program-generated.
In a query form, fields which are general text should use the query parser. All others, such as date ranges, keywords, etc. are better added directly through the query API. A field with a limited set of values that can be specified with a pull-down menu should not be added to a query string which is subsequently parsed, but rather added as a TermQuery clause.

Terms

A query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases.

A Single Term is a single word such as "test" or "hello".

A Phrase is a group of words surrounded by double quotes such as "hello dolly".

Multiple terms can be combined together with Boolean operators to form a more complex query (see below).

Note: The analyzer used to create the index will be used on the terms and phrases in the query string. So it is important to choose an analyzer that will not interfere with the terms used in the query string.

Fields

Lucene supports fielded data. When performing a search you can either specify a field, or use the default field. The field names and default field are implementation specific.

You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.

As an example, let's assume a Lucene index contains two fields, title and text, and text is the default field. If you want to find the document entitled "The Right Way" which contains the text "don't go this way", you can enter:
title:"The Right Way" AND text:go

or
title:"The Right Way" AND go

Since text is the default field, the field indicator is not required.

Note: The field is only valid for the term that it directly precedes, so the query
title:Do it right

Will only find "Do" in the title field. It will find "it" and "right" in the default field (in this case the text field).
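
The scoping rule in the note above can be illustrated with a rough Python sketch. The helper assign_fields and its default_field parameter are hypothetical names made up for this illustration and are not part of Lucene. It only splits on whitespace and ignores phrases and operators, so treat it as a picture of the rule rather than of the real parser.

```python
def assign_fields(query, default_field="text"):
    """Pair each whitespace-separated term with the field it would be searched in.

    Hypothetical helper: a field prefix applies only to the single term
    it is attached to; every other term falls back to the default field.
    """
    pairs = []
    for token in query.split():
        field, sep, term = token.partition(":")
        if sep and field and term:
            # "title:Do" -> search "Do" in the title field
            pairs.append((field, term))
        else:
            # no field prefix -> default field
            pairs.append((default_field, token))
    return pairs


print(assign_fields("title:Do it right"))
# [('title', 'Do'), ('text', 'it'), ('text', 'right')]
```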

Term Modifiers

Lucene supports modifying query terms to provide a wide range of searching options.

Wildcard Searches

Lucene supports single and multiple character wildcard searches.

To perform a single character wildcard search use the "?" symbol.

To perform a multiple character wildcard search use the "*" symbol.

The single character wildcard search looks for terms that match the pattern with the "?" replaced by any single character. For example, to search for "text" or "test" you can use the search:
te?t

A multiple character wildcard search looks for 0 or more characters. For example, to search for test, tests or tester, you can use the search:
test*
test*

You can also use the wildcard searches in the middle of a term.
te*t

Note: You cannot use a * or ? symbol as the first character of a search.
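
One way to picture the wildcard rules is to translate a pattern into a regular expression. The sketch below is in Python and uses only the standard re module. The function names wildcard_to_regex and matches are hypothetical, and this is not how Lucene evaluates wildcards internally; it only mirrors the behaviour described above.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard term into a compiled regular expression.

    "?" stands for exactly one character, "*" for zero or more characters.
    A leading wildcard is rejected, mirroring the note above.
    """
    if pattern[:1] in ("*", "?"):
        raise ValueError("a wildcard cannot be the first character of a term")
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    return re.compile("".join(parts))


def matches(pattern, term):
    """Return True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None


print(matches("te?t", "text"), matches("te?t", "test"))  # True True
print([t for t in ("test", "tests", "tester", "toast") if matches("test*", t)])
# ['test', 'tests', 'tester']
```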

Fuzzy Searches

Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar in spelling to "roam" use the fuzzy search:
roam~

This search will find terms like foam and roams.

Starting with Lucene 1.9 an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; the closer the value is to 1, the higher the similarity a term needs in order to match. For example:
roam~0.8

The default that is used if the parameter is not given is 0.5.
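
To make the idea of edit distance concrete, here is a small Python sketch of the Levenshtein distance together with a similarity score between 0 and 1. The normalisation used (one minus the distance divided by the length of the shorter word) and the strict comparison against the threshold are assumptions chosen for illustration; the text above does not say exactly how Lucene derives its similarity value. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the shorter word."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 1.0 if a == b else 0.0
    return 1.0 - levenshtein(a, b) / shortest


def fuzzy_match(term, candidate, min_similarity=0.5):
    """True if the candidate is similar enough to the term (assumed rule)."""
    return similarity(term, candidate) > min_similarity


for word in ("foam", "roams"):
    print(word, levenshtein("roam", word), similarity("roam", word))
# foam 1 0.75
# roams 1 0.75
```

Under these assumptions both words pass the default threshold of 0.5, while neither would pass a threshold of 0.8.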

Proximity Searches

Lucene supports finding words that are within a specific distance of each other. To do a proximity search use the tilde, "~", symbol at the end of a Phrase. For example, to search for "apache" and "jakarta" within 10 words of each other in a document use the search:
"jakarta apache"~10

Range Searches

Range Queries allow one to match documents whose field(s) values are between the lower and upper bound specified by the Range Query. Range Queries can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically.
mod_date:[20020101 TO 20030101]

This will find documents whose mod_date fields have values between 20020101 and 20030101, inclusive. Note that Range Queries are not reserved for date fields. You could also use range queries with non-date fields:
title:{Aida TO Carmen}

This will find all documents whose titles are between Aida and Carmen, but not including Aida and Carmen.

Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.
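
Because the comparison is lexicographic, range bounds behave like string comparisons rather than numeric ones. The following Python sketch shows the inclusive and exclusive cases. The function in_range is a hypothetical helper, and Python's own string ordering is used here as a stand-in for the ordering of terms in an index.

```python
def in_range(value, lower, upper, inclusive=True):
    """Lexicographic range test: [lower TO upper] when inclusive,
    {lower TO upper} when exclusive."""
    if inclusive:
        return lower <= value <= upper
    return lower < value < upper


# mod_date:[20020101 TO 20030101]
print(in_range("20020615", "20020101", "20030101"))            # True
print(in_range("20030101", "20020101", "20030101"))            # True, bounds included

# title:{Aida TO Carmen}
print(in_range("Boheme", "Aida", "Carmen", inclusive=False))   # True
print(in_range("Aida", "Aida", "Carmen", inclusive=False))     # False, bounds excluded

# Strings compare character by character, so "100" sorts between "10" and "20".
print(in_range("100", "10", "20"))                             # True
```

The last line is why values such as dates are usually stored in a fixed-width form like 20020101, so that string order and chronological order agree.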

Boosting a Term

Lucene provides the relevance level of matching documents based on the terms found. To boost a term use the caret, "^", symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.

Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching for
jakarta apache

and you want the term "jakarta" to be more relevant boost it using the ^ symbol along with the boost factor next to the term. You would type:
jakarta^4 apache

This will make documents with the term jakarta appear more relevant. You can also boost Phrase Terms as in the example:
"jakarta apache"^4 "Apache Lucene"

By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).

Boolean operators

Boolean operators allow terms to be combined through logic operators. Lucene supports AND, "+", OR, NOT and "-" as Boolean operators (Note: Boolean operators must be ALL CAPS).

OR

The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds a matching document if either of the terms exist in a document. This is equivalent to a union using sets. The symbol || can be used in place of the word OR.

To search for documents that contain either "jakarta apache" or just "jakarta" use the query:
"jakarta apache" jakarta

or
"jakarta apache" OR jakarta

AND

The AND operator matches documents where both terms exist anywhere in the text of a single document. This is equivalent to an intersection using sets. The symbol && can be used in place of the word AND.

To search for documents that contain "jakarta apache" and "Apache Lucene" use the query:
"jakarta apache" AND "Apache Lucene"

+

The "+" or required operator requires that the term after the "+" symbol exist somewhere in a field of a single document.

To search for documents that must contain "jakarta" and may contain "lucene" use the query:
+jakarta lucene

NOT

The NOT operator excludes documents that contain the term after NOT. This is equivalent to a difference using sets. The symbol ! can be used in place of the word NOT.

To search for documents that contain "jakarta apache" but not "Apache Lucene" use the query:
"jakarta apache" NOT "Apache Lucene"

Note: The NOT operator cannot be used with just one term. For example, the following search will return no results:
NOT "jakarta apache"

-

The "-" or prohibit operator excludes documents that contain the term after the "-" symbol.

To search for documents that contain "jakarta apache" but not "Apache Lucene" use the query:
"jakarta apache" -"Apache Lucene"

Grouping

Lucene supports using parentheses to group clauses to form sub queries. This can be very useful if you want to control the boolean logic for a query.

To search for either "jakarta" or "apache" and "website" use the query:
(jakarta OR apache) AND website

This eliminates any confusion and makes sure that website must exist and that at least one of the terms jakarta or apache exists.
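
The set analogy used for OR, AND and NOT, and the effect of grouping, can be sketched with Python sets. The four documents below are made-up data and containing is a hypothetical helper. A real index works on analyzed terms and scores its results, which this sketch ignores.

```python
# Toy corpus: document id -> text (made-up data for illustration).
DOCS = {
    1: "jakarta apache website",
    2: "apache lucene",
    3: "jakarta lucene website",
    4: "lucene website",
}


def containing(term):
    """Return the set of ids of documents whose text contains the term."""
    return {doc_id for doc_id, text in DOCS.items() if term in text.split()}


jakarta = containing("jakarta")   # {1, 3}
apache = containing("apache")     # {1, 2}
website = containing("website")   # {1, 3, 4}

either = jakarta | apache                # jakarta OR apache      -> union
both = jakarta & apache                  # jakarta AND apache     -> intersection
without = jakarta - apache               # jakarta NOT apache     -> difference
grouped = (jakarta | apache) & website   # (jakarta OR apache) AND website

print(sorted(either), sorted(both), sorted(without), sorted(grouped))
# [1, 2, 3] [1] [3] [1, 3]
```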

Field Grouping

Lucene supports using parentheses to group multiple clauses to a single field.

To search for a title that contains both the word "return" and the phrase "pink panther" use the query:
title:(+return +"pink panther")

Escaping Special Characters

Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape these characters use the \ before the character. For example, to search for (1+1):2 use the query:
\(1\+1\)\:2
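
When user input may contain these characters, the escaping rule can be applied mechanically. Below is a minimal Python sketch. The name escape_query is hypothetical, and as a simplifying assumption it escapes every single & and | as well, although the list above only names the pairs && and ||.

```python
# Characters taken from the list above; "&" and "|" are escaped individually
# (a simplification, since only "&&" and "||" are listed as special).
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\')


def escape_query(text):
    """Put a backslash in front of every special character in text."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print(escape_query("(1+1):2"))
# \(1\+1\)\:2
```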

clq

Posted 2006-7-1 19:38:09

(Translated by Li Yu, from the Lucene help documentation)
------------------------------------------------------------------------------
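Overview

Lucene provides an API for building your own queries, and also offers a powerful query language through the QueryParser.

This document describes the syntax supported by Lucene's query parser, a lexer generated with the JavaCC tool that parses a query string into a Lucene Query object.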

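Terms

A query is broken up into terms and operators. There are two types of terms: single terms and phrases.

A single term is a single word such as "test" or "hello".

A phrase is a group of words surrounded by double quotes, such as "hello dolly".

Multiple terms can be combined with Boolean operators to form a more complex query (as shown further below).

Note: The analyzer used to build the index is also applied to the terms and phrases in the query string, so it is important to choose an analyzer that will not interfere with the terms used in the query.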

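Fields

Lucene supports fielded data. You can search within a specific field or simply use the default field. The field names and the default field are determined by the particular indexer implementation.

To search a field, type the field name followed by a colon ":" and then the term you are looking for.

As an example, assume a Lucene index contains two fields, title and text, and text is the default field. If you want to find the article titled "The Right Way" that contains the text "don't go this way", you can enter:

title:"The Right Way" AND text:go

or

title:"The Right Way" AND go

Since text is the default field, the field name can be omitted.

Note: A field name applies only to the term that directly follows it, so in the query

title:Do it right

only "Do" is searched in the title field. "it" and "right" are still searched in the default field (here, the text field).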

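Term Modifiers

Lucene supports term modifiers to provide a wider range of search options.

Wildcard Searches

Lucene supports single and multiple character wildcard searches.

Use the "?" symbol to match any single character.

Use the "*" symbol to match multiple characters.

The single character wildcard matches any one character in that position. For example, to search for "text" or "test" you can use:

te?t

The multiple character wildcard matches 0 or more characters. For example, to search for test, tests or tester, you can use:

test*

You can also use the multiple character wildcard in the middle of a term.

te*t

Note: You cannot use a * or ? symbol at the start of a search term.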

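Fuzzy Searches

Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance, algorithm. To do a fuzzy search, append the tilde "~" to the end of a single term. For example, to search for terms similar in spelling to "roam" write:

roam~

This search will find words like foam and roams.

Note: Results found by a fuzzy search automatically get a boost factor of 0.2.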

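Proximity Searches

Lucene also supports finding words that are within a specific distance of each other. To do a proximity search, append the tilde "~" to the end of a phrase. For example, to search a document for "apache" and "jakarta" within 10 words of each other, write: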

"jakarta apache"~10

Boosting a Term

Lucene provides the relevance level of matching documents based on the terms found. To boost a term use the caret, "^", symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.

Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching for jakarta apache and you want the term "jakarta" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type:

jakarta^4 apache

This will make documents with the term jakarta appear more relevant. You can also boost phrase terms, as in the example:

"jakarta apache"^4 "jakarta lucene"

By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).

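Boolean operators

Boolean operators allow terms to be combined through logical operations. Lucene supports the operators AND, "+", OR, NOT and "-". (Note: Boolean operators must be ALL CAPS.)

OR

The OR operator is the default conjunction operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator links two terms and finds documents that contain either of them. This is equivalent to a set union. The symbol || can be used in place of the word OR.

To search for documents that contain either "jakarta apache" or "jakarta", use the query:

"jakarta apache" jakarta

or

"jakarta apache" OR jakarta

AND

The AND operator matches documents in which both terms occur. This is equivalent to a set intersection. The symbol && can be used in place of the word AND.

To search for documents that contain both "jakarta apache" and "jakarta lucene", use the query:

"jakarta apache" AND "jakarta lucene"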

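+

The "+" operator, also called the required operator, requires that the term after the "+" symbol exist in the corresponding field of a document.

To search for documents that must contain "jakarta" and may contain "lucene", use the query:

+jakarta lucene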

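NOT

The NOT operator excludes documents that contain the term after NOT. This is equivalent to a set difference. The symbol ! can be used in place of the word NOT.

To search for documents that contain "jakarta apache" but not "jakarta lucene", use the query:

"jakarta apache" NOT "jakarta lucene"

Note: The NOT operator cannot be used with just one term. For example, the following query returns no results:

NOT "jakarta apache"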

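-

The "-" operator, or prohibit operator, excludes documents that contain the term after the "-" symbol.

To search for documents that contain "jakarta apache" but not "jakarta lucene", use the query:

"jakarta apache" -"jakarta lucene"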

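Grouping

Lucene supports using parentheses to group clauses into sub-queries. This is very useful if you want to control the Boolean logic of a query.

To search for documents that contain either "jakarta" or "apache", and also contain "website", use the query:

(jakarta OR apache) AND website

This removes any ambiguity and ensures that website must be present along with at least one of jakarta and apache.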

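Escaping Special Characters

Lucene supports escaping the special characters that are part of the query syntax. The current special characters are:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape a special character, put a \ before it. For example, to search for (1+1):2 use the query:

\(1\+1\)\:2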
