[Elasticsearch] Prefix and infix real-time search for large data volumes
Source: cnblogs | Author: 雨叶微枫 | Date: 2023/7/21

Overview

A common requirement in business development: as the user types into a search box, matching results should be shown in real time. The two usual ways to implement this scenario are the Completion Suggester and the search_as_you_type field type. So what is the difference between them? Let's take a look.

Environment:

Data volume: 90M+ documents
ES version: 7.10.1
Script execution tool: Kibana

Differences between Completion Suggester and search_as_you_type

1. Completion Suggester only does prefix matching, and its data structure is held in memory, so it is extremely fast; the downside is memory consumption.
2. search_as_you_type supports both prefix and infix matching and can also be very fast, but you have to choose the right query type.
3. The API calls differ: Completion Suggester is queried through the suggest API, while search_as_you_type is queried like any regular search.

Examples

Implementing prefix matching

Use the Completion Suggester. Example:

1. Create the index

    PUT /es_demo
    {
      "mappings": {
        "properties": {
          "title_comp": {
            "type": "completion",
            "analyzer": "standard"
          }
        }
      }
    }
2. Initialize the data

    POST _bulk
    {"index":{"_index":"es_demo","_id":"1"}}
    {"title_comp": "愤怒的小鸟"}
    {"index":{"_index":"es_demo","_id":"2"}}
    {"title_comp": "最后一只渡渡鸟"}
    {"index":{"_index":"es_demo","_id":"3"}}
    {"title_comp": "今天不加班啊"}
    {"index":{"_index":"es_demo","_id":"4"}}
    {"title_comp": "愤怒的青年"}
    {"index":{"_index":"es_demo","_id":"5"}}
    {"title_comp": "最后一只996程序猿"}
    {"index":{"_index":"es_demo","_id":"6"}}
    {"title_comp": "今日无事,勾栏听曲"}
3. Query DSL

    Use a prefix query to find strings that start with "愤怒":

    GET /es_demo/_search
    {
      "suggest": {
        "title_suggest": {
          "prefix": "愤怒",
          "completion": {
            "field": "title_comp"
          }
        }
      }
    }
4. Query code demo

    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.search.builder.SearchSourceBuilder;
    import org.elasticsearch.search.suggest.SuggestBuilder;
    import org.elasticsearch.search.suggest.SuggestBuilders;
    import org.elasticsearch.search.suggest.completion.CompletionSuggestion;
    import org.elasticsearch.search.suggest.completion.CompletionSuggestionBuilder;
    import org.junit.jupiter.api.Test;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.context.SpringBootTest;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;

    @SpringBootTest
    public class SuggestTest {

        @Autowired
        private RestHighLevelClient restHighLevelClient;

        @Test
        public void testComp() {
            List<Map<String, Object>> list = suggestComplete("愤怒");
            list.forEach(m -> System.out.println("[" + m.get("title_comp") + "]"));
        }

        public List<Map<String, Object>> suggestComplete(String keyword) {
            CompletionSuggestionBuilder completionSuggestionBuilder = SuggestBuilders.completionSuggestion("title_comp");
            completionSuggestionBuilder.size(5)
                    // skip duplicate suggestions
                    .skipDuplicates(true);
            SuggestBuilder suggestBuilder = new SuggestBuilder();
            suggestBuilder.addSuggestion("suggest_title", completionSuggestionBuilder)
                    .setGlobalText(keyword);
            SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
            searchSourceBuilder.suggest(suggestBuilder);
            SearchRequest searchRequest = new SearchRequest("es_demo").source(searchSourceBuilder);
            try {
                SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
                CompletionSuggestion completionSuggestion = response.getSuggest().getSuggestion("suggest_title");
                List<Map<String, Object>> suggestList = new LinkedList<>();
                for (CompletionSuggestion.Entry.Option option : completionSuggestion.getOptions()) {
                    Map<String, Object> map = new HashMap<>();
                    map.put("title_comp", option.getHit().getSourceAsMap().get("title_comp"));
                    suggestList.add(map);
                }
                return suggestList;
            } catch (IOException e) {
                throw new RuntimeException("ES query failed", e);
            }
        }
    }

Query results:

    [愤怒的小鸟]
    [愤怒的青年]
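
For reference, the size and skip-duplicates settings used in the Java builder above are also available as options of the completion suggester DSL; a minimal sketch (same request as step 3, with the two options added):

    GET /es_demo/_search
    {
      "suggest": {
        "title_suggest": {
          "prefix": "愤怒",
          "completion": {
            "field": "title_comp",
            "size": 5,
            "skip_duplicates": true
          }
        }
      }
    }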

Implementing infix matching

Use search_as_you_type. The example below defines two sub-fields, one with the hanlp_index analyzer and one with the standard analyzer:

1. Create the index

    PUT /es_search_as_you_type
    {
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "fields": {
              "han": {
                "type": "search_as_you_type",
                "analyzer": "hanlp_index"
              },
              "stan": {
                "type": "search_as_you_type",
                "analyzer": "standard"
              }
            }
          }
        }
      }
    }
2. Initialize the data

    POST _bulk
    {"index":{"_index":"es_search_as_you_type","_id":"1"}}
    {"title": "愤怒的小鸟"}
    {"index":{"_index":"es_search_as_you_type","_id":"2"}}
    {"title": "最后一只渡渡鸟"}
    {"index":{"_index":"es_search_as_you_type","_id":"3"}}
    {"title": "今天不加班啊"}
    {"index":{"_index":"es_search_as_you_type","_id":"4"}}
    {"title": "愤怒的青年"}
    {"index":{"_index":"es_search_as_you_type","_id":"5"}}
    {"title": "最后一只996程序猿"}
    {"index":{"_index":"es_search_as_you_type","_id":"6"}}
    {"title": "今日无事,勾栏听曲"}
3. Query DSL

    GET /es_search_as_you_type/_search
    {
      "query": {
        "match": {
          "title.stan": {
            "query": "的小",
            "operator": "and"
          }
        }
      }
    }
4. Query code demo

    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.MatchQueryBuilder;
    import org.elasticsearch.index.query.Operator;
    import org.elasticsearch.search.SearchHit;
    import org.elasticsearch.search.SearchHits;
    import org.elasticsearch.search.builder.SearchSourceBuilder;
    import org.elasticsearch.search.sort.SortBuilders;
    import org.junit.jupiter.api.Test;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.context.SpringBootTest;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;

    import static org.elasticsearch.index.query.QueryBuilders.matchQuery;

    @SpringBootTest
    public class SuggestTest {

        @Autowired
        private RestHighLevelClient restHighLevelClient;

        @Test
        public void testSearchAsYouType() {
            List<Map<String, Object>> list = suggestSearchAsYouType("的小");
            list.forEach(m -> System.out.println("[" + m.get("title") + "]"));
        }

        public List<Map<String, Object>> suggestSearchAsYouType(String keyword) {
            // query the 2-gram sub-field generated by search_as_you_type; adjust to your own needs
            MatchQueryBuilder matchQueryBuilder = matchQuery("title.stan._2gram", keyword).operator(Operator.AND);
            // fields to return
            String[] includeFields = new String[]{"title"};
            SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder()
                    .query(matchQueryBuilder).size(5)
                    .fetchSource(includeFields, null)
                    .trackTotalHits(false)
                    .trackScores(true)
                    .sort(SortBuilders.scoreSort());
            SearchRequest searchRequest = new SearchRequest("es_search_as_you_type").source(searchSourceBuilder);
            try {
                SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
                SearchHits hits = response.getHits();
                List<Map<String, Object>> suggestList = new LinkedList<>();
                for (SearchHit hit : hits) {
                    Map<String, Object> map = new HashMap<>();
                    map.put("title", hit.getSourceAsMap().get("title").toString());
                    suggestList.add(map);
                }
                return suggestList;
            } catch (IOException e) {
                throw new RuntimeException("ES query failed", e);
            }
        }
    }

Query results:

    [愤怒的小鸟]
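
The demo above queries the generated title.stan._2gram sub-field directly with a match query. The official search_as_you_type documentation (linked at the end of this article) also shows querying the base field together with its generated sub-fields using a multi_match of type bool_prefix; a minimal sketch against the index above:

    GET /es_search_as_you_type/_search
    {
      "query": {
        "multi_match": {
          "query": "的小",
          "type": "bool_prefix",
          "fields": [
            "title.stan",
            "title.stan._2gram",
            "title.stan._3gram"
          ]
        }
      }
    }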

Analyzer notes

How to inspect tokenization results

Method 1

Specify an analyzer directly

    GET _analyze
    {
      "analyzer": "standard",
      "text": [
        "愤怒的小鸟"
      ]
    }

Method 2

Use the analyzer of a specific field

    POST es_search_as_you_type/_analyze
    {
      "field": "title.stan",
      "text": [
        "愤怒的青年"
      ]
    }
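
The same field-based form should also work against the sub-fields that search_as_you_type generates, which is a handy way to see the shingle tokens the _2gram field indexes; a sketch, assuming the index created earlier:

    POST es_search_as_you_type/_analyze
    {
      "field": "title.stan._2gram",
      "text": [
        "愤怒的小鸟"
      ]
    }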

Differences between the hanlp_index and standard analyzers

The standard analyzer

  • Filters out punctuation and symbols by default
  • Treats each Chinese character as its own token; English is split into words on whitespace, punctuation, or adjacent Chinese characters

Example:

    GET _analyze
    {
      "analyzer": "standard",
      "text": [
        "愤怒的小鸟"
      ]
    }

Tokenization result:

    {
      "tokens" : [
        {
          "token" : "愤",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<IDEOGRAPHIC>",
          "position" : 0
        },
        {
          "token" : "怒",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "<IDEOGRAPHIC>",
          "position" : 1
        },
        {
          "token" : "的",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "<IDEOGRAPHIC>",
          "position" : 2
        },
        {
          "token" : "小",
          "start_offset" : 3,
          "end_offset" : 4,
          "type" : "<IDEOGRAPHIC>",
          "position" : 3
        },
        {
          "token" : "鸟",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "<IDEOGRAPHIC>",
          "position" : 4
        }
      ]
    }

The hanlp_index analyzer

  • Does not filter out symbols by default
  • Segments text into words based on semantics, producing multi-character terms

Example:

    GET _analyze
    {
      "analyzer": "hanlp_index",
      "text": [
        "愤怒的小鸟"
      ]
    }

Tokenization result:

    {
      "tokens" : [
        {
          "token" : "愤怒",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "a",
          "position" : 0
        },
        {
          "token" : "的",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "ude1",
          "position" : 1
        },
        {
          "token" : "小鸟",
          "start_offset" : 3,
          "end_offset" : 5,
          "type" : "n",
          "position" : 2
        }
      ]
    }

Query performance in production

Queries generally finish within a few hundred milliseconds. Note: if your documents have many fields, return only the few fields you actually need; otherwise transferring the data takes up a noticeable share of the time.
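
The Java demo above already does this with fetchSource; in the DSL the same thing is done with source filtering. A minimal sketch against the example index, returning only the title field:

    GET /es_search_as_you_type/_search
    {
      "_source": ["title"],
      "query": {
        "match": {
          "title.stan._2gram": {
            "query": "的小",
            "operator": "and"
          }
        }
      }
    }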

Summary

Of course, both Completion Suggester and search_as_you_type offer many more configuration and query options: the Context Suggester for Completion Suggester, the 2gram and 3gram sub-fields of search_as_you_type, and query types such as match_bool_prefix, match_phrase and match_phrase_prefix. Different combinations produce different behavior; what is listed here is just one approach that works reasonably well. How the other query types and settings are used, and how they work internally, is a topic for another time.
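
As a starting point for experimenting with those query types, a minimal sketch of match_bool_prefix against the example index (tuning and behavior are out of scope here):

    GET /es_search_as_you_type/_search
    {
      "query": {
        "match_bool_prefix": {
          "title.stan": "的小"
        }
      }
    }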

Official documentation

https://www.elastic.co/guide/en/elasticsearch/reference/7.10/search-as-you-type.html

Original article: https://www.cnblogs.com/yywf/p/17541137.html
