Analyzer and Tokenizer

2019. 4. 7. 00:33 · [Summary] Database / [NoSQL] ElasticSearch

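Why an analyzer is needed
Elasticsearch's default analyzer is the standard analyzer, which would index fox as the single term fox. Suppose instead that the field uses an edge n-gram analyzer.
When the value fox is stored, it is analyzed into f, fo, fox, and those terms are indexed.
Now suppose a user wants to find fox but cannot remember the first letter.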

GET localhost:9200/animal/_search

{
	"query": {
		"term": {
			"species": "ox"
		}
	}
}

This search returns only the ox; the fox is not found.
A search engine that stops there is hardly a search engine.
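
To see why the term query misses fox, here is a small Python sketch of what an edge n-gram analyzer would put into the index. The function name edge_ngrams and the two-document "index" are made up for illustration; this only imitates the idea and is not Elasticsearch code.

```python
def edge_ngrams(text, min_gram=1, max_gram=3):
    """Return the prefixes of text whose length is between min_gram and max_gram."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


# Terms that would be indexed for each stored value (hypothetical data).
indexed = {word: edge_ngrams(word) for word in ["fox", "ox"]}

# A term query compares the search string with the indexed terms as-is.
hits = [word for word, terms in indexed.items() if "ox" in terms]

print(indexed)
print(hits)
```

Only prefixes are indexed, so the term ox exists for the stored value ox but never for fox.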

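What we want is an analyzer that, when fox is stored, analyzes and indexes it as f, o, x, fo, ox, fox.
In other words, we need an n-gram analyzer (here with min 1, max 3).
An analyzer is built on top of a tokenizer, so the custom analyzer below is defined together with its n-gram tokenizer. Note that the example settings use min_gram 3 and max_gram 6; for a two-letter query such as ox to match, min_gram has to be 2 or lower.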
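
The effect of switching to n-grams can be sketched with a toy inverted index in Python. The helpers ngrams and build_index are hypothetical names for this example, and the real tokenizer has more options (such as token_chars) that are ignored here.

```python
from collections import defaultdict


def ngrams(text, min_gram=1, max_gram=3):
    """Return every substring of text with a length of min_gram..max_gram."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def build_index(values, analyze):
    """Map each term to the set of stored values that produced it."""
    index = defaultdict(set)
    for value in values:
        for term in analyze(value):
            index[term].add(value)
    return index


index = build_index(["fox", "ox"], ngrams)

print(ngrams("fox"))
print(sorted(index["ox"]))
```

With min 1 and max 3 the term ox is produced for both stored values, so a lookup for ox now finds fox as well.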

PUT localhost:9200/animal

{
	"settings": {
		"analysis": {
			"analyzer": {
				"ngram_analyzer_3_6": {
					"tokenizer": "ngram_tokenizer_3_6"
				}
			},
			"tokenizer": {
				"ngram_tokenizer_3_6": {
					"type": "nGram",
					"min_gram": 3,
					"max_gram": 6
				}
			}
		}
	},
	"mappings": {
		"_doc": {
			"properties": {
				"species": {
					"type": "text",
					"index": true,
					"analyzer": "ngram_analyzer_3_6"
				}
				...
			}
		}
	}
}

How to test an analyzer or tokenizer
Source: https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html (Elasticsearch Reference 6.7, "Testing analyzers")

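A custom analyzer is tested through the _analyze endpoint of the index that defines it. With a named analyzer, only analyzer and text are passed; filter is meant to be combined with an inline tokenizer.

POST localhost:9200/animal/_analyze

{
	"analyzer": "ngram_analyzer_3_6",
	"text": "Is this quick brown fox."
}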

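Tokenizer types
Data to store: Hello world! x3

Tokenizer	How it tokenizes	Indexed tokens
keyword	Keeps the whole input as a single token	Hello world! x3
whitespace	Splits on whitespace	Hello, world!, x3
standard	Splits on word boundaries and drops punctuation	Hello, world, x3
lowercase	Splits on non-letter characters and lowercases the result	hello, world, x
letter	Splits on non-letter characters, keeping letters only	Hello, world, x
nGram	Emits substrings between min and max length (whitespace included)	Too long to fit here, see below
uax_url_email	Like standard, but keeps URLs and email addresses as single tokens	test@daum.net
path_hierarchy	Emits one token per directory level of a path	e.g. for ~/Desktop/Project/Test: ~, ~/Desktop, ~/Desktop/Project, ~/Desktop/Project/Test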
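
The simpler tokenizers can be approximated in a few lines of Python to get a feel for the differences. These functions are rough stand-ins written for this post, not the Lucene implementations; in particular standard, uax_url_email and path_hierarchy are left out because their rules are more involved.

```python
import re

TEXT = "Hello world! x3"


def keyword(text):
    """The whole input becomes one token."""
    return [text]


def whitespace(text):
    """Split on runs of whitespace, keep everything else."""
    return text.split()


def letter(text):
    """Keep runs of letters; digits and punctuation act as separators."""
    # [^\W\d_] matches a word character that is neither a digit nor "_".
    return re.findall(r"[^\W\d_]+", text)


def lowercase(text):
    """Same splitting as letter, then lowercase every token."""
    return [token.lower() for token in letter(text)]


for tokenizer in (keyword, whitespace, letter, lowercase):
    print(tokenizer.__name__, tokenizer(TEXT))
```

Note that lowercase splits exactly like letter, so the digit in x3 is dropped there as well.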

nGram, min: 4, max: 5
4: "Hell", "ello", "llo ", "lo w", "o wo", " wor", "worl", "orld", "rld!", "ld! ", "d! x", "! x3"
5: "Hello", "ello ", "llo w", "lo wo", "o wor", " worl", "world", "orld!", "rld! ", "ld! x", "d! x3"
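
A list like this is easy to get wrong by hand, so here is a short Python sketch that regenerates it grouped by length. ngrams_by_length is a hypothetical helper for this post; as far as I know Elasticsearch itself returns the tokens ordered by start position rather than grouped like this.

```python
def ngrams_by_length(text, min_gram, max_gram):
    """Return {length: [all substrings of that length, left to right]}."""
    return {
        size: [text[i:i + size] for i in range(len(text) - size + 1)]
        for size in range(min_gram, max_gram + 1)
    }


grams = ngrams_by_length("Hello world! x3", 4, 5)

for size, tokens in grams.items():
    # Print the length, how many tokens it yields, and the tokens themselves.
    print(size, len(tokens), tokens)
```

The input is 15 characters long, so it yields 12 tokens of length 4 and 11 tokens of length 5.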
