On the SitePoint PHP blog a new tutorial has been posted showing you how to effectively search Chinese content with ElasticSearch. ElasticSearch is a "powerful open source search and analytics engine that makes data easy to explore" and plays nice with PHP via a JSON based query format.
If you have played with Elasticsearch, you already know that analyzing and tokenization are the most important steps while indexing content, and without them your pertinency is going to be bad, your users unhappy and your results poorly sorted. Even with English content you can lose pertinence with a bad stemming, miss some documents when not performing proper elision and so on. And that's worse if you are indexing another language; the default analyzers are not all-purpose. When dealing with Chinese documents, everything is even more complex, even by considering only Mandarin which is the official language in China and the most spoken worldwide.
He starts by explaining exactly what the problem is with searching Chinese content including the fact that some words can actually be a combination of two or more characters (words). He then lists out a few plugins and tools that can be integrated with ElasticSearch to help with analyzing the content. He goes through each of them and provides instructions on installation and usage. He ends the post with a sample of the results for a set of three search terms, comparing the matches each found.
Link: http://www.sitepoint.com/efficient-chinese-search-elasticsearch/
没有评论:
发表评论