Detail from da Vinci’s Codex Atlanticus, showing an exploded view of a hoist. Adapted from a public domain image on Wikimedia Commons.

Hi! I’m Trey, and I’m a computational linguist* on the Wikimedia Search Platform team. I like to say that my role on the Search team is to improve language processing for search, especially for languages other than English. It’s not the only thing I do, but it’s my favorite, for sure.

I want to tell you about the “language analyzer unpacking” project that I’ve been working on over the last couple of years, to directly improve search in a few dozen languages, and as part of a larger effort to improve and “harmonize” search across all the languages we support. Along the way, I’ve discovered some fun facts about various languages, and uncovered some bothersome bugs in their analyzers. Come and join me for an overview of the project, and take the opportunity to appreciate language in its near-infinite variety!

Prelude: Language Analysis

Language analysis is a series of steps to prepare text, like Wikipedia articles, to be indexed by a search engine. It can include general text processing or language-specific processing, and either can be fairly simple or quite complex. Queries from searchers are similarly processed, so that text from a query can be compared to the text in the search index.

The on-wiki search for Wikipedia, Wiktionary,‡ and the other language-specific projects is provided by CirrusSearch, which is a MediaWiki extension currently built on top of the Elasticsearch search engine, which is in turn built on the Apache Lucene search library.
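
To make the idea of language analysis as "a series of steps" a little more concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not how CirrusSearch, Elasticsearch, or Lucene actually implement analysis: the stop word list, the plural-stripping rule, and the function names `analyze`, `build_index`, and `search` are all made up for this example. What it does show is the key point that documents and queries go through the same processing, so their terms can be compared.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stop word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def analyze(text):
    """Turn raw text into a list of normalized index terms."""
    # General processing: lowercase, then split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    terms = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        # Language-specific processing (a toy English rule):
        # strip a final plural "s", but leave words ending in "ss" alone.
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        terms.append(token)
    return terms


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every analyzed query term."""
    terms = analyze(query)  # queries get the same analysis as documents
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


docs = {
    1: "The hoists of Leonardo da Vinci",
    2: "Search engines and language analysis",
}
index = build_index(docs)
print(search(index, "Hoist"))          # prints {1}
print(search(index, "ENGINE search"))  # prints {2}
```

Because the query "Hoist" and the document text "hoists" are both reduced to the term `hoist`, they match even though the original strings differ. Note that the toy rule also turns "analysis" into `analysi`; that is harmless here because index terms only need to be consistent between documents and queries, not to be real words.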