Friday 4:30 p.m.–5 p.m.

Elasticsearch (Part 1): Indexing and Querying

Erik Rose

Audience level: Novice
Category: Big Data

Description

Elasticsearch provides an easy path to clusterable full-text search, with synonyms, faceting, and geographic math, but there's a paucity of written wisdom beyond its API docs. This talk, part 1 of a 2-part series, surveys its capabilities and shows how its internal data structures and algorithms work. With the groundwork laid, we explore how to choose efficient indexing and the right queries to make your apps go fast.

Abstract

This talk focuses on what isn't in the documentation and doesn't assume you're already a Lucene expert. It reveals what goes on behind the scenes: the data structures used for indexing and the algorithms that make finding things so fast. From these fundamentals, you will be able to deduce how to make your own use cases efficient. I'll also warn you about the mistakes I made when getting started with ES.

We'll also touch on some of the available Python libraries for talking to ES and examine strategies for indexing your content, modeling it in your program, and keeping it up to date.

  • Intro
    • Elasticsearch is a one-man show.
      • Scope of project
      • Survey of capabilities: why would I use it?
      • Docs look good at first, but you quickly realize there's a lot missing.
    • Deep wisdom is all tied up in the author's head.
    • I've learned it the hard way so you don't have to.
  • Basic data structure
    • Document IDs
    • Type-guessing
    • Mappings
    • Arrays
      • How they're searched
    • Nesting and inter-document relationships
    • How inverted indices work
  • Querying
    • Filters vs. Queries
    • Filters are cached, so filter when you can.
    • Queries are more powerful: fuzzy stuff, scoring, etc.
    • Term vs. match and why this will save you days of pain
    • Text phrase queries
  • Analysis
    • 3 stages
      • Char filter
      • Tokenizer
      • Token filter
    • Parallels with DB indexing
    • What kinds of analysis are there?
      • All the standard stuff
      • Stopwords
      • Stemming
      • Ngrams
      • Various field types
      • Multi fields
    • Choosing appropriate analysis: what kinds speed which queries?
    • Common cases
    • Testing analyzers with the _analyze API
    • Building synonym mappings
      • What happens behind the scenes (reverse indices, expansion)
    • Multi-language support
    • Query analyzers (vs. index analyzers)
    • Shrinking your index
      • What's the point?
      • Is every part of your index equally hot?
      • Is your index bigger than RAM?
      • How's your I/O speed?
      • Compression
      • _source: to store or not to store?
      • _all: need it?
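
To make a few of the outline's ideas concrete, here is a toy sketch in plain Python of the three analysis stages feeding an inverted index, and of why an unanalyzed term lookup can miss what an analyzed match lookup finds. It is not how Elasticsearch or Lucene is implemented, and it does not use any ES client library. Every name in it (char_filter, tokenizer, token_filter, analyze, build_inverted_index, term_lookup, match_lookup), the stopword list, and the sample documents are assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "the", "and", "of", "to"}


def char_filter(text):
    # Stage 1: rewrite raw characters before tokenizing.
    return text.replace("&", " and ")


def tokenizer(text):
    # Stage 2: split the text into tokens (runs of letters and digits).
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filter(tokens):
    # Stage 3: normalize tokens (lowercase) and drop stopwords.
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]


def analyze(text):
    # Run the three stages in order.
    return token_filter(tokenizer(char_filter(text)))


def build_inverted_index(docs):
    # Map each analyzed token to the set of document IDs containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def term_lookup(index, term):
    # Like a term query: the input is looked up exactly as given.
    return sorted(index.get(term, set()))


def match_lookup(index, text):
    # Like a match query: the input is analyzed first, then any
    # document containing at least one resulting token is a hit.
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return sorted(hits)


docs = {
    1: "The Quick Brown Fox",
    2: "Quick & Dirty Hacks",
    3: "A lazy brown dog",
}
index = build_inverted_index(docs)

print(analyze("Quick & Dirty Hacks"))  # ['quick', 'dirty', 'hacks']
print(term_lookup(index, "Quick"))     # [] (the index only holds "quick")
print(match_lookup(index, "Quick"))    # [1, 2]
```

The empty result from term_lookup is the kind of surprise the "Term vs. match" item refers to: the indexed tokens were lowercased at index time, so an exact lookup for the capitalized word finds nothing, while the analyzed lookup normalizes the input the same way the index did.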