DjangoCon 2015

Advanced Use of Elasticsearch for Django Applications

Honza Kral

Transcript

Excerpt from the automatic transcript of the video generated by YouTube.

So, hello and welcome, and thank you for having me here. The talk is "Beyond the Basics with Elasticsearch", and it is essentially about all the use cases where you can use Elasticsearch but where it might not be immediately obvious that that is something you can do.

So we'll work through several of those and see how and why it is suited for that particular scenario. But before we go beyond the basics, we need to talk about what we are going beyond. So what is the base functionality of Elasticsearch,

and how come we can do all these other things? Where is it coming from? Well, it's all coming from search. Search, especially full-text search, is the primary function of Elasticsearch, and search is not a new problem. It's been around

for a while, and it hasn't actually changed much. The first index over a book, over some text, was created in 1230, and we still use the same data structures to this day. Of course there have been plenty of improvements, but the underlying

infrastructure, the inverted index, remains the same. It is the index that you're familiar with if you've ever read any book, which I sincerely hope you have. And this is how it looks: you have the list of interesting words, and then for each of

these words you have a list. In the case of a book you would have a list of pages; for us, you would have a list of documents that actually contain this word. And notice several things. First of all, the words are sorted. Of course, it makes sense, because you need

to be able to find the word that you're looking for so you can go to the page that actually contains it. And also the pages, or the documents, are sorted as well. This is not accidental; this is very important for us, and we'll

see how. And also, when we're talking about search using a computer, there are other things involved in this data structure, notably some statistics: for example, how many times this word is contained in this document, or what the length of the

list is, etc., things that will be very important later on. So when we have a data structure like this, how does the search work? Well, it's super simple. If we're looking for a document that mentions both Python and Django, we locate these two words and we get

back the lists, and now we just walk the lists and merge them together, so whenever we find a document that is present in both lists, that is our result. If we wanted to do something like a phrase search, where we're looking only for places where Python is immediately

followed by the word "web", all we have to do is add another piece of information into the inverted index. We just need offsets: what is the position of this word in the document? And then when we are going through the merging process, we just say we care not only that

the document is in both lists, but the offsets must be immediately following each other, so "Python" would be at position n and "web" would be at position n plus 1. So you can see that actually doing a phrase search is not any more expensive than doing a regular

search; you're just adding one more comparison, a numerical comparison at that, so it is fairly efficient. What else you can sort of imagine here is that I can get the list of documents from anywhere. It doesn't have to come from the

same index. So I have multiple indices; I can have an index on every single field in my document, and I can use them all. If I have one condition on the title, one condition on the category, and one on the body, I will just query those three inverted indices to get these posting

lists, as they're called, and merge them together. So we don't have the limitation of many other data stores that limit the number of indices you can use per query or collection, and that is also something that we benefit greatly

from with this data structure. And finally, the last thing that you do when you do this merging, when you find your match, is quantify how good a match it is. That is the primary difference between a search engine and a database: we not

only tell you which documents match your query but also how well each one matches. Is it a good match, or is it just meh? And we know that because we have the information about the statistics. So this is called relevancy: we tell you how relevant the document

is to your query. So how is relevancy computed? At the base of it there are two numbers that we call TF and IDF. TF is term frequency: it is just the number of occurrences of that word in the given document or in the given field. So

if I'm looking for Django in a document, how many times does this document contain the word Django? Is it there only once? Is it there three times? And obviously, the higher the number, the better the relevancy. IDF is inverse document frequency, which is just

a fancy way of saying how common or rare this word is in your entire data set. Is this a word that is contained in every single document that you have, or is this a word that is only present in one percent of your documents? And we can get this information

right away from the inverted index, because that's essentially the length of the list attached to the field compared to the number of documents that we have overall. Fairly easy to calculate, and there's actually the exact formula if you're

so inclined. And this number has the opposite effect: the more common the word is, the less relevant this document is for the result, because if we find a word that is in every single document, yeah, who cares? It's in every single document; of

course we're going to find it. That doesn't mean anything. So this is sort of the base formula for anything that has to do with relevancy, and it works very well for text. Now, Lucene, the library that does the indexing and the heavy lifting for

Elasticsearch, adds some more stuff on top of it. You can see the exact formula there, and you can see in the middle that that's the TF-IDF, the big sum. What it adds on top of it is that it takes into account, for example, the length of the field, because

if we find the word Django in the title versus in the body, that also gives us different information, right? If we have a short field and we still find it there, it's more relevant than if we have the full text of a book and we find it there as well. Those

are different types of information. So it improves on the basic TF-IDF formula, but it's still only relying on the statistics that it has learned about your data set, and sometimes you want to go a little further. Sometimes

[ ... ]
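
The inverted index, the posting-list merge, the phrase search with offsets, and the TF-IDF score described in the excerpt above can be sketched in a few lines of Python. This is only a toy illustration, not how Elasticsearch or Lucene implement it: the sample documents and all function names (`build_index`, `intersect`, `search_all`, `search_phrase`, `tf_idf`) are made up for this sketch, and the IDF shown is the common textbook variant log(N / df), assumed here rather than taken from the formula on the speaker's slide.

```python
import math
from collections import defaultdict

# Toy corpus; the documents and their ids are invented for illustration.
DOCS = {
    1: "python web framework django",
    2: "django is a python web framework for python developers",
    3: "elasticsearch is a search engine",
    4: "the web runs on python",
}


def build_index(docs):
    """Map each term to a posting list [(doc_id, [offsets])] sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        offsets_by_term = defaultdict(list)
        for offset, term in enumerate(docs[doc_id].split()):
            offsets_by_term[term].append(offset)
        for term, offsets in offsets_by_term.items():
            index[term].append((doc_id, offsets))
    return index


def intersect(list_a, list_b):
    """Walk two sorted posting lists in step, keeping docs present in both."""
    i = j = 0
    matches = []
    while i < len(list_a) and j < len(list_b):
        doc_a, offsets_a = list_a[i]
        doc_b, offsets_b = list_b[j]
        if doc_a == doc_b:
            matches.append((doc_a, offsets_a, offsets_b))
            i += 1
            j += 1
        elif doc_a < doc_b:
            i += 1
        else:
            j += 1
    return matches


def search_all(index, first, second):
    """Documents that contain both terms, anywhere."""
    merged = intersect(index.get(first, []), index.get(second, []))
    return [doc_id for doc_id, _, _ in merged]


def search_phrase(index, first, second):
    """Same merge plus one numeric check: `second` sits at offset n + 1."""
    results = []
    for doc_id, offsets_a, offsets_b in intersect(
        index.get(first, []), index.get(second, [])
    ):
        following = set(offsets_b)
        if any(n + 1 in following for n in offsets_a):
            results.append(doc_id)
    return results


def tf_idf(index, term, doc_id, total_docs):
    """Term frequency times log(N / df); rarer terms weigh more."""
    postings = index.get(term, [])
    tf = next((len(offsets) for d, offsets in postings if d == doc_id), 0)
    if tf == 0:
        return 0.0
    return tf * math.log(total_docs / len(postings))


index = build_index(DOCS)
print(search_all(index, "python", "django"))   # [1, 2]
print(search_all(index, "python", "web"))      # [1, 2, 4]
print(search_phrase(index, "python", "web"))   # [1, 2]
print(tf_idf(index, "python", 2, len(DOCS)))
print(tf_idf(index, "django", 2, len(DOCS)))
```

Because both posting lists are sorted by document id, the merge is a single pass over each list, and the phrase search reuses the same merge with one extra comparison on offsets. In this toy corpus "django" appears in fewer documents than "python", so a single occurrence of "django" in document 2 scores higher than two occurrences of "python".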

Note: the remaining 3,226 words of the full transcript have been omitted to comply with YouTube's "fair use" rules.