Build a search index in Python
16 July 2024 | 12:00 am

How can search engines be so fast? While there are many parts of a search system, one of the key concepts to know is the inverted index. Suppose we have the following strings, all Taylor Swift lyrics: I made you my temple, my mural, my sky And I still talk to you when I’m screaming at the sky Started with a kiss We could manually search through these to find all lyrics that contain a word, like “window”. With that said, this solution wouldn’t scale well if the documents we had were every song lyric by every pop artist. Here, we need an inverted index. An inverted index is a data structure that maps words to the documents they are in. Here is the structure of an inverted index: { "temple": [0], "kiss": [2], "sky": [0, 1], ... } With the structure above, we can look for sky and find that it is in documents 0 and 1. Then, we could return those documents to the user. The alternative would be manually searching through every single document. This data structure is a map, also known as a dictionary. One of the benefits of a dictionary is that it has O(1) lookup times. This means that the lookup time is constant no matter how many items are in the dictionary. If our dictionary has a million records, it will be as fast to query as it would if it had three records. Let’s talk through how to create a reverse index. Define the documents Before we begin, we need to define our documents. Our search feature is going to let us find what songs contain a specific lyric. Thus, we will need two pieces of information: a song title, and the lyric. Let’s start with a list of documents: documents = [ { "title": "tolerate it", "lyric": "I made you my temple, my mural, my sky" }, { "title": "my tears ricochet", "lyric": "And I still talk to you when I'm screaming at the sky", }, { "title": "The Bolter", "lyric": "Started with a kiss" } ] Here, we have three documents, each with a title and a lyric. Define the index Now that we have a list of documents, we need to build our index. To build the index, we will loop through each item in the list of documents. For each item, we will: Change all text to lowercase, so that our index is case insensitive; Remove all symbols, so symbols do not interfere with searches; Add each word in the lyric to an index. Translated into code, this looks like: import string def transform_text(text): return text.lower().translate(str.maketrans("", "", string.punctuation)) index = {} for i, doc in enumerate(documents): lyric = transform_text(doc["lyric"]) for word in lyric.split(): if word not in index: index[word] = set({}) index[word].add(i) First, we define a function called transform_text. This code will turn all text lowercase and remove all punctuation. This is important because if our search index includes capital letters, we can only query it if the user has typed in the same capitals as the index. The same applies with punctuation. Thus, if we wanted to look for sky, we would have to search exactly for sky. If we didn’t have the apostrophe, we couldn’t search for it in our dictionary. We are using sets to keep track of documents that contain a word. This is so that a document that contains the same word multiple times will only be saved once in the index. We then loop over each document. We call transform_text to transform the lyric to lowercase and remove all punctuation. We then add each individual word to our index. We add words with this structure: { word: [documents] } Let’s run the code above on a single lyric: “I made you my temple, my mural, my sky”: {'i': {0}, 'made': {0}, 'you': {0}, 'my': {0}, 'temple': {0}, 'mural': {0}, 'sky': {0}} We now have all the words in the lyric in our index. Each word is associated with its document ID. Since this is the first lyric in our list of documents, all IDs are 0. If we index all of our documents, we will have: {'i': {0, 1}, 'made': {0}, 'you': {0, 1}, 'my': {0}, 'temple': {0}, 'mural': {0}, 'sky': {0, 1}, 'and': {1}, 'still': {1}, 'talk': {1}, 'to': {1}, 'when': {1}, 'im': {1}, 'screaming': {1}, 'at': {1}, 'the': {1}, 'started': {2}, 'with': {2}, 'a': {2}, 'kiss': {2}} Notice sky has two values: 0 and 1. This is because the word appears in both documents. We now have a search index! The next step is to write some code to query it. Search the index To query our index, let’s define a search function: def search(query): words = transform_text(query).split() results = set() for word in words: if word in index: results.update(index[word]) titles = [documents[idx] for idx in results] return titles This function accepts a query and defines an empty list of results. We first transform the query so that it is in the same form as all the documents: lowercase, and without symbols. We then split the query into a list of words. Because our index only contains single words as keys, we have to run a query for each word. We then loop over all words and check if the word is in the index. If it is, we add all of the document IDs to the set results. Finally, we get the song titles associated with each document ID. This code will return all documents that contain any word in your search query. The words don’t have to be together. There is more code that we would need to write to return documents that contain two or more words in sequence. That is out of scope for this guide. Let’s call our function with the query sky: [{'title': 'tolerate it', 'lyric': 'I made you my temple, my mural, my sky'}, {'title': 'my tears ricochet', 'lyric': "And I still talk to you when I'm screaming at the sky"}] Our code successfully identified both lyrics! If we search for kiss, we get: [{'title': 'The Bolter', 'lyric': 'Started with a kiss'}] Benchmarking Earlier in this article, we discussed how an index would be faster than searching through all words in all documents. Let’s test that theory. We can use the code above, with a few changes. First, we need to make the documents list larger: documents = documents * 5000 This will make the documents list 5,000x larger. Thus, we will have 15,000 documents (3 documents x 5,000). Second, we need to write some code that will time the use of our search query. Let’s run 20 searches of the same query and time them: import time start = time.time() for _ in range(20): search("sky") end = time.time() print("Time taken for index:", end - start) Finally, let’s write code that would manually check if every lyric value in documents contained our search query: def search_by_words(query): lyric = transform_text(query).split() results = [] for word in lyric: for doc in documents: if word in transform_text(doc["lyric"]): results.append(doc) titles = [doc["title"] for doc in results] return titles start = time.time() for _ in range(20): search_by_words("sky") end = time.time() print("Time taken for manual comparisons:", end - start) Let’s search for the term sky. When tested on an M1 Macbook, this code returns: Time taken for index: 0.0071811676025390625 Time taken for manual comparisons: 0.8166451454162598 To run 20 queries with our index of 15,000 documents, our script took 0.007s. The same 20 queries on the same index took 0.8s if we did manual comparison. Of note, our document sizes are short. If our documents were much longer – for example, the text from web pages – the manual comparison would take even longer. Querying the index, on the other hand, would take the same time no matter how long documents are. This is because the index has been pre-built and saved in a data structure with O(1) lookup time. We only need to build the index once. Whereas our manual comparison is checking everything by brute force, which is slow and does not scale well. Conclusion In this guide, we made a reverse index in Python. We defined a list of documents, then created a reverse index that maps each word in every document to an ID associated with that document. We then wrote code to query the index, and demonstrated that indexes are substantially faster than brute-force search at runtime. Our search index does have a significant limitation: we cannot search for text that is explicitly together (i.e. I can). With that said, for a few dozen lines of code, we have built an efficient data structure for searching documents. You can see the full code for this blog post on GitHub. This post was inspired by Tim Bray’s On Search: Basic Basics blog post, which was the first resource that helped me understand search indexing and how it could be so fast.

Expectations in design
16 July 2024 | 12:00 am

I have been thinking about the question what makes a “good tool”? What are the characteristics that make me comfortable depending on a tool? What makes me delighted to use a tool? Expectations play a crucial role in my experience with a tool. Does this tool do what I expect? Are the controls easy to navigate? Are navigation elements grouped together well? Do buttons do what I think they should do based on my impression with the tool? Is there language or iconography that is confusing? I recently updated two links on my website: my email and RSS link. The email link used to take you directly to your email client with the mailto: link, and the RSS link used to take you directly to the RSS feed. But, after clicking around my site for a while, looking for details that did not work as expected, I realised these two links were not intuitive. The mailto: link, while useful, is not something I use myself. I often end up having to manually copy an email address or type it out myself because my browser is not configured to open with my email client. I could change it, but herein lies a question: will most of the users of this website have their email client set up to use mailto:? The answer may be yes, but the friction I have encountered gives me an incline that perhaps there is another experience I could offer. Instead, I decided to create a page that explicitly states my email address. The Email link in the sidebar now takes you to that page. As a user, this makes it easier to navigate the website. If I click on the email link — either intentionally or accidentally — I will not be taken immediately to another place, or presented with a system dialog to choose my email client. Instead, I will remain on my website, with the email available in text ready for use. (Unfortunately, [at] is used instead of @ to prevent scraping by spammers. This is not an ideal user experience, as I want the email to be copy-pastable. Perhaps there is a solution to this problem already? If you know one, let me know!) My RSS link now takes you to a page that has my feed in plain text, as well as a link to the About Feeds website that describes what web feeds are and how you can use them. This is substantially better than the old experience, which would download the RSS feed to your machine. It was unreasonable to expect that a user would know to copy the RSS link instead of open it up. And perhaps RSS should be labeled with a label that doesn’t reference a technology, but instead describes that you can subscribe to my blog. I don’t like “Subscribe” because of the connotations with social media. I will need to think of what language to use. After these changes, I feel like the email and RSS links — which appear on every page — are more intuitive. I feel like those links help me better navigate the website without distrupting my navigation flow by taking me elsewhere, or by depending on configurations that I cannot reasonably expect (mailto: being set up and someone knowing what to do with a raw RSS feed page).

craft
16 July 2024 | 12:00 am

in polishing, craft.


More News from this Feed See Full Web Site