A Simple Web Search Engine

Every user expects a search engine on any Web site they visit. Unfortunately, search engines are extremely complex to develop. They require massive amounts of resources and time to implement. Today I’ll show you how to use a MySQL (or any SQL database) to create your own basic search engine.

Note: Today’s Code Gallery Spotlight requires that an SQL-compliant database be installed (such as MySQL). If a SQL database is not available, you can download one fromhttp://www.mysql.com/.

The search engine concept

Although search engines can be incredibly complex applications, the concept behind them is fairly easy to grasp. In its simplest form, a search engine is nothing more than a program that determines if a given item (such as a word) exists in a given set of items (such as a document).

Today, I will refer to the item that is being searched for as the “needle” and the set of items being searched as the “haystack”. Through the use of two SQL tables, I will take this needle-and-haystack concept, and turn it into a working and powerful search engine.

The SQL tables

You will be using two SQL tables to drive the backend of the search engine. The first of these tables are labeled as keywords_list, and is defined by an SQL CREATE the statement as follows:

Because this table can only be described in reference to the second of the SQL tables you will be creating, it is best that the individual descriptions of the columns in the keywords_list table be saved for later. For now, let’s continue by describing the second table, which I will call page_data:

Because the purpose of these two tables may not be immediately clear, I will describe the purpose of each briefly starting with the page_data table. As defined above, the page_data the table is used to store all of the information you would expect to find in a single result from a search engine search, including:

  • the URL to the Web site
  • the title of the page
  • a brief description or summary of the page in question

Also included is an auto-incrementing integer defined as the primary key, called page_id. This integer is a unique identifier that allows you to refer to each separate page in the table uniquely. This is a very important behavior; its purpose will be described next.

As the name implies, the keywords_list the table stores the keywords associated with the Web pages that the search engine has indexed. Specifically, the purpose of this database is to associate each keyword in the table individually with a Web page that is described in the page_data table. Setting the page_id value of a given keyword in the keywords_list table to the appropriate page_id of the Web page described in the page_data table accomplishes this task.

To better illustrate this concept, consider the following diagram:

The above is a representation of our two tables. As shown, the page_id values for those keywords on the left (the keywords_list table) are associated with the Web sites descriptions on the right (the page_data table).

In practice, the above relations are created through the use of SQL JOIN statements. Because the actual details of this concept are complex, it is recommended that you consult your SQL manual (for MySQL users, see the MySQL site) for more information on relations between tables.

Why not just one table?

If you have minimal SQL experience, you may ask: “Why not just use a single table?” Of course, the above two SQL tables can be combined into a single SQL table, which seems simpler in theory. However, in practice, as you will see, the above style is not only easier to implement but is also much more efficient than a single table.

The SQL query

With the SQL tables being used defined, the next step is to determine the format of the query used by the script to retrieve matches. For this section, I will use:

  • a simple dynamic regular expression to match keywords
  • SQL joins to retrieve the proper data from the page_data table
  • the COUNT() SQL function with the GROUP BY clause to achieve relevancy checking

The concept is fairly simple: we want to return a list of Web pages (the url, title and description) that have at least one keyword in the list of “requested” keywords. They should be sorted so that those Web pages that matched the most keywords are listed first. In SQL-speak, the same concept can be illustrated in the following SQL statement:

If the above SQL query is beyond your immediate understanding, don’t worry — simply accept it as the plain English description given. The important point beyond this query is the WHERE clause, which consists of two conditions:

  • The page_id for a given page_data must match that of a valid keywords_list page_id value
  • A given needle must exist in the keyword_list table

Also note that the only values returned by the above SELECT query are the url, title, description from the page_data table and score from the keywords_list table (where score refers to the number of keywords matched).

Using regular expressions to match keywords

In the SQL query, notice the use of the REGEXP keyword. This keyword instructs the SQL language to use a regular expression to determine if a given keyword will be allowed. For the purposes of this query, the regular expression that will suit our needs is…

[(key1)(key2)(key3)(key4)...]

…where every single keyword being searched for is enclosed in parenthesis and the entire set of keywords is enclosed in brackets. Note that there is no limit to the number of keywords that can be included in the regular expression.

The script

With the concepts behind the database backend now explained, it is time to examine the PHP script that pulls the user input and database together to create a search engine.

Today’s script consists of a single function called search() that takes two parameters. The first parameter, $db, is an array containing all of the information needed to connect to the database and perform the necessary operations. The second parameter, $query_terms, is a whitespace-separated list of keywords to match –just as it would come from an input form on a Web page.

When executed, the search() the function returns an array of matched Web sites upon success, or an error message if something goes wrong.

Now let’s look at the code.

The $db array

The first thing that must happen in our script is a definition of the database, username/password, and tables being used to perform the search. This array is an associative array of the following structure:

If this is not self-explanatory, hostuser and pass are used to connect and log in to the given database, while databasekeywords and desc are used to define the database and tables where the search engine data is located. If you’re still confused, the use of these values in the body of the search() the function should eliminate these concerns.

Connecting to the database

Now that you have defined the login information, it is time to look at the actual search() function. To start the function, you must initialize any variables that are necessary, connect to the database of your choice (I will be using MySQL), and select the proper database to use as shown:

Performing the query

At this point, you are ready to construct and execute the SQL query that will return the desired Web pages, if any exist meeting the search criteria. Since I have already explained the query itself, the only topic for discussion is the construction of the regular expression used in the WHERE clause of the query.

To refresh your memory, we must construct a regular expression of the following form:

[(key1)(key2)(key3)....]

Since the function itself is given a white space-delimited string of keywords, the first step will be to separate each keyword into an array of keywords through the use of the explode() function. Then, we will use the implode() function to convert the array of keywords back into a single string (except this time in the proper regular expression format). Once the regular expression has been constructed, it can be simply be plugged into the remainder of the SQL query and executed as shown.

Important Note: For the keywords to be searched properly, they must be properly escaped using the addslashes() function, and converted to lower case using the strtolower() function.

 

Note that a check is performed to ensure the success of the query before continuing further. Upon failure, an error message is constructed and returned instead of the expected array of results.

Returning the results

Once the query has been performed, the final step in the search() the function will be to traverse the results and return them back to the calling script in the form of an array.

This can be done with a simple while() loop that stores each row in an array $matches, which is then returned. If the SQL query returned no results, note that, since we defined $matches as an array at the start of the function, an array will still be returned (it will just be empty):

Using the script

Now that you have finished the search function, the final step is to actually use the function. This can be accomplished by simply calling search() and checking the return value. If the search()the function returns a string, an error has occurred during the attempt to search. If an array is returned, everything has completed successfully and the array will contain a list of results as shown below:

Today’s Code Gallery Spotlight is a very useful and simple way to implement the basics of a search engine on your Web pages. Although there are many more complex ways to implement a search engine — with a much more complicated script — this should provide the basics.