Postgres Full Text Search

Aug 28th, 2020 - written by Kimserey with .

In today’s post we will look at how PostgreSQL Full Text Search functionalities can be used to retrieved ranked documents. We’ll start by looking at the basics of matching documents, then we’ll move on to how we can define indexes for performance improvement and we’ll end by looking at weights and ranking.

Matching

The basics of full text search revolves around matching documents. A document represents whatever can be match on, for example if a model is an email, a document could be the title plus the body, or if the model is a TV, the document could be the title plus the description. With full text search, we want to be able to provide a text like SMART LED 55inch and the result to be a list of ranked results.

Let’s start first by creating an example database:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
CREATE TABLE televisions (
   id           SERIAL  PRIMARY KEY,
   title        TEXT    NOT NULL,
   description  TEXT    NOT NULL,
   date         DATE    NOT NULL
);

INSERT INTO televisions (title, description, date) VALUES (
    'HISENSE 50A7300FTUK 50" Smart 4K Ultra HD HDR LED TV with Amazon Alexa', 
    'Enjoy amazing visuals with the Hisense 50A7300FTUK 50" Smart 4K Ultra HD HDR LED TV. The 4K Ultra HD display means everything looks pin-sharp, and even upscales lower resolution shows to bring new levels of detail to old classics.', 
    '2020-03-01');
INSERT INTO televisions (title, description, date) VALUES (
    'HISENSE 55A7300FTUK 55" Smart 4K Ultra HD HDR LED TV with Amazon Alexa', 
    'Enjoy amazing visuals with the Hisense 55A7300FTUK 55" Smart 4K Ultra HD HDR LED TV. The 4K Ultra HD display means everything looks pin-sharp, and even upscales lower resolution shows to bring new levels of detail to old classics.', 
    '2019-07-01');
INSERT INTO televisions (title, description, date) VALUES (
    'SAMSUNG UE55TU7100KXXU 55" Smart 4K Ultra HD HDR LED TV',
    'Dont settle for anything less than crystal clear colour. The Samsung UE55TU7100KXXU 55" Smart 4K Ultra HD HDR LED TV delivers pure colours, sharp contrast, and brilliant brightness. HDR displays the brightest and darkest colours in each frame, so everything from dimly lit dialogue scenes to nature documentaries look vivid and realistic.',
    '2019-01-01');
INSERT INTO televisions (title, description, date) VALUES (
    'LG 55UN74006LB 55" Smart 4K Ultra HD HDR LED TV with Google Assistant & Amazon Alexa', 
    'Enjoy movie nights at home with the LG 55UN74006LB 55" Smart 4K Ultra HD HDR LED TV. With a 4K display and AI upscaling, everything you watch is packed with life-like detail.', 
    '2020-01-01');
INSERT INTO televisions (title, description, date) VALUES (
    'LG 50UN74006LB 50" Smart 4K Ultra HD HDR LED TV with Google Assistant & Amazon Alexa', 
    'Enjoy movie nights at home with the LG 50UN74006LB 50" Smart 4K Ultra HD HDR LED TV. With a 4K display and AI upscaling, everything you watch is packed with life-like detail.', 
    '2020-05-01');
INSERT INTO televisions (title, description, date) VALUES (
    'SAMSUNG UE50TU7100KXXU 50" Smart 4K Ultra HD HDR LED TV', 
    'You have got one of the largest collections of 4K content and catch-up apps at your fingertips, including Apple TV, Netflix, and Disney+, so you willl never be stuck for something to watch.', '2019-11-01');
INSERT INTO televisions (title, description, date) VALUES (
    'PHILIPS 50PUS7855 50" 4K Ultra HD HDR LED TV with Amazon Alexa', 
    'You will be immersed in your favourite shows with the Ambilight feature on the Philips 50PUS7855 50" 4K Ultra HD HDR LED TV with Amazon Alexa. The intelligent LEDs around three sides of the TV project coloured light onto your walls, and reacts to what is on the screen for even more atmosphere.', 
    '2019-09-01');
INSERT INTO televisions (title, description, date) VALUES (
    'SONY BRAVIA KD-55AG9BU 55" Smart 4K Ultra HD HDR OLED TV with Google Assistant', 
    'Discover just how great your entertainment can look with the Sony BRAVIA KD-55AG9BU 55" Smart 4K Ultra HD HDR OLED TV with Google Assistant.', 
    '2020-01-01');

In this context, we can think of a document as the title plus the description so for the first document, it would be:

1
SELECT title || ' ' || description AS document FROM televeisions LIMIT 1;

Which would result in the title being concatenated with the description. In order to start matching the document with a query, Postgres uses the term tsvector for text search vector and tsquery for text search query where a tsvector is a datatype representing a list of lexeme (word token appearing in the document) and a tsquesry is a datatype representing a processed query into lexeme to search for. Creating a tsvector is achieved by converting the document using the function to_tsvector, and similarly creating the tsquery is achieved by converting the query using to_tsquery.

For example:

1
SELECT to_tsvector(title || ' ' || description) FROM televisions LIMIT 1;

results in:

1
to_tsvector | '4k':5,23,30 '50':3,21 '50a7300ftuk':2,20 'alexa':13 'amaz':15 'amazon':12 'bring':47 'classic':54 'detail':51 'display':33 'enjoy':14 'even':41 'everyth':35 'hd':7,25,32 'hdr':8,26 'hisens':1,19 'led':9,27 'level':49 'look':36 'lower':43 'mean':34 'new':48 'old':53 'pin':38 'pin-sharp':37 'resolut':44 'sharp':39 'show':45 'smart':4,22 'tv':10,28 'ultra':6,24,31 'upscal':42 'visual':16

The matching symbol used for full text search in PostgreSQL is @@. Using the tsvector and the tsquery, we are able to make a filter tsvector @@ tsquery which will return true for false depending on whether the query matches the vector.

Using what we have learnt so far, we can filter

1
2
3
4
5
6
SELECT
     id,
     title
 FROM televisions
 WHERE 
    to_tsvector(title || ' ' || description) @@ websearch_to_tsquery('SMART LED 55');

We used a variation of to_tsquery with websearch_to_tsquery which allows us to type regular sentence as it would be for a websearch. The query then gets translated to 'smart' & 'led' & '55'

This search will result in the following results:

1
2
3
4
5
6
7
+------+--------------------------------------------------------------------------------------+
| id   | title                                                                                |
|------+--------------------------------------------------------------------------------------|
| 2    | HISENSE 55A7300FTUK 55" Smart 4K Ultra HD HDR LED TV with Amazon Alexa               |
| 3    | SAMSUNG UE55TU7100KXXU 55" Smart 4K Ultra HD HDR LED TV                              |
| 4    | LG 55UN74006LB 55" Smart 4K Ultra HD HDR LED TV with Google Assistant & Amazon Alexa |
+------+--------------------------------------------------------------------------------------+

Indexing

So far we’ve seen how execute a full text search on documents by converting the content of rows into a tsvector. This approach is relatively cheap for small subset of data, but we can imagine how the performance would degrade as the amount of rows grows and the larger the document is.

An approach to cater for this issue is to create a column specifically designed to hold the vectorised document, and create a GIN index on that column.

We start by creating a stored generated columns which gets computed out of the title and description using to_tsvector.

1
2
ALTER TABLE televisions
    ADD COLUMN textsearchable_index_col tsvector GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || description)) STORED;

Here we had to specify the 'english' configuration as the column has to be immutable. Previously we were using to_tsvector on the fly so the configuration used was the default configuration configured for the particular PostgreSQL client where the query was ran from. But when we are creating an index column, or an index, we have to specify the configuration.

And we then add an index on that column:

1
CREATE INDEX textsearch_idx ON televisions USING GIN (textsearchable_index_col);
1
2
3
4
5
6
7
8
9
10
11
12
+--------------------------+----------+-----------------------------------------------------------------------------------+
| Column                   | Type     | Modifiers                                                                         |
|--------------------------+----------+-----------------------------------------------------------------------------------|
| id                       | integer  |  not null default nextval('televisions_id_seq'::regclass)                         |
| title                    | text     |  not null                                                                         |
| description              | text     |  not null                                                                         |
| date                     | date     |  not null                                                                         |
| textsearchable_index_col | tsvector |  default to_tsvector('english'::regconfig, ((title || ' '::text) || description)) |
+--------------------------+----------+-----------------------------------------------------------------------------------+
Indexes:
    "televisions_pkey" PRIMARY KEY, btree (id)
    "textsearch_idx" gin (textsearchable_index_col)

We can then use the column we created:

1
2
3
4
5
6
SELECT
     id,
     title
 FROM televisions
 WHERE 
    textsearchable_index_col @@ websearch_to_tsquery('SMART LED 55');

We can see how we can trim the computation of the tsvector as it has already been generated once.

Ranking

Once we know how to retrieve matching documents to the query provided, the next step is to be able to rank documents by relevance.

For ranking, we have ts_rank(tsvector, tsquery) available.

1
2
3
4
5
6
SELECT
     id,
     title,
     ts_rank(textsearchable_index_col, query) AS rank
 FROM televisions, websearch_to_tsquery('SMART LED 55') as query
 ORDER BY rank DESC;

Using ts_rank we can score all rows and get them ranked by their relevance.

1
2
3
4
5
6
7
8
9
10
11
12
+------+--------------------------------------------------------------------------------------+------------+
| id   | title                                                                                | rank       |
|------+--------------------------------------------------------------------------------------+------------|
| 2    | HISENSE 55A7300FTUK 55" Smart 4K Ultra HD HDR LED TV with Amazon Alexa               | 0.46427712 |
| 3    | SAMSUNG UE55TU7100KXXU 55" Smart 4K Ultra HD HDR LED TV                              | 0.45246616 |
| 4    | LG 55UN74006LB 55" Smart 4K Ultra HD HDR LED TV with Google Assistant & Amazon Alexa | 0.4462062  |
| 8    | SONY BRAVIA KD-55AG9BU 55" Smart 4K Ultra HD HDR OLED TV with Google Assistant       | 0.18874387 |
| 1    | HISENSE 50A7300FTUK 50" Smart 4K Ultra HD HDR LED TV with Amazon Alexa               | 0.18807486 |
| 5    | LG 50UN74006LB 50" Smart 4K Ultra HD HDR LED TV with Google Assistant & Amazon Alexa | 0.1782009  |
| 6    | SAMSUNG UE50TU7100KXXU 50" Smart 4K Ultra HD HDR LED TV                              | 0.09148999 |
| 7    | PHILIPS 50PUS7855 50" 4K Ultra HD HDR LED TV with Amazon Alexa                       | 1e-20      |
+------+--------------------------------------------------------------------------------------+------------+

We can also filter to keep only matching document:

1
2
3
4
5
6
7
SELECT
     id,
     title,
     ts_rank(textsearchable_index_col, query) AS rank
FROM televisions, websearch_to_tsquery('SMART LED 55') as query
WHERE textsearchable_index_col @@ query
ORDER BY rank DESC;

Which will result in the following results:

1
2
3
4
5
6
7
+------+--------------------------------------------------------------------------------------+------------+
| id   | title                                                                                | rank       |
|------+--------------------------------------------------------------------------------------+------------|
| 2    | HISENSE 55A7300FTUK 55" Smart 4K Ultra HD HDR LED TV with Amazon Alexa               | 0.46427712 |
| 3    | SAMSUNG UE55TU7100KXXU 55" Smart 4K Ultra HD HDR LED TV                              | 0.45246616 |
| 4    | LG 55UN74006LB 55" Smart 4K Ultra HD HDR LED TV with Google Assistant & Amazon Alexa | 0.4462062  |
+------+--------------------------------------------------------------------------------------+------------+

Weights

Our vector was created by a concatenation of the title and the description. PostgreSQL also allows us to add weights on parts of the document. For example we could put more weight on the title using setweight.

We recreate the index column with weights:

1
2
3
4
ALTER TABLE televisions
    ADD COLUMN textsearchable_index_col tsvector GENERATED ALWAYS AS (setweight(to_tsvector('english', title), 'A') || setweight(to_tsvector('english', description), 'B')) STORED;

CREATE INDEX textsearch_idx ON televisions USING GIN (textsearchable_index_col);

And we can run the following query which looks for SMART or NETFLIX in the title or description:

1
2
3
4
5
6
7
SELECT
       id,
       title,
       ts_rank(textsearchable_index_col, query) AS rank
FROM televisions, websearch_to_tsquery('smart OR netflix') as query
WHERE textsearchable_index_col @@ query
ORDER BY rank DESC;

and results in the following ordering:

1
2
3
4
5
6
7
8
9
10
11
+------+--------------------------------------------------------------------------------------+------------+
| id   | title                                                                                | rank       |
|------+--------------------------------------------------------------------------------------+------------|
| 6    | SAMSUNG UE50TU7100KXXU 50" Smart 4K Ultra HD HDR LED TV                              | 0.42554897 |
| 2    | HISENSE 55A7300FTUK 55" Smart 4K Ultra HD HDR LED TV with Amazon Alexa               | 0.33435988 |
| 3    | SAMSUNG UE55TU7100KXXU 55" Smart 4K Ultra HD HDR LED TV                              | 0.33435988 |
| 1    | HISENSE 50A7300FTUK 50" Smart 4K Ultra HD HDR LED TV with Amazon Alexa               | 0.33435988 |
| 5    | LG 50UN74006LB 50" Smart 4K Ultra HD HDR LED TV with Google Assistant & Amazon Alexa | 0.33435988 |
| 8    | SONY BRAVIA KD-55AG9BU 55" Smart 4K Ultra HD HDR OLED TV with Google Assistant       | 0.33435988 |
| 4    | LG 55UN74006LB 55" Smart 4K Ultra HD HDR LED TV with Google Assistant & Amazon Alexa | 0.33435988 |
+------+--------------------------------------------------------------------------------------+------------+

And that concludes today’s post!

Conclusion

Today we looked at how we could use the PostgreSQL Full Text Search functionalities. We started by looking at the basics of matching, then we looked at indexing in order to improve performance, we then moved on to look at how to rank by relevance documents and we completed the post by looking at how we could apply weights on different columns. I hope you liked this post and I see you on the next one.

External Sources

Designed, built and maintained by Kimserey Lam.