Extending the PostgreSQL Server with Custom Functions

Table of contents:

An extension function is loaded into a running PostgreSQL server process as needed. If you don't actually use an extension, it will not be loaded. Extension functions must be created in the form of a dynamically loadable object module. In the Windows world, an extension is contained within a DLL. In the Linux/Unix environment, an extension is contained within a shared object module.

There are two phases to the process of adding an extension function to the PostgreSQL server. First, you create the extension function in the language of your choice, compiling it into a dynamic object module (.dll or .so). Next, tell the PostgreSQL server about the function. The CREATE FUNCTION command adds a new function to a database.

I'll show you two examples that should help clarify this process.

PostgreSQL and Portability

Some of the steps required to write a PostgreSQL extension function in C may seem rather odd at first. You may feel more comfortable with the process if you understand the problem that the PostgreSQL authors were trying to fix.

When you call a function in a typical C program, you know at the time you write your code how to call that function. You know how many arguments are required and you know the data type of each argument. If you provide an incorrect number of parameters or incorrect data types, it is highly likely that your program will crash. For example, the fopen() function (from the C Runtime Library) requires two parameters:

FILE * fopen( const char * filename, const char * mode )

If you omit the mode parameter or send a numeric data type instead of a pointer, your program will fail in some way.

Now, suppose that your program prompts the user for the name of a dynamic object module and the name of a function within that module. After you load the given module into your program, you have to call the named function. If you know which function the user will select, you can formulate your function call properly at the time you write your code. What happens if the user selects some other function that takes a completely different argument list? How can you formulate the function call if you don't know the parameter list? There is no portable way to do that, and PostgreSQL aims to be extremely portable.

So, the PostgreSQL authors decided to change the way you pass arguments to an extension function. Rather than declaring a separate formal parameter for each value passed to the function, PostgreSQL marshals all the arguments into a separate data structure and passes the address of the marshaled form to your extension. When you need to access function parameters, you get to them through the marshaled form.

This is similar in concept to the way the main() function of a C program behaves. You can't know, at the time you write the main() function, how many command-line parameters you will receive. (You might know how many parameters you should receive, but how many you will receive is not quite the same animal.) The startup routine on the C Runtime Library marshals the command-line arguments into a data structure (the argv[] array) and passes you the address of that structure. To find the actual values specified on the command line, you must use the data structure rather than formal parameters.

Older versions of PostgreSQL used a strategy that became less portable as operating systems advanced into the 64-bit arena. The old strategy is known as the "version-0 calling convention." The new strategy is called the "version-1 calling convention." PostgreSQL still supports both calling conventions, but you should stick to the version-1 convention for better portability.

For more information on the difference between the version-0 and version-1 conventions, see section 12 of the PostgreSQL Programmer's Guide.

There are two important consequences to the version-1 convention. First, all version-1 functions return the same data type: a Datum. A Datum is a sort of universal data type. Any PostgreSQL data type can be accessed through a Datum. PostgreSQL provides a set of macros that make it easy to work with Datums. Second, a version-1 function makes use of a set of macros to access function arguments. Every version-1 function is declared in the same way:

Datum function-name(PG_FUNCTION_ARGS);

As you read through the examples in this chapter, keep in mind that the PostgreSQL authors had to solve the portability problem.

The first example adds a simple function, named filesize, to the PostgreSQL server. Given the name of a file, it returns the size of the file (in bytes). If the file does not exist, cannot be examined, or is not a regular[1] file, this function returns NULL. You might find this function (and the filelist() function shown later) useful for performing system administration tasks from within a PostgreSQL application. After you have created the filesize function, you can call it like this:

[1] In this context, a file is considered "regular" if it is not a directory, named pipe, symbolic link, device file, or socket.

movies=# SELECT filesize( '/bin/bash' );
 filesize
----------
 512668

We'll develop the filesize function in C (see Listing 6.1).

The filesize function takes a single argumenta pathname in the form of a TEXT value. This function returns the size of the named file as an INTEGER value.

Listing 6.1. filesize.c

 1 /*
 2 ** Filename: filesize.c
 3 */
 4
 5 #include "postgres.h"
 6 #include "fmgr.h"
 7 #include 
 8
 9 PG_FUNCTION_INFO_V1(filesize);
10
11 Datum filesize(PG_FUNCTION_ARGS)
12 {
13 text * fileNameText = PG_GETARG_TEXT_P(0);
14 size_t fileNameLen = VARSIZE( fileNameText ) - VARHDRSZ;
15 char * fileName = (char *)palloc( fileNameLen + 1 );
16 struct stat statBuf;
17
18 memcpy( fileName, VARDATA( fileNameText ), fileNameLen );
19 fileName[fileNameLen] = '';
20
21 if( stat(fileName, &statBuf) == 0 && S_ISREG(statBuf.st_mode))
22 {
23 pfree( fileName );
24
25 PG_RETURN_INT32((int32)statBuf.st_size );
26 }
27 else
28 {
29 pfree( fileName );
30
31 PG_RETURN_NULL();
32 }
33 }

Lines 5 and 6 #include two header files supplied by PostgreSQL. These files (postgres.h and fmgr.h) provide data type definitions, function prototypes, and macros that you can use when writing extensions. The file included at line 7 defines the layout of the struct stat object used by the stat() function (described later).

Line 9 uses the PG_FUNCTION_INFO_V1() to tell PostgreSQL that the function (filesize()) uses the version-1 calling convention.

At line 11, you see the signature used for all version-1 functions. The filesize() function returns a Datum and expects a single argument. PG_FUNCTION_ARGS is a preprocessor symbol that expands to declare a consistently named parameter. So, your function definition expands from this:

Datum filesize(PG_FUNCTION_ARGS)

to this:

Datum filesize( FunctionCallInfo fcinfo )

This might seem a little strange at first, but the version-1 argument accessor macros are written so that the single function argument must be named fcinfo.

At line 13, you create a variable of type text. text is one of the data types defined in the postgres.h header file (or in a file included by postgres.h). Whenever you write an extension function, you will be working with two sets of data types. Each function parameter (and the return value) will have a SQL data type and a C data type. For example, when you call the filesize function from within PostgreSQL, you pass a TEXT parameter: TEXT is the SQL data type. When you implement the filesize function in C, you receive a text value: text is the C data type. The name for the C data type is usually similar to the name of the corresponding SQL data type. For clarity, I'll refer to the PostgreSQL data types using uppercase letters and the C data types using lowercase letters.

Notice that a macro is used to retrieve the address of the TEXT value. I mentioned earlier that an extension function must use macros to access parameters, and this is an example of such a macro. The PG_GETARG_TEXT_P(n) macro returns the nth parameter, which must be of type TEXT. The return value of PG_GETARG_TEXT_P(n) is of type text. There are many argument-accessor functions, each corresponding to a specific parameter type: PG_GETARG_INT32(n), PG_GETARG_BOOL(n), PG_GETARG_OID(n), and so on. See the fmgr.h PostgreSQL header file for a complete list.

We'll be using the stat() function (from the C Runtime library) to find the size of a given file. stat() expects to find the pathname in the form of a null-terminated string. PostgreSQL has given you a text value, and text values are not null-terminated. You will need to convert fileNameText into a null-terminated string.

If fileNameText is not null-terminated, how do you know the length of the pathname? Let's take a peek at the definition of the text data type (from the c.h PostgreSQL header file):

struct varlena
{
 int32 vl_len;
 char vl_data[1];
};

typedef struct varlena text;

You can see that a text value is defined by the struct varlena structure. The vl_len member tells you how many bytes are required to hold the entire structure. The characters that make up the text value start at the address of the v1_data[0] member. PostgreSQL supplies two macros that make it easy to work with variable-length data structures. The VARHDRSZ symbol contains the size of the fixed portion of a struct varlena. The VARSIZE() macro returns the size of the entire data structure. The VARDATA() macro returns a pointer to first byte of the TEXT value. The length of the TEXT value is VARSIZE() - VARHDRSZ. You store that length in the fileNameLen variable.

At line 15, you allocate enough space to hold a copy of the null-terminated string. The palloc() function is similar to malloc(): It allocates the requested number of bytes and returns a pointer to the new space. You should use palloc() and pfree() when you write extension functions rather than malloc() and free(). The palloc() and pfree() functions ensure that you can't create a memory leak in an extension function, which is something you can do if you use malloc() instead.

Lines 18 and 19 create a null-terminated copy of the TEXT value, and line 21 passes the null-terminated string to the stat() function. If the stat() function succeeds, it fills in the statBuf structure and returns 0.

If you succeeded in retrieving the file status information and the file is a regular file, free the null-terminated string (using pfree()) and return the file size. Notice that you must use a macro to translate the return value (an int32) into a Datum.

If the stat() function failed (or the file is not a regular file), you free the null-terminated string and return NULL. Again, you use a macro to produce the return value in the form of a Datum.

Now that you have crafted the filesize function, you need to compile it into a shared object module. You usually compile a C source file into a standalone executable program, but PostgreSQL expects to find the filesize function in a shared object module. The procedure for producing a shared object module is different for each compiler; section 31.9 of the PostgreSQL reference documentation describes the process for a number of compilers. Listing 6.2 shows the makefile that I've used to compile the filesize function using Fedora Core (Linux):

Listing 6.2. makefile

# File name: makefile
SERVER_INCLUDES += -I $(shell pg_config --includedir)
SERVER_INCLUDES += -I $(shell pg_config --includedir-server)

CFLAGS += -g $(SERVER_INCLUDES)

.SUFFIXES: .so

.c.so:
 $(CC) $(CFLAGS) -fpic -c $<
 $(CC) $(CFLAGS) -shared -o $@ $(basename $<).o

To compile filesize using this makefile, you would issue the following command:

$ make -f makefile filesize.so

After the compile step is completed, you are left with a file named filesize.so in your current directory. The preferred location for a PostgreSQL extension can be found using the pg_config command:

$ pg_config --pkglibdir
/usr/local/pg800/lib/postgresql

You can copy the filesize.so file to this directory, but I prefer to create a symbolic link pointing back to my development directory instead. After an extension is completely debugged, I delete the symbolic link and copy the final version into the preferred location. To create a symbolic link, use the following command:

$ ln -s `pwd`/filesize.so `pg_config --pkglibdir`

At this point, you have a shared object module, but you still have to tell PostgreSQL about the function that you want to import into the server.

The CREATE FUNCTION command tells PostgreSQL everything it needs to know to call your function:

movies=# CREATE OR REPLACE FUNCTION
movies-# filesize( TEXT ) RETURNS INTEGER AS
movies-# 'filesize.so', 'filesize' LANGUAGE 'C'
movies-# STRICT;
CREATE

This command defines a function named filesize( TEXT ). This function returns an INTEGER value. The function is written in C and can be found in the file filesize.so in the preferred extension directory. You can specify a complete pathname to the shared object module if you want to, but in most cases it's easier to just put it where PostgreSQL expects to find it, as I've done here. You can also omit the filename extension (the .so part), as long as you follow the shared object module-naming rules imposed by your host operating system.

I've defined filesize() as a strict function. The STRICT attribute tells PostgreSQL that this function will always return NULL if any argument is NULL. If PostgreSQL knows that a function is STRICT, it can avoid calling the function with a NULL argument (again, a performance optimization). STRICT makes it easier for you to implement your extension functions; you don't have to check for NULL arguments if you declare your functions to be STRICT.

Now you can call the function from within a PostgreSQL session:

movies=# SELECT filesize( '/bin/bash' );
 filesize
----------
 512668
(1 row)
movies=# SELECT filesize( 'non-existent file' );
 filesize
----------

(1 row)

Debugging PostgreSQL Extensions

One of the challenges you'll face in creating PostgreSQL extensions is figuring out how to debug them. Relax, it's easy. First, remember that the extension function that you create is loaded into the server (not the client). That means that when you fire up a debugger, you want to attach it to the server process. How do you find the server process? Call the pg_backend_pid() function once the server is up and runningpg_backend_pid() returns the process ID of the server that your client is connected to. Next, load the shared-object file (the file that contains your extension function) into the server with the LOAD command. At this point, your server is waiting for youit's time to attach the debugger. If you're using the gdb debugger, you can attach to a running process with the command:

gdb postgres server-process-id

But remember, the PostgreSQL server process is owned by user postgres: If you try to attach without the proper privileges, gdb will just laugh at you. Make sure you su postgres before you run gdb. When you ask gdb to attach to a running process, the second argument is ignoredit has to be there, but it really doesn't matter what you string you use.

If you have the proper privileges, gdb should now be waiting for you to enter a command. Now you can set a breakpoint at your extension function and gdb will interrupt the server when that function is invoked. Notice that the server process is suspended until you tell the debugger to continue (if you try to execute a command from your PostgreSQL client application, the client will hang until the server wakes up again). Once you've told gdb to continue, you can go back to your client application and issue a command that invokes the function that you're interested in.

If you're debugging an extension function on a Windows host, the procedure is similar: Find the process ID of the server and attach a Windows debugger to that process.

To summarize:

Start a client application (such as psql)
From within the client: SELECT pg_backend_pid();
From within the client: LOAD 'extension-file.so';
Start another terminal session and su postgres
Fire up the debugger: $ gdb postgres server-process-id
Set a breakpoint: (gdb) break my-function
Tell gdb to continue: (gdb) cont
Go back to the client application and execute a command that will call your extension function

Part I: General PostgreSQL Use

Introduction to PostgreSQL and SQL

Working with Data in PostgreSQL

PostgreSQL SQL Syntax and Use

Performance

Part II: Programming with PostgreSQL

Introduction to PostgreSQL Programming

Extending PostgreSQL

PL/pgSQL

The PostgreSQL C APIlibpq

A Simpler C APIlibpgeasy

The New PostgreSQL C++ APIlibpqxx

Embedding SQL Commands in C Programsecpg

Using PostgreSQL from an ODBC Client Application

Using PostgreSQL from a Java Client Application

Using PostgreSQL with Perl

Using PostgreSQL with PHP

Using PostgreSQL with Tcl and Tcl/Tk

Using PostgreSQL with Python

Npgsql: The .NET Data Provider

Other Useful Programming Tools

Part III: PostgreSQL Administration

Introduction to PostgreSQL Administration

PostgreSQL Administration

Internationalization and Localization

Security

Replicating PostgreSQL Data with Slony

Contributed Modules

Index