The MPI standard was carefully written to be a thread-safe specification. That means that the design of MPI doesn't include concepts such as "last message" or "current pack buffer" that are not well defined when multiple threads are present. MPI implementations can choose whether to provide thread-safe implementations. Allowing this choice is particularly important because thread safety usually comes at the price of performance due to the extra overhead required to ensure that internal data structures are not modified inconsistently by two different threads. Most early MPI implementations were not thread safe.
MPI-2 introduced four levels of thread safety that an MPI implementation could provide. The lowest level, MPI_THREAD_SINGLE, allows only single-threaded programs. The next level, MPI_THREAD_FUNNELED, allows multiple threads provided that all MPI calls are made by a single thread (the main thread, which initialized MPI); most MPI implementations provide MPI_THREAD_FUNNELED. The next level, MPI_THREAD_SERIALIZED, allows many user threads to make MPI calls, but only one thread at a time. The highest level of support, MPI_THREAD_MULTIPLE, allows any thread to call any MPI routine at any time. The level of thread support can be requested by using the routine MPI_Init_thread; this routine returns the level of thread support that is available, which may differ from the level requested.
Understanding the level of thread support is important when combining MPI with approaches to thread-based parallelism. OpenMP is a popular and powerful directive-based approach for specifying thread-based parallelism. While OpenMP provides some tools for general threaded parallelism, one of the most common uses is to parallelize a loop. If the loop contains no MPI calls, then OpenMP can safely be combined with MPI, even with an implementation that provides only MPI_THREAD_FUNNELED, because every MPI call is still made by the single thread that runs outside the parallel loop. In the Jacobi example, for instance, OpenMP can be used to parallelize the loop computation:
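    /* MPI communication is done by one thread, outside the parallel loop */
    exchange_nbrs( ulocal, i_start, i_end, left, right );
    /* rows are shared among threads; j is made private to avoid a race */
    #pragma omp parallel for private(j)
    for (i_local=1; i_local<=i_end-i_start+1; i_local++)
        for (j=1; j<=NY; j++)
            ulocal_new[i_local][j] = 0.25 * (ulocal[i_local+1][j] + ulocal[i_local-1][j] +
                ulocal[i_local][j+1] + ulocal[i_local][j-1] - h*h*flocal[i_local][j]);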
This exploits the fact that MPI was designed to work well with other tools, leveraging improvements in compilers and threaded parallelism.
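To make the threaded loop concrete, here is a self-contained sketch of a single Jacobi sweep over one process's local block. The function name jacobi_sweep, the fixed sizes NX and NY, and the array layout with one ghost row or column on each side are assumptions made for illustration; the neighbor exchange is assumed to have already filled the ghost rows. With GCC, OpenMP is enabled by the -fopenmp flag; without it the pragma is ignored and the loop simply runs serially.

```c
#define NX 4   /* interior rows owned by this process (illustrative size) */
#define NY 4   /* interior columns (illustrative size) */

/* One Jacobi sweep over the interior points of the local block.
   Rows 0 and NX+1 are ghost rows that the neighbor exchange would fill;
   columns 0 and NY+1 hold boundary values. */
void jacobi_sweep(double u[NX + 2][NY + 2], double u_new[NX + 2][NY + 2],
                  double f[NX + 2][NY + 2], double h)
{
    int i, j;

    /* Rows are divided among the threads. j must be listed as private
       because it is declared outside the parallel region. The loop body
       makes no MPI calls, so MPI_THREAD_FUNNELED is sufficient. */
    #pragma omp parallel for private(j)
    for (i = 1; i <= NX; i++)
        for (j = 1; j <= NY; j++)
            u_new[i][j] = 0.25 * (u[i + 1][j] + u[i - 1][j] +
                                  u[i][j + 1] + u[i][j - 1] -
                                  h * h * f[i][j]);
}
```

Each thread writes only to its own rows of u_new and reads only from u and f, so the sweep needs no synchronization beyond the implicit barrier at the end of the loop.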