11.8. Spotlight

As the capacities of commonly available storage devices continue to grow, we find it possible to store staggering amounts of information on personal computer systems. Besides, new information is continually being generated. Unfortunately, such information is merely "bytes" unless there are powerful and efficient ways to present it to humans. In particular, one must be able to search such information. By the arrival of the twenty-first century, searching had established itself as one of the most pervasive computing technologies in the context of the Internet. In comparison, typical search mechanisms in operating systems remained primitive.

Although a single computer system is nowhere near the Internet in terms of the amount of information it contains, it is still a daunting task for users to search for information "manually." There are several reasons why it is difficult.

Traditional file system organization, although hierarchical in nature, still requires the user to classify and organize information. Furthermore, as existing information is updated and new information is added, the user must incorporate such changes in the file organization. If multiple views into the data are required, they must be painfully constructed, say, through symbolic links or by replicating data. Even then, such views will be static.
There are simply too many files. As more computer users adopt mostly digital lifestyles, wherein music, pictures, and movies reside on their systems along with traditional data, the average number of files on a representative personal computer will continue to grow.
Historically, users have worked with very little file system metadata: primarily, filenames, sizes, and modification times. Even though typical file systems store additional metadata, it is mostly for storage bookkeeping; such data is neither intuitive nor very useful for everyday searching. Data particularly useful for flexible searching is often the user data within files (such as the text within a text document) or is best provided as additional file-specific metadata (such as an image's dimensions and color model). Traditional file systems also do not allow users to add their own metadata to files.
In situations where several applications access or manipulate the same information, it would be beneficial for both application developers and users to have such applications share information. Although means for sharing data abound in computing, traditional APIs are rather limited in their support for sharing typed information, even on a given platform.

Memex

When Vannevar Bush was the Director of the Office of Scientific Research and Development in the United States, he published an article that described, among several visionary insights and observations, a hypothetical device Bush had conceived many years earlier: the memex.^[12] The memex was a mechanized private file and library, a supplement to human memory. It would store a vast amount of information and allow for rapid searching of its contents. Bush envisioned that a user could store books, letters, records, and any other arbitrary information on the memex, whose storage capacity would be large enough. The information could be textual or graphic.

^[12] "As We May Think," by Vannevar Bush (Atlantic Monthly 176:1, July 1945, pp. 101-108).

Mac OS X 10.4 introduced Spotlight, a system for extracting (or harvesting), storing, indexing, and querying metadata. It provides an integrated system-wide service for searching and indexing.

11.8.1. Spotlight's Architecture

Spotlight is a collection of both kernel- and user-level mechanisms. It can be divided into the following primary constituents:

The fsevents change notification mechanism
A per-volume metadata store
A per-volume content index
The Spotlight server (mds)
The mdimport and mdsync helper programs (which have symbolic links to them, mdimportserver and mdsyncserver, respectively)
A suite of metadata importer plug-ins
Programming interfaces, with the Metadata framework (a subframework of the Core Services umbrella framework) providing low-level access to Spotlight's functionality
End-user interfaces, including both command-line and graphical interfaces

Figure 11-21 shows how various parts of Spotlight interact with each other.

Figure 11-21. Architecture of the Spotlight system

The fsevents mechanism is an in-kernel notification system with a subscription interface for informing user-space subscribers of file system changes as they occur. Spotlight relies on this mechanism to keep its information current: it updates a volume's metadata store and content index if file system objects are added, deleted, or modified. We will discuss fsevents in Section 11.8.2.

On a volume with Spotlight indexing enabled, the /.Spotlight-V100 directory contains the volume's content index (ContentIndex.db), metadata store (store.db), and other related files. The content index is built atop Apple's Search Kit technology, which provides a framework for searching and indexing text in multiple languages. The metadata store uses a specially designed database in which each file, along with its metadata attributes, is represented as an MDItem object, a Core Foundation-compliant object that encapsulates the metadata. The MDItemCreate() function from the Metadata framework can be used to instantiate an MDItem object corresponding to a given pathname. Thereafter, one or more attributes can be retrieved or set^[13] in the MDItem by calling other Metadata framework functions. Figure 11-22 shows a program that retrieves or sets an individual attribute of the MDItem associated with a given pathname.

^[13] In Mac OS X 10.4, the functions for setting MDItem attributes are not part of the public API.

Figure 11-22. Retrieving and setting an `MDItem` attribute

// mditem.c

#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <CoreServices/CoreServices.h>

#define PROGNAME "mditem"

#define RELEASE_IF_NOT_NULL(ref) { if (ref) { CFRelease(ref); } }
#define EXIT_ON_NULL(ref)        { if (!ref) { goto out; } }

// not part of the public API in Mac OS X 10.4, so we declare it ourselves
void MDItemSetAttribute(MDItemRef item, CFStringRef name, CFTypeRef value);

void
usage(void)
{
    fprintf(stderr, "Set or get metadata. Usage:\n\n\
    %s -g <attribute-name> <filename>                   # get\n\
    %s -s <attribute-name>=<attribute-value> <filename> # set\n",
    PROGNAME, PROGNAME);
}

int
main(int argc, char **argv)
{
    int               ch, ret = -1;
    MDItemRef         item = NULL;
    CFStringRef       filePath = NULL, attrName = NULL;
    CFTypeRef         attrValue = NULL;
    char             *valuep;
    CFStringEncoding  encoding = CFStringGetSystemEncoding();

    if (argc != 4) {
        usage();
        goto out;
    }

    // the last argument is the pathname; hide it from getopt()
    filePath = CFStringCreateWithCString(kCFAllocatorDefault,
                                         argv[argc - 1], encoding);
    EXIT_ON_NULL(filePath);
    argc--;

    item = MDItemCreate(kCFAllocatorDefault, filePath);
    EXIT_ON_NULL(item);

    while ((ch = getopt(argc, argv, "g:s:")) != -1) {
        switch (ch) {
        case 'g':
            attrName = CFStringCreateWithCString(kCFAllocatorDefault,
                                                 optarg, encoding);
            EXIT_ON_NULL(attrName);
            attrValue = MDItemCopyAttribute(item, attrName);
            EXIT_ON_NULL(attrValue);
            CFShow(attrValue);
            break;

        case 's':
            // split "<attribute-name>=<attribute-value>" at the '='
            if (!(valuep = strchr(argv[optind - 1], '='))) {
                usage();
                goto out;
            }
            *valuep++ = '\0';
            attrName = CFStringCreateWithCString(kCFAllocatorDefault,
                                                 optarg, encoding);
            EXIT_ON_NULL(attrName);
            attrValue = CFStringCreateWithCString(kCFAllocatorDefault,
                                                  valuep, encoding);
            EXIT_ON_NULL(attrValue);
            (void)MDItemSetAttribute(item, attrName, attrValue);
            break;

        default:
            usage();
            goto out;
        }
    }

    ret = 0; // all requested operations completed

out:
    RELEASE_IF_NOT_NULL(attrName);
    RELEASE_IF_NOT_NULL(attrValue);
    RELEASE_IF_NOT_NULL(filePath);
    RELEASE_IF_NOT_NULL(item);

    exit(ret);
}

$ gcc -Wall -o mditem mditem.c -framework CoreServices
$ ./mditem -g kMDItemKind ~/Desktop
Folder
$ ./mditem -g kMDItemContentType ~/Desktop
public.folder

From a programming standpoint, an MDItem is a dictionary containing a unique abstract key, along with a value, for each metadata attribute associated with the file system object. Mac OS X provides a large number of predefined keys encompassing several types of metadata. As we will see in Section 11.8.3, we can enumerate all keys known to Spotlight by using the mdimport command-line program.

The Spotlight server, that is, the metadata server (mds), is the primary daemon in the Spotlight subsystem. Its duties include receiving change notifications through the fsevents interface, managing the metadata store, and serving Spotlight queries. Spotlight uses a set of specialized plug-in bundles called metadata importers for extracting metadata from different types of documents, with each importer handling one or more specific document types. The mdimport program acts as a harness for running these importers. It can also be used to explicitly import metadata from a set of files. The Spotlight server also uses mdimport (specifically, a symbolic link to it, mdimportserver) for this purpose. An importer returns metadata for a file as a set of key-value pairs, which Spotlight adds to the volume's metadata store.

Custom metadata importers must be careful in defining what constitutes metadata. Although an importer can technically store any type of information in the metadata store by simply providing it to Spotlight, storing information that is unlikely to be useful in searching (for example, thumbnails or arbitrary binary data) will be counterproductive. The Search Kit may be a better alternative for application-specific indexing. Mac OS X applications such as Address Book, Help Viewer, System Preferences, and Xcode use the Search Kit for efficient searching of application-specific information.

Spotlight versus BFS

Spotlight is sometimes compared to the metadata-indexing functionality offered by BFS, the native file system in BeOS.^[14] BFS was a 64-bit journaled file system that provided native support for extended attributes. A file could have an arbitrary number of attributes that were actually stored as files within a special, internal directory associated with the file. Moreover, BFS maintained indexes for standard file system attributes (such as name and size). It also provided interfaces that could be used to create indexes for other attributes. The indexes were stored in a hidden directory that was otherwise a normal directory. The query syntax of the BFS query engine was largely identical to Spotlight's. As with Spotlight, a query could be live in that it could continue to report any changes to the query results.

As we will see in Chapter 12, HFS+ provides native support for extended attributes. In that light, the combination of Spotlight and HFS+ might appear similar to BFS. However, there are several important differences.

Perhaps the most important point to note is that as implemented in Mac OS X 10.4, Spotlight does not use the support for native extended attributes in HFS+. All harvested metadata, whether extracted from on-disk file structures or provided explicitly by the user (corresponding to the Spotlight Comments area in the Finder's information pane for a file or folder), is stored externally. In particular, Spotlight itself does not modify or add any metadata, including extended attributes, to files.

In the case of BFS, the creation of indexes occurred in the file system itself. In contrast, Spotlight builds and maintains indexes entirely in user space, although it depends on the fsevents kernel-level mechanism for timely notification of file system changes.

On purely theoretical grounds, the BFS approach appears closer to optimal. However, Spotlight has the benefit of being independent of the file system: for example, it works on HFS+, UFS, MS-DOS, and even AFP volumes. Since the metadata store and the content index need not reside on the volume they are for, Spotlight can even be made to work on a read-only volume.

^[14] The comparison is especially interesting since the same engineer played key roles in the design and implementation of both BFS and Spotlight.

11.8.2. The Fsevents Mechanism

The fsevents mechanism provides the basis for Spotlight's live updating. The kernel exports the mechanism to user space through a pseudo-device (/dev/fsevents). A program interested in learning about file system changes (a watcher, in fsevents parlance) can subscribe to the mechanism by accessing this device. Specifically, a watcher opens /dev/fsevents and clones the resultant descriptor using a special ioctl operation (FSEVENTS_CLONE).

The Spotlight server is the primary subscriber of the fsevents mechanism.

The ioctl call requires a pointer to an fsevent_clone_args structure as its argument. The event_list field of this structure points to an array containing up to FSE_MAX_EVENTS elements, each of which is an int8_t value indicating the watcher's interest in the event with the corresponding index. If the value is FSE_REPORT, the kernel should report that event type to the watcher. If the value is FSE_IGNORE, the watcher is not interested in that event type. Table 11-2 lists the various event types. If the array has fewer elements than the maximum number of event types, the watcher is implicitly uninterested in the remaining types. The event_queue_depth field of the fsevent_clone_args structure specifies the size of the per-watcher event queue (expressed as the number of events) that the kernel should allocate. This size is limited by MAX_KFS_EVENTS (2048).

Table 11-2. Event Types Supported by the Fsevents Mechanism
Event Index	Event Type	Description
0	`FSE_CREATE_FILE`	A file was created.
1	`FSE_DELETE`	A file or a folder was deleted.
2	`FSE_STAT_CHANGED`	A change was made to the `stat` structure; for example, an object's permissions were changed.
3	`FSE_RENAME`	A file or folder was renamed.
4	`FSE_CONTENT_MODIFIED`	A file's content was modified; specifically, a file that is being closed was written to while it was open.
5	`FSE_EXCHANGE`	The contents of two files were swapped through the `exchangedata()` system call.
6	`FSE_FINDER_INFO_CHANGED`	A file or folder's Finder information was changed; for example, the Finder label color was changed.
7	`FSE_CREATE_DIR`	A folder was created.
8	`FSE_CHOWN`	A file system object's ownership was changed.
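
The interest array can be modeled in a few lines of portable C. The following is a minimal sketch, not the kernel's code: the `sketch_` names are hypothetical, and the numeric values chosen for the report and ignore markers are assumptions standing in for the real constants in bsd/sys/fsevents.h. It shows only the rule that event types beyond the end of a short list are treated as ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for FSE_IGNORE and FSE_REPORT; the values are assumptions. */
enum { SKETCH_FSE_IGNORE = 0, SKETCH_FSE_REPORT = 1 };

/* Nine event types, with indices 0 through 8, as listed in the table. */
enum { SKETCH_FSE_MAX_EVENTS = 9 };

/*
 * Returns 1 if a watcher that supplied event_list[0..num_events-1] wants
 * events of the given type reported, and 0 otherwise.
 */
int sketch_watcher_wants(const int8_t *event_list, size_t num_events, int type)
{
    assert(event_list != NULL);

    if (type < 0 || type >= SKETCH_FSE_MAX_EVENTS)
        return 0;                 /* not a known event type */
    if ((size_t)type >= num_events)
        return 0;                 /* short list: implicitly not interested */
    return event_list[type] == SKETCH_FSE_REPORT;
}
```

A watcher that passes a three-element list, for instance, can express interest only in the first three event types; everything from FSE_RENAME onward is left unreported for it.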

As we will shortly see, the elements of a per-watcher event queue are not the events themselves but pointers to kfs_event structures, which are reference-counted structures that contain the actual event data. In other words, all watchers share a single event buffer in the kernel.

int    ret, fd, clonefd;
int8_t event_list[] = { /* FSE_REPORT or FSE_IGNORE for each event type */ };
struct fsevent_clone_args fca;
...
fd = open("/dev/fsevents", O_RDONLY);
...
fca.event_list        = event_list;
fca.num_events        = sizeof(event_list)/sizeof(int8_t);
fca.event_queue_depth = /* desired size of event queue in the kernel */
fca.fd                = &clonefd;
ret = ioctl(fd, FSEVENTS_CLONE, (char *)&fca);
...

Once the FSEVENTS_CLONE ioctl returns successfully, the program can close the original descriptor and read from the cloned descriptor. Note that if a watcher is interested in knowing about file system changes only on one or more specific devices, it can specify its devices of interest by using the FSEVENTS_DEVICE_FILTER ioctl on the cloned /dev/fsevents descriptor. By default, fsevents assumes that a watcher is interested in all devices.

A read call on the cloned descriptor will block until the kernel has file system changes to report. When such a read call returns successfully, the data read contains one or more events, each encapsulated in a kfs_event structure. The latter contains an array of event arguments, each of which is a structure of type kfs_event_arg_t, containing variable-size argument data. Table 11-3 shows the various possible argument types. The argument array is always terminated by the special argument type FSE_ARG_DONE.

Table 11-3. Argument Types Contained in Events Reported by the Fsevents Mechanism
Argument Type	Description
`FSE_ARG_VNODE`	A vnode pointer
`FSE_ARG_STRING`	A string pointer
`FSE_ARG_PATH`	A full pathname
`FSE_ARG_INT32`	A 32-bit integer
`FSE_ARG_INT64`	A 64-bit integer
`FSE_ARG_RAW`	A void pointer
`FSE_ARG_INO`	An inode number
`FSE_ARG_UID`	A user ID
`FSE_ARG_DEV`	A file system identifier (the first component of an `fsid_t`) or a device identifier (a `dev_t`)
`FSE_ARG_MODE`	A 32-bit number containing a file mode
`FSE_ARG_GID`	A group ID
`FSE_ARG_FINFO`	An argument used internally by the kernel to hold an object's device information, inode number, file mode, user ID, and group ID; translated to a sequence of individual arguments for user space
`FSE_ARG_DONE`	A special type (with value `0xb33f`) that marks the end of a given event's argument list

typedef struct kfs_event_arg {
    u_int16_t       type; // argument type
    u_int16_t       len;  // size of argument data that follows this field
    ...                   // argument data
} kfs_event_arg_t;

typedef struct kfs_event {
    int32_t         type; // event type
    pid_t           pid;  // pid of the process that performed the operation
    kfs_event_arg_t args[KFS_NUM_ARGS]; // event arguments
} kfs_event;
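
Because the arguments are variable in size, a reader cannot index into the args array directly; it has to walk the byte stream one argument at a time. The sketch below illustrates that walk on a synthetic buffer rather than on data read from /dev/fsevents. All `sketch_` names are hypothetical, pid_t is assumed to be 32 bits wide, and the layout (a type and a pid, then arguments made of a 16-bit type, a 16-bit length, and the data, ending with a bare 16-bit terminator) is taken from the structures and the FSE_ARG_DONE description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ARG_DONE 0xb33f  /* end-of-arguments marker */

/* Append the fixed header: a 32-bit event type and a 32-bit pid. */
size_t sketch_put_header(unsigned char *buf, size_t off,
                         int32_t type, int32_t pid)
{
    memcpy(buf + off, &type, sizeof(type));
    memcpy(buf + off + sizeof(type), &pid, sizeof(pid));
    return off + sizeof(type) + sizeof(pid);
}

/* Append one argument: 16-bit type, 16-bit length, then len bytes of data. */
size_t sketch_put_arg(unsigned char *buf, size_t off, uint16_t type,
                      const void *data, uint16_t len)
{
    memcpy(buf + off, &type, sizeof(type));
    memcpy(buf + off + sizeof(type), &len, sizeof(len));
    memcpy(buf + off + sizeof(type) + sizeof(len), data, len);
    return off + sizeof(type) + sizeof(len) + len;
}

/* Append the terminator, which is a bare 16-bit type with no length. */
size_t sketch_put_done(unsigned char *buf, size_t off)
{
    uint16_t done = SKETCH_ARG_DONE;
    memcpy(buf + off, &done, sizeof(done));
    return off + sizeof(done);
}

/*
 * Walk every event in buf[0..n). Returns the number of complete events and
 * stores the total argument count in *total_args, or returns -1 if the
 * buffer ends in the middle of an event.
 */
int sketch_walk_events(const unsigned char *buf, size_t n, int *total_args)
{
    size_t off = 0;
    int    events = 0;

    assert(buf != NULL && total_args != NULL);
    *total_args = 0;

    while (off < n) {
        if (n - off < 2 * sizeof(int32_t))
            return -1;                     /* truncated header */
        off += 2 * sizeof(int32_t);        /* skip type and pid */

        for (;;) {
            uint16_t type, len;

            if (n - off < sizeof(type))
                return -1;
            memcpy(&type, buf + off, sizeof(type));
            off += sizeof(type);
            if (type == SKETCH_ARG_DONE)
                break;                     /* terminator carries no length */

            if (n - off < sizeof(len))
                return -1;
            memcpy(&len, buf + off, sizeof(len));
            off += sizeof(len);
            if (n - off < (size_t)len)
                return -1;
            off += len;                    /* skip the argument data */
            (*total_args)++;
        }
        events++;
    }
    return events;
}
```

Under these assumptions, an event with one 22-byte argument and five 4-byte arguments occupies 8 + (4 + 22) + 5 × (4 + 4) + 2 = 76 bytes.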

Figure 11-23 shows an overview of the fsevents mechanism's implementation in the kernel. There is an fs_event_watcher structure for each subscribed watcher. The event_list field of this structure points to an array of event types. The array contains values that the watcher specified while cloning the device. The devices_to_watch field, if non-NULL, points to a list of devices the watcher is interested in. Immediately following the fs_event_watcher structure is the watcher's event queue, that is, an array of pointers to kfs_event structures, with the latter residing in the global shared event buffer (fs_event_buf). The fs_event_watcher structure's rd and wr fields act as read and write cursors, respectively. While adding an event to the watcher, if it is found that the write cursor has wrapped around and caught up with the read cursor, it means the watcher has dropped one or more events. The kernel reports dropping of events as a special event of type FSE_EVENTS_DROPPED, which has no arguments (except FSE_ARG_DONE) and contains a fake process ID of zero.

Figure 11-23. An overview of the fsevents mechanism's implementation
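
The behavior of the read and write cursors can be illustrated with a small ring of event pointers. This is a simplified model under stated assumptions, not the kernel's implementation: the structure and function names are hypothetical, the queue depth is tiny for the sake of the example, and one slot is deliberately left unused so that equal cursors always mean an empty queue.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_QUEUE_DEPTH 4  /* deliberately small; one slot stays unused */

struct sketch_watcher {
    const void *queue[SKETCH_QUEUE_DEPTH]; /* pointers to shared events */
    int         rd;                        /* read cursor */
    int         wr;                        /* write cursor */
    int         dropped;                   /* set when an event is dropped */
};

/*
 * Queue a pointer to a shared event. Returns 0 on success, or -1 if the
 * write cursor would catch up with the read cursor, in which case the
 * event is dropped for this watcher and the dropped flag is set.
 */
int sketch_enqueue(struct sketch_watcher *w, const void *ev)
{
    int next;

    assert(w != NULL);
    next = (w->wr + 1) % SKETCH_QUEUE_DEPTH;
    if (next == w->rd) {
        w->dropped = 1;
        return -1;
    }
    w->queue[w->wr] = ev;
    w->wr = next;
    return 0;
}

/* Remove and return the oldest queued event pointer, or NULL if empty. */
const void *sketch_dequeue(struct sketch_watcher *w)
{
    const void *ev;

    assert(w != NULL);
    if (w->rd == w->wr)
        return NULL;
    ev = w->queue[w->rd];
    w->rd = (w->rd + 1) % SKETCH_QUEUE_DEPTH;
    return ev;
}
```

In this model, a reader that notices the dropped flag would report a single synthetic "events dropped" record and clear the flag, which mirrors the role FSE_EVENTS_DROPPED plays for a real watcher.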

Events can also be dropped because the fs_event_buf global shared buffer is full, which can happen because of a slow watcher. This is a more serious condition from Spotlight's standpoint. In this case, the kernel must discard an existing event to make space for the new event being added, which means at least one watcher will not see the discarded event. To simplify implementation, the kernel delivers an FSE_EVENTS_DROPPED event to all watchers.

Dropped Events and Spotlight

Since events dropped from the global shared event buffer affect all subscribers, a slow subscriber can adversely affect the primary subscriber, that is, the Spotlight server. If Spotlight misses any events, it may need to scan the entire volume looking for changes that it missed.

A typical scenario in which a subscriber's slowness will manifest itself is one involving heavy file system activity, where the meaning of "heavy" may vary greatly depending on the system and its currently available resources. Unpacking a giant archive or copying a well-populated directory hierarchy is likely to cause heavy-enough file system activity. The kauth-based mechanism developed in Section 11.10.3 may be a better alternative in many cases for monitoring file system activity.

Figure 11-24 shows how events are added to the global shared event buffer. Various functions in the VFS layer call add_fsevent() [bsd/vfs/vfs_fsevents.c] to generate events based on the return value from need_fsevent(type, vp) [bsd/vfs/vfs_fsevents.c], which takes an event type and a vnode and determines whether the event needs to be generated. need_fsevent() first checks the fs_event_type_watchers global array (see Figure 11-23), each of whose elements maintains a count of the number of watchers interested in that event type. If fs_event_type_watchers[type] is zero, it means that an event whose type is type need not be generated, since there are no watchers interested. Fsevents uses this array as a quick check mechanism to bail out early. Next, need_fsevent() checks each watcher to see if at least one watcher wants the event type to be reported and is interested in the device the vnode belongs to. If there is no such watcher, the event need not be generated.

Figure 11-24. Event generation in the fsevents mechanism
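
The two-stage check performed by need_fsevent() can be sketched in user-space C as follows. This is an illustration of the logic described above and not the kernel's code: the structures are simplified, devices are represented by plain integers instead of a vnode's device, and every `sketch_` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_MAX_EVENTS = 9, SKETCH_MAX_WATCHERS = 8 };

struct sketch_watcher {
    int8_t     event_list[SKETCH_MAX_EVENTS]; /* nonzero means "report" */
    const int *devices_to_watch;              /* NULL means all devices */
    size_t     num_devices;
};

struct sketch_state {
    int type_watchers[SKETCH_MAX_EVENTS];     /* interested watchers per type */
    const struct sketch_watcher *watchers[SKETCH_MAX_WATCHERS];
    size_t num_watchers;
};

/* Register a watcher and update the per-type counts. */
int sketch_add_watcher(struct sketch_state *s, const struct sketch_watcher *w)
{
    int t;

    assert(s != NULL && w != NULL);
    if (s->num_watchers == SKETCH_MAX_WATCHERS)
        return -1;
    s->watchers[s->num_watchers++] = w;
    for (t = 0; t < SKETCH_MAX_EVENTS; t++)
        if (w->event_list[t])
            s->type_watchers[t]++;
    return 0;
}

static int sketch_watches_device(const struct sketch_watcher *w, int dev)
{
    size_t i;

    if (w->devices_to_watch == NULL)
        return 1;
    for (i = 0; i < w->num_devices; i++)
        if (w->devices_to_watch[i] == dev)
            return 1;
    return 0;
}

/* Returns 1 if an event of this type on this device must be generated. */
int sketch_need_event(const struct sketch_state *s, int type, int dev)
{
    size_t i;

    assert(s != NULL);
    if (type < 0 || type >= SKETCH_MAX_EVENTS)
        return 0;
    if (s->type_watchers[type] == 0)
        return 0;                      /* quick check: nobody wants this type */
    for (i = 0; i < s->num_watchers; i++) {
        const struct sketch_watcher *w = s->watchers[i];
        if (w->event_list[type] && sketch_watches_device(w, dev))
            return 1;
    }
    return 0;
}
```

The per-type counter keeps the common case cheap: when no watcher cares about an event type, the check returns before any watcher is examined.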

add_fsevent() expands certain kernel-internal event arguments into multiple user-visible arguments. For example, both FSE_ARG_VNODE and the kernel-only argument FSE_ARG_FINFO cause FSE_ARG_DEV, FSE_ARG_INO, FSE_ARG_MODE, FSE_ARG_UID, and FSE_ARG_GID to be appended to the event's argument list.
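
As a rough user-space analogy of this expansion, the sketch below turns a pathname into the same five kinds of values, in the same order, by calling stat(). The kernel works from a vnode rather than from stat(), so this shows only the shape of the result. The `sketch_` types and functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>

enum sketch_arg_type {
    SKETCH_ARG_DEV, SKETCH_ARG_INO, SKETCH_ARG_MODE,
    SKETCH_ARG_UID, SKETCH_ARG_GID
};

struct sketch_arg {
    enum sketch_arg_type type;
    uint64_t             value;
};

static void sketch_set(struct sketch_arg *a, enum sketch_arg_type t, uint64_t v)
{
    a->type  = t;
    a->value = v;
}

/*
 * Expand the file-level information for path into five arguments: device,
 * inode number, mode, user ID, and group ID. Returns the number of
 * arguments written (5), or -1 if stat() fails.
 */
int sketch_expand_path(const char *path, struct sketch_arg out[5])
{
    struct stat st;

    assert(path != NULL && out != NULL);
    if (stat(path, &st) != 0)
        return -1;

    sketch_set(&out[0], SKETCH_ARG_DEV,  (uint64_t)st.st_dev);
    sketch_set(&out[1], SKETCH_ARG_INO,  (uint64_t)st.st_ino);
    sketch_set(&out[2], SKETCH_ARG_MODE, (uint64_t)st.st_mode);
    sketch_set(&out[3], SKETCH_ARG_UID,  (uint64_t)st.st_uid);
    sketch_set(&out[4], SKETCH_ARG_GID,  (uint64_t)st.st_gid);
    return 5;
}
```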

We will now write a program (let us call it fslogger) that subscribes to the fsevents mechanism and displays the change notifications as they arrive from the kernel. The program will process the argument list of each event, enhance it in certain cases (e.g., by determining human-friendly names corresponding to process, user, and group identifiers), and display the result. Figure 11-25 shows the source for fslogger.

Figure 11-25. A file system change logger based on the fsevents mechanism

// fslogger.c

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysctl.h>
#include <sys/fsevents.h>
#include <pwd.h>
#include <grp.h>

#define PROGNAME "fslogger"

#define DEV_FSEVENTS     "/dev/fsevents" // the fsevents pseudo-device
#define FSEVENT_BUFSIZ   131072          // buffer for reading from the device
#define EVENT_QUEUE_SIZE 2048            // limited by MAX_KFS_EVENTS

// an event argument
typedef struct kfs_event_arg {
    u_int16_t  type;         // argument type
    u_int16_t  len;          // size of argument data that follows this field
    union {
        struct vnode *vp;
        char         *str;
        void         *ptr;
        int32_t       int32;
        dev_t         dev;
        ino_t         ino;
        int32_t       mode;
        uid_t         uid;
        gid_t         gid;
    } data;
} kfs_event_arg_t;

#define KFS_NUM_ARGS  FSE_MAX_ARGS

// an event
typedef struct kfs_event {
    int32_t         type; // event type
    pid_t           pid;  // pid of the process that performed the operation
    kfs_event_arg_t args[KFS_NUM_ARGS]; // event arguments
} kfs_event;

// event names
static const char *kfseNames[] = {
    "FSE_CREATE_FILE",
    "FSE_DELETE",
    "FSE_STAT_CHANGED",
    "FSE_RENAME",
    "FSE_CONTENT_MODIFIED",
    "FSE_EXCHANGE",
    "FSE_FINDER_INFO_CHANGED",
    "FSE_CREATE_DIR",
    "FSE_CHOWN",
};

// argument names
static const char *kfseArgNames[] = {
    "FSE_ARG_UNKNOWN", "FSE_ARG_VNODE", "FSE_ARG_STRING", "FSE_ARG_PATH",
    "FSE_ARG_INT32",   "FSE_ARG_INT64", "FSE_ARG_RAW",    "FSE_ARG_INO",
    "FSE_ARG_UID",     "FSE_ARG_DEV",   "FSE_ARG_MODE",   "FSE_ARG_GID",
    "FSE_ARG_FINFO",
};

// for pretty-printing of vnode types
enum vtype {
    VNON, VREG, VDIR, VBLK, VCHR, VLNK, VSOCK, VFIFO, VBAD, VSTR, VCPLX
};

enum vtype iftovt_tab[] = {
    VNON, VFIFO, VCHR, VNON, VDIR,  VNON, VBLK, VNON,
    VREG, VNON,  VLNK, VNON, VSOCK, VNON, VNON, VBAD,
};

static const char *vtypeNames[] = {
    "VNON",  "VREG",  "VDIR", "VBLK", "VCHR", "VLNK",
    "VSOCK", "VFIFO", "VBAD", "VSTR", "VCPLX",
};
#define VTYPE_MAX (sizeof(vtypeNames)/sizeof(char *))

// look up the name of the process with the given pid
static char *
get_proc_name(pid_t pid)
{
    size_t        len = sizeof(struct kinfo_proc);
    static int    name[] = { CTL_KERN, KERN_PROC, KERN_PROC_PID, 0 };
    static struct kinfo_proc kp;

    name[3] = pid;

    kp.kp_proc.p_comm[0] = '\0';
    if (sysctl((int *)name, sizeof(name)/sizeof(*name), &kp, &len, NULL, 0))
        return "?";

    if (kp.kp_proc.p_comm[0] == '\0')
        return "exited?";

    return kp.kp_proc.p_comm;
}

int
main(int argc, char **argv)
{
    int32_t arg_id;
    int     fd, clonefd = -1;
    int     i, j, eoff, off, ret;

    kfs_event_arg_t *kea;
    struct           fsevent_clone_args fca;
    char             buffer[FSEVENT_BUFSIZ];
    struct passwd   *p;
    struct group    *g;
    mode_t           va_mode;
    u_int32_t        va_type;
    u_int32_t        is_fse_arg_vnode = 0;
    char             fileModeString[11 + 1];
    int8_t           event_list[] = { // action to take for each event
                         FSE_REPORT,  // FSE_CREATE_FILE
                         FSE_REPORT,  // FSE_DELETE
                         FSE_REPORT,  // FSE_STAT_CHANGED
                         FSE_REPORT,  // FSE_RENAME
                         FSE_REPORT,  // FSE_CONTENT_MODIFIED
                         FSE_REPORT,  // FSE_EXCHANGE
                         FSE_REPORT,  // FSE_FINDER_INFO_CHANGED
                         FSE_REPORT,  // FSE_CREATE_DIR
                         FSE_REPORT,  // FSE_CHOWN
                     };

    if (argc != 1) {
        fprintf(stderr, "%s accepts no arguments. It must be run as root.\n",
                PROGNAME);
        exit(1);
    }

    if (geteuid() != 0) {
        fprintf(stderr, "You must be root to run %s. Try again using 'sudo'.\n",
                PROGNAME);
        exit(1);
    }

    setbuf(stdout, NULL);

    if ((fd = open(DEV_FSEVENTS, O_RDONLY)) < 0) {
        perror("open");
        exit(1);
    }

    fca.event_list = (int8_t *)event_list;
    fca.num_events = sizeof(event_list)/sizeof(int8_t);
    fca.event_queue_depth = EVENT_QUEUE_SIZE;
    fca.fd = &clonefd;
    if ((ret = ioctl(fd, FSEVENTS_CLONE, (char *)&fca)) < 0) {
        perror("ioctl");
        close(fd);
        exit(1);
    }

    close(fd); // the cloned descriptor is all we need from here on

    printf("fsevents device cloned (fd %d)\nfslogger ready\n", clonefd);

    while (1) { // event-processing loop

        if ((ret = read(clonefd, buffer, FSEVENT_BUFSIZ)) > 0)
            printf("=> received %d bytes\n", ret);

        off = 0;

        while (off < ret) { // process one or more events received

            struct kfs_event *kfse = (struct kfs_event *)((char *)buffer + off);

            off += sizeof(int32_t) + sizeof(pid_t); // type + pid

            if (kfse->type == FSE_EVENTS_DROPPED) { // special event
                printf("# Event\n");
                printf("  %-14s = %s\n", "type", "EVENTS DROPPED");
                printf("  %-14s = %d\n", "pid", kfse->pid);
                off += sizeof(u_int16_t); // FSE_ARG_DONE: sizeof(type)
                continue;
            }

            // a negative type would index before the start of kfseNames
            if ((kfse->type < FSE_MAX_EVENTS) && (kfse->type >= 0)) {
                printf("# Event\n");
                printf("  %-14s = %s\n", "type", kfseNames[kfse->type]);
            } else { // should never happen
                printf("This may be a program bug (type = %d).\n", kfse->type);
                exit(1);
            }

            printf("  %-14s = %d (%s)\n", "pid", kfse->pid,
                   get_proc_name(kfse->pid));
            printf("  # Details\n    # %-14s%4s  %s\n", "type", "len", "data");

            kea = kfse->args;
            i = 0;

            while ((off < ret) && (i <= FSE_MAX_ARGS)) { // process arguments

                i++;

                if (kea->type == FSE_ARG_DONE) { // no more arguments
                    printf("    %s (%#x)\n", "FSE_ARG_DONE", kea->type);
                    off += sizeof(u_int16_t);
                    break;
                }

                eoff = sizeof(kea->type) + sizeof(kea->len) + kea->len;
                off += eoff;

                arg_id = (kea->type > FSE_MAX_ARGS) ? 0 : kea->type;
                printf("    %-16s%4hd  ", kfseArgNames[arg_id], kea->len);

                switch (kea->type) { // handle based on argument type

                case FSE_ARG_VNODE:  // a vnode (string) pointer
                    is_fse_arg_vnode = 1;
                    printf("%-6s = %s\n", "path", (char *)&(kea->data.vp));
                    break;

                case FSE_ARG_STRING: // a string pointer
                    printf("%-6s = %s\n", "string", (char *)&(kea->data.str));
                    break;

                case FSE_ARG_INT32:
                    printf("%-6s = %d\n", "int32", kea->data.int32);
                    break;

                case FSE_ARG_RAW: // a void pointer
                    // the raw bytes are stored inline, like the strings above
                    printf("%-6s = ", "ptr");
                    for (j = 0; j < kea->len; j++)
                        printf("%02x ", ((unsigned char *)&(kea->data.ptr))[j]);
                    printf("\n");
                    break;

                case FSE_ARG_INO: // an inode number
                    printf("%-6s = %d\n", "ino", kea->data.ino);
                    break;

                case FSE_ARG_UID: // a user ID
                    p = getpwuid(kea->data.uid);
                    printf("%-6s = %d (%s)\n", "uid", kea->data.uid,
                           (p) ? p->pw_name : "?");
                    break;

                case FSE_ARG_DEV: // a file system ID or a device number
                    if (is_fse_arg_vnode) {
                        printf("%-6s = %#08x\n", "fsid", kea->data.dev);
                        is_fse_arg_vnode = 0;
                    } else {
                        printf("%-6s = %#08x (major %u, minor %u)\n",
                               "dev", kea->data.dev,
                               major(kea->data.dev), minor(kea->data.dev));
                    }
                    break;

                case FSE_ARG_MODE: // a combination of file mode and file type
                    va_mode = (kea->data.mode & 0x0000ffff);
                    va_type = (kea->data.mode & 0xfffff000);
                    strmode(va_mode, fileModeString);
                    va_type = iftovt_tab[(va_type & S_IFMT) >> 12];
                    printf("%-6s = %s (%#08x, vnode type %s)\n", "mode",
                           fileModeString, kea->data.mode,
                           (va_type < VTYPE_MAX) ?  vtypeNames[va_type] : "?");
                    break;

                case FSE_ARG_GID: // a group ID
                    g = getgrgid(kea->data.gid);
                    printf("%-6s = %d (%s)\n", "gid", kea->data.gid,
                           (g) ? g->gr_name : "?");
                    break;

                default:
                    printf("%-6s = ?\n", "unknown");
                    break;
                }

                kea = (kfs_event_arg_t *)((char *)kea + eoff); // next
            } // for each argument
        } // for each event
    } // forever

    close(clonefd);

    exit(0);
}

Since fslogger.c includes bsd/sys/fsevents.h, a kernel-only header file, you need the kernel source to compile fslogger.

$ gcc -Wall -I /path/to/xnu/bsd/ -o fslogger fslogger.c
$ sudo ./fslogger
fsevents device cloned (fd 5)
fslogger ready
...
        # another shell
        $ touch /tmp/file.txt
=> received 76 bytes
# Event
  type           = FSE_CREATE_FILE
  pid            = 5838 (touch)
  # Details
    # type           len  data
    FSE_ARG_VNODE     22  path   = /private/tmp/file.txt
    FSE_ARG_DEV        4  fsid   = 0xe000005
    FSE_ARG_INO        4  ino    = 3431141
    FSE_ARG_MODE       4  mode   = -rw-r--r--  (0x0081a4, vnode type VREG)
    FSE_ARG_UID        4  uid    = 501 (amit)
    FSE_ARG_GID        4  gid    = 0 (wheel)
    FSE_ARG_DONE (0xb33f)
        $ chmod 600 /tmp/file.txt
=> received 76 bytes
# Event
  type           = FSE_STAT_CHANGED
  pid            = 5840 (chmod)
  # Details
    # type           len  data
    FSE_ARG_VNODE     22  path   = /private/tmp/file.txt
    FSE_ARG_DEV        4  fsid   = 0xe000005
    FSE_ARG_INO        4  ino    = 3431141
    FSE_ARG_MODE       4  mode   = -rw-------  (0x008180, vnode type VREG)
    FSE_ARG_UID        4  uid    = 501 (amit)
    FSE_ARG_GID        4  gid    = 0 (wheel)
    FSE_ARG_DONE (0xb33f)
...
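
The mode values in this output pack the file type and the permission bits into one number. The following sketch decodes such a value without relying on strmode(), which is not available on every platform. It is a simplified stand-in rather than a replacement: the function name is hypothetical, the octal constants are the conventional UNIX file type values, and set-user-ID, set-group-ID, and sticky bits are ignored.

```c
#include <assert.h>
#include <string.h>

/*
 * Write a 10-character "ls -l" style string for mode into out, followed by
 * a terminating NUL. Special bits (setuid, setgid, sticky) are ignored.
 */
void sketch_mode_string(unsigned int mode, char out[11])
{
    static const char bits[] = "rwxrwxrwx";
    int i;

    assert(out != NULL);

    switch (mode & 0170000) {           /* file type field */
    case 0100000: out[0] = '-'; break;  /* regular file */
    case 0040000: out[0] = 'd'; break;  /* directory */
    case 0120000: out[0] = 'l'; break;  /* symbolic link */
    case 0020000: out[0] = 'c'; break;  /* character device */
    case 0060000: out[0] = 'b'; break;  /* block device */
    case 0010000: out[0] = 'p'; break;  /* FIFO */
    case 0140000: out[0] = 's'; break;  /* socket */
    default:      out[0] = '?'; break;
    }

    memcpy(out + 1, "---------", 9);    /* start with no permissions */
    for (i = 0; i < 9; i++)
        if (mode & (0400u >> i))        /* owner read first, other execute last */
            out[1 + i] = bits[i];
    out[10] = '\0';
}
```

For the two events shown, 0x81a4 is octal 100644 and 0x8180 is octal 100600: a regular file whose permissions change from rw-r--r-- to rw-------.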

11.8.3. Importing Metadata

Spotlight metadata includes both conventional file system metadata and other metadata that resides within files. The latter must be explicitly extracted (or harvested) from files. The extraction process must deal with different file formats and must choose what to use as metadata. For example, a metadata extractor for text files may first have to deal with multiple text encodings. Next, it may construct a list of textual keywords (perhaps even a full content index) based on the file's content. Given that there are simply too many file formats, Spotlight uses a suite of metadata importers for metadata extraction, distributing work among individual plug-ins, each of which handles one or more specific types of documents. Mac OS X includes importer plug-ins for several common document types. The mdimport command-line program can be used to display the list of installed Spotlight importers.

$ mdimport -L
...
    "/System/Library/Spotlight/Image.mdimporter",
    "/System/Library/Spotlight/Audio.mdimporter",
    "/System/Library/Spotlight/Font.mdimporter",
    "/System/Library/Spotlight/PS.mdimporter",
...
    "/System/Library/Spotlight/Chat.mdimporter",
    "/System/Library/Spotlight/SystemPrefs.mdimporter",
    "/System/Library/Spotlight/iCal.mdimporter"
)

In a given Mac OS X file system domain, Spotlight plug-ins reside in the Library/Spotlight/ directory. An application bundle can also contain importer plug-ins for the application's document types.

An importer plug-in claims document types it wishes to handle by specifying their content types in its bundle's Info.plist file.

$ cat /System/Library/Spotlight/Image.mdimporter/Contents/Info.plist
...
                        <key>LSItemContentTypes</key>
                        <array>
                                <string>public.jpeg</string>
                                <string>public.tiff</string>
                                <string>public.png</string>
                                ...
                                <string>com.adobe.raw-image</string>
                                <string>com.adobe.photoshop-image</string>
                        </array>
...

You can also use the lsregister support tool from the Launch Services framework to dump the contents of the global Launch Services database and therefore view the document types claimed by a metadata importer.

Mac OS X provides a simple interface for implementing metadata importer plug-ins. An importer plug-in bundle must implement the GetMetaDataForFile() function, which should read the given file, extract metadata from it, and populate the provided dictionary with the appropriate attribute key-value pairs.

If multiple importer plug-ins claim a document type, Spotlight will choose the one that matches a given document's UTI most closely. In any case, Spotlight will run only one metadata importer for a given file.

Boolean
GetMetaDataForFile(
    void                   *thisInterface, // the CFPlugin object that is called
    CFMutableDictionaryRef  attributes,    // to be populated with metadata
    CFStringRef             contentTypeUTI,// the file's content type
    CFStringRef             pathToFile);   // the full path to the file
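As a rough illustration of what an implementation of this function might look like, the following sketch handles a plain text file and reports its first line as the kMDItemTitle attribute. The choice of the first line as the title, and the helper first_line_of_file(), are assumptions made for this example; they are not part of the Spotlight API. The Core Foundation glue is compiled only on Mac OS X, so the format-specific parsing stays separate from the plug-in entry point.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Spotlight): copy the first line of a
 * text file into buf, without the line terminator.
 * Returns 1 on success, 0 if the file cannot be read or the line is empty.
 */
int
first_line_of_file(const char *path, char *buf, size_t bufsize)
{
    FILE  *fp;
    size_t len;

    if (bufsize == 0 || (fp = fopen(path, "r")) == NULL)
        return 0;
    if (fgets(buf, (int)bufsize, fp) == NULL) {
        fclose(fp);
        return 0;
    }
    fclose(fp);

    len = strcspn(buf, "\r\n"); // strip the line terminator, if any
    buf[len] = '\0';

    return len > 0;
}

#ifdef __APPLE__
#include <limits.h>
#include <CoreServices/CoreServices.h>

// Importer entry point: populate the dictionary with a single attribute.
Boolean
GetMetaDataForFile(void                   *thisInterface,
                   CFMutableDictionaryRef  attributes,
                   CFStringRef             contentTypeUTI,
                   CFStringRef             pathToFile)
{
    char        path[PATH_MAX], line[256];
    CFStringRef title;

    (void)thisInterface;  // unused in this sketch
    (void)contentTypeUTI; // a real importer may branch on the content type

    if (!CFStringGetFileSystemRepresentation(pathToFile, path, sizeof(path)))
        return FALSE;
    if (!first_line_of_file(path, line, sizeof(line)))
        return FALSE;

    title = CFStringCreateWithCString(kCFAllocatorDefault, line,
                                      kCFStringEncodingUTF8);
    if (title == NULL) // for example, the line was not valid UTF-8
        return FALSE;

    CFDictionarySetValue(attributes, kMDItemTitle, title);
    CFRelease(title);

    return TRUE;
}
#endif
```

A real importer would also need the CFPlugin boilerplate and an Info.plist that claims the relevant content types, neither of which is shown here.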

It is possible for an importer to be called to harvest metadata from a large number of files, say, if a volume's metadata store is being regenerated or being created for the first time. Therefore, importers should use minimal computing resources. It is also a good idea for an importer to perform file I/O that bypasses the buffer cache; this way, the buffer cache will not be polluted because of the one-time reads generated by the importer.

Unbuffered I/O can be enabled on a per-file level using the F_NOCACHE file control operation with the fcntl() system call. The Carbon File Manager API provides the noCacheMask constant to request that the data in a given read or write request not be cached.
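The fcntl() approach can be sketched as follows. The function name read_file_uncached() and the 16KB buffer size are arbitrary choices for this example. The F_NOCACHE request is wrapped in a preprocessor check so that the code still compiles on systems that do not define it, and a failure of the request is deliberately ignored, since caching is only a performance concern here.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Read a file from start to end once, asking the kernel not to cache
 * the data. Returns the number of bytes read, or -1 on error.
 */
long
read_file_uncached(const char *path)
{
    char    buf[16384];
    ssize_t n;
    long    total = 0;
    int     fd;

    if ((fd = open(path, O_RDONLY)) < 0)
        return -1;

#ifdef F_NOCACHE
    // Mac OS X: turn off data caching for this file descriptor.
    (void)fcntl(fd, F_NOCACHE, 1);
#endif

    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n; // a real importer would parse the buffer here

    close(fd);

    return (n < 0) ? -1 : total;
}
```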

Once the metadata store is populated for a volume, Spotlight will typically incorporate file system changes almost immediately, courtesy of the fsevents mechanism. However, it is possible for Spotlight to miss change notifications. The metadata store can become out of date in other situations as well, for example, if the volume is written by an older version of Mac OS X or by another operating system. In such cases, Spotlight will need to run the indexing process to bring the store up to date. Note that Spotlight does not serve queries while the indexing process is running, although the volume can be written normally during this time, and the resultant file system changes will be captured by the indexing process as it runs.

The Spotlight server does not index temporary files residing in the /tmp directory. It also does not index any directory whose name contains the .noindex or .build suffixes. Xcode uses the latter type for storing files (other than targets) generated during a project build.
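To make these exclusion rules concrete, here is a minimal sketch of a path filter that applies them. It is only an illustration of the rules as stated above, not Spotlight's actual logic: the function names are hypothetical, the check for /private/tmp assumes that /tmp is a symbolic link to that directory, and only ancestor directories (path components followed by a slash) are tested for the suffixes, because the final component may be a regular file.

```c
#include <string.h>

// Return 1 if the first namelen bytes of name end with suffix.
static int
has_suffix(const char *name, size_t namelen, const char *suffix)
{
    size_t slen = strlen(suffix);

    return namelen >= slen &&
           memcmp(name + namelen - slen, suffix, slen) == 0;
}

/*
 * Hypothetical filter: return 1 if the rules described in the text
 * would exclude the given absolute path from indexing, 0 otherwise.
 */
int
path_is_skipped(const char *path)
{
    const char *p = path;

    // temporary files (assumption: /tmp resolves to /private/tmp)
    if (strncmp(path, "/tmp/", 5) == 0 ||
        strncmp(path, "/private/tmp/", 13) == 0)
        return 1;

    while (*p != '\0') {
        size_t len;

        while (*p == '/') // skip path separators
            p++;
        len = strcspn(p, "/");

        // a component followed by '/' is a directory on the path
        if (p[len] == '/' &&
            (has_suffix(p, len, ".noindex") || has_suffix(p, len, ".build")))
            return 1;

        p += len;
    }

    return 0;
}
```

With this sketch, /Users/amit/proj/build/MyApp.build/main.o is skipped, whereas /Users/amit/Documents/notes.txt is not.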

11.8.4. Querying Spotlight

Spotlight provides several ways for end users and programmers to query files and folders based on several types of metadata: importer-harvested metadata, conventional file system metadata, and file content (in the case of files whose content has been indexed by Spotlight). The Mac OS X user interface integrates Spotlight querying in the menu bar and the Finder. For example, a Spotlight search can be initiated by clicking on the Spotlight icon in the menu bar and typing a search string. Clicking on Show All in the list of search results (if any) brings up the dedicated Spotlight search window. Programs can also launch the search window to display results of searching for a given string. Figure 11-26 shows an example.

Figure 11-26. Programmatically launching the Spotlight search window

// spotlightit.c

#include <Carbon/Carbon.h>

#define PROGNAME "spotlightit"

int
main(int argc, char **argv)
{
    OSStatus status;
    CFStringRef searchString;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <search string>\n", PROGNAME);
        return 1;
    }

    searchString = CFStringCreateWithCString(kCFAllocatorDefault, argv[1],
                                             kCFStringEncodingUTF8);
    status = HISearchWindowShow(searchString, kNilOptions);
    CFRelease(searchString);

    return (int)status;
}

$ gcc -Wall -o spotlightit spotlightit.c -framework Carbon
$ ./spotlightit "my query string"
...

The MDQuery API is the primary interface for programmatically querying the Spotlight metadata store. It is a low-level procedural interface based on the MDQuery object, which is a Core Foundation-compliant object.

Cocoa and Spotlight

Mac OS X also provides an Objective-C-based API for accessing the Spotlight metadata store. The NSMetadataQuery class, which supports Cocoa bindings, provides methods for creating a query, setting the search scope, setting query attributes, running the query, and retrieving query results. It is a higher-level object-oriented wrapper^[15] around the MDQuery API. The NSMetadataItem class encapsulates a file's associated metadata. Other relevant classes are NSMetadataQueryAttributeValueTuple and NSMetadataQueryResultGroup.

^[15] NSMetadataQuery does not support synchronous queries in Mac OS X 10.4. Moreover, as query results are collected, it provides only minimal feedback through notifications.

A single query expression is of the following form:

metadata_attribute_name operator "value"[modifier]

metadata_attribute_name is the name of an attribute known to Spotlight; it can be a built-in attribute or one defined by a third-party metadata importer. The mdimport command can be used to enumerate all attributes available in the user's context.^[16]

^[16] If a metadata importer is installed locally in a user's home directory, any attributes it defines will not be seen by other users.

$ mdimport -A
...
'kMDItemAuthors'       'Authors'         'Authors of this item'
'kMDItemBitsPerSample' 'Bits per sample' 'Number of bits per sample'
'kMDItemCity'          'City'            'City of the item'
...
'kMDItemCopyright'     'Copyright'       'Copyright information about this item'
...
'kMDItemURL'           'Url'             'Url of this item'
'kMDItemVersion'       'Version'         'Version number of this item'
'kMDItemVideoBitRate'  'Video bit rate'  'Bit rate of the video in the media'
...

Note that the predefined metadata attributes include both generic (such as kMDItemVersion) and format-specific (such as kMDItemVideoBitRate) attributes.

operator can be one of the standard comparison operators, namely, ==, !=, <, >, <=, and >=.

value is the attribute's value, with any single- or double-quote characters escaped using the backslash character. An asterisk in a value string is treated as a wildcard character. value can be optionally followed by a modifier consisting of one or more of the following characters.

c specifies case-insensitive comparison.
d specifies that diacritical marks should be ignored in the comparison.
w specifies word-based comparison, with the definition of a "word" including transitions from lowercase to uppercase (e.g., "process"wc will match "GetProcessInfo").

Multiple query expressions can be combined using the && and || logical operators. Moreover, parentheses can be used for grouping.
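The expression syntax just described can be assembled mechanically. The following sketch composes a single expression of the form attribute operator "value"modifier, escaping single- and double-quote characters in the value with a backslash, as the text requires. The function name build_query_expr() is hypothetical, and the function makes no attempt to validate the attribute name, the operator, or the modifier characters.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write  attribute op "value"modifiers  into out.
 * Quote characters inside value are escaped with a backslash.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
int
build_query_expr(char *out, size_t outsize, const char *attribute,
                 const char *op, const char *value, const char *modifiers)
{
    size_t used;
    int    n;

    // attribute, operator, and the opening quote of the value
    n = snprintf(out, outsize, "%s %s \"", attribute, op);
    if (n < 0 || (size_t)n >= outsize)
        return -1;
    used = (size_t)n;

    // copy the value, escaping ' and " as we go
    for (; *value != '\0'; value++) {
        if (*value == '"' || *value == '\'') {
            if (used + 1 >= outsize)
                return -1;
            out[used++] = '\\';
        }
        if (used + 1 >= outsize)
            return -1;
        out[used++] = *value;
    }

    // closing quote, followed by the optional modifier characters
    n = snprintf(out + used, outsize - used, "\"%s", modifiers);
    if (n < 0 || (size_t)n >= outsize - used)
        return -1;

    return 0;
}
```

For example, calling the function with kMDItemAuthors, ==, Amit*, and cd yields kMDItemAuthors == "Amit*"cd. Several such strings could then be joined with && or || and passed to mdfind or to MDQueryCreate().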

Figure 11-27 shows an overview of a representative use of the MDQuery API. Note that a query normally runs in two phases. The initial phase is a results-gathering phase, wherein the metadata store is searched for files that match the given query. During this phase, progress notifications are sent to the caller depending on the values of the query's batching parameters, which can be configured using MDQuerySetBatchingParameters(). Once the initial phase has finished, another notification is sent to the caller. Thereafter, the query continues to run if it has been configured for live updates, in which case the caller will be notified if the query's results change because of files being created, deleted, or modified.

Figure 11-27. Pseudocode for creating and running a Spotlight query using the MDQuery interface

void
notificationCallback(...)
{
    if (notificationType == kMDQueryProgressNotification) {
        // Query's result list has changed during the initial
        // result-gathering phase
    } else if (notificationType == kMDQueryDidFinishNotification) {
        // Query has finished with the initial result-gathering phase
        // Disable updates by calling MDQueryDisableUpdates()
        // Process results
        // Reenable updates by calling MDQueryEnableUpdates()
    } else if (notificationType == kMDQueryDidUpdateNotification) {
        // Query's result list has changed during the live-update phase
    }
}

int
main(...)
{
    // Compose query string (a CFStringRef) to represent search expression

    // Create MDQueryRef from query string by calling MDQueryCreate()

    // Register notification callback with the process-local notification center

    // Optionally set batching parameters by calling MDQuerySetBatchingParameters()

    // Optionally set the search scope by calling MDQuerySetSearchScope()

    // Optionally set callback functions for one or more of the following:
    //     * Creating the result objects of the query
    //     * Creating the value objects of the query
    //     * Sorting the results of the query

    // Execute the query and start the run loop
}
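The batching parameters pair a maximum result count with a maximum time interval for each kind of notification. One plausible reading of such a pair, sketched below, is that a notification becomes due as soon as either limit is reached. This is a simplified model written for illustration: the structure, the function name, and the rule that nothing is sent while no results are pending are all assumptions, not a description of how Spotlight is implemented.

```c
/*
 * Hypothetical model of one count/time pair from the batching
 * parameters (for instance, the limits for progress notifications).
 */
struct batch_limit {
    unsigned long max_num; // results that may accumulate
    unsigned long max_ms;  // milliseconds that may pass
};

/*
 * Return 1 if a notification is due under this model: some results
 * are pending, and either the count limit or the time limit is hit.
 */
int
batch_is_due(const struct batch_limit *limit,
             unsigned long pending, unsigned long elapsed_ms)
{
    if (pending == 0) // assumption: nothing to report, nothing to send
        return 0;

    return pending >= limit->max_num || elapsed_ms >= limit->max_ms;
}
```

Under this model, small count limits favor responsiveness at the cost of more frequent callbacks, whereas large limits reduce callback overhead during the result-gathering phase.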

Let us write a program that uses the MDQuery API to execute a raw query and displays the results. The Finder's Smart Folders feature works by saving the corresponding search specification as a raw query in an XML file with a .savedSearch extension. When such a file is opened, the Finder displays the results of the query within. We will include support in our program for listing the contents of a smart folder; that is, we will parse the XML file to retrieve the raw query.

Figure 11-28 shows the program, which is based on the template from Figure 11-27.

Figure 11-28. A program for executing raw Spotlight queries

// lsmdquery.c

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <CoreServices/CoreServices.h>

#define PROGNAME "lsmdquery"

void
exit_usage(void)
{
    fprintf(stderr, "usage: %s -f <smart folder path>\n"
                    "       %s -q <query string>\n", PROGNAME, PROGNAME);
    exit(1);
}

void
printDictionaryAsXML(CFDictionaryRef dict)
{
    CFDataRef xml = CFPropertyListCreateXMLData(kCFAllocatorDefault,
                                                (CFPropertyListRef)dict);
    if (!xml)
        return;

    write(STDOUT_FILENO, CFDataGetBytePtr(xml), (size_t)CFDataGetLength(xml));
    CFRelease(xml);
}

void
notificationCallback(CFNotificationCenterRef  center,
                     void                    *observer,
                     CFStringRef              name,
                     const void              *object,
                     CFDictionaryRef          userInfo)
{
    CFDictionaryRef attributes;
    CFArrayRef      attributeNames;
    CFIndex         idx, count;
    MDItemRef       itemRef = NULL;
    MDQueryRef      queryRef = (MDQueryRef)object;

    if (CFStringCompare(name, kMDQueryDidFinishNotification, 0)
           == kCFCompareEqualTo) { // gathered results
        // disable updates, process results, and reenable updates
        MDQueryDisableUpdates(queryRef);
        count = MDQueryGetResultCount(queryRef);
        if (count > 0) {
            for (idx = 0; idx < count; idx++) {
                itemRef = (MDItemRef)MDQueryGetResultAtIndex(queryRef, idx);
                attributeNames = MDItemCopyAttributeNames(itemRef);
                attributes = MDItemCopyAttributes(itemRef, attributeNames);
                printDictionaryAsXML(attributes);
                CFRelease(attributes);
                CFRelease(attributeNames);
            }
            printf("\n%ld results total\n", (long)count);
        }
        MDQueryEnableUpdates(queryRef);
    } else if (CFStringCompare(name, kMDQueryDidUpdateNotification, 0)
                  == kCFCompareEqualTo) { // live update
        CFShow(name), CFShow(object), CFShow(userInfo);
    }
    // ignore kMDQueryProgressNotification
}

CFStringRef
ExtractRawQueryFromSmartFolder(const char *folderpath)
{
    int                fd, ret;
    struct stat        sb;
    UInt8             *bufp;
    CFMutableDataRef   xmlData  = NULL;
    CFPropertyListRef  pList    = NULL;
    CFStringRef        rawQuery = NULL, errorString = NULL;

    if ((fd = open(folderpath, O_RDONLY)) < 0) {
        perror("open");
        return NULL;
    }

    if ((ret = fstat(fd, &sb)) < 0) {
        perror("fstat");
        goto out;
    }

    if (sb.st_size <= 0) {
        fprintf(stderr, "no data in smart folder (%s)?\n", folderpath);
        goto out;
    }

    xmlData = CFDataCreateMutable(kCFAllocatorDefault, (CFIndex)sb.st_size);
    if (xmlData == NULL) {
        fprintf(stderr, "CFDataCreateMutable() failed\n");
        goto out;
    }
    CFDataIncreaseLength(xmlData, (CFIndex)sb.st_size);

    bufp = CFDataGetMutableBytePtr(xmlData);
    if (bufp == NULL) {
        fprintf(stderr, "CFDataGetMutableBytePtr() failed\n");
        goto out;
    }

    // the whole file must be read for the XML to parse correctly
    ret = read(fd, (void *)bufp, (size_t)sb.st_size);
    if (ret != (int)sb.st_size) {
        fprintf(stderr, "failed to read smart folder (%s)\n", folderpath);
        goto out;
    }

    pList = CFPropertyListCreateFromXMLData(kCFAllocatorDefault,
                                            xmlData,
                                            kCFPropertyListImmutable,
                                            &errorString);
    if (pList == NULL) {
        fprintf(stderr, "CFPropertyListCreateFromXMLData() failed\n");
        if (errorString) {
            CFShow(errorString);
            CFRelease(errorString);
        }
        goto out;
    }

    rawQuery = CFDictionaryGetValue(pList, CFSTR("RawQuery"));
    if (rawQuery == NULL) {
        fprintf(stderr, "failed to retrieve query from smart folder\n");
        goto out;
    }
    CFRetain(rawQuery); // retain only after the NULL check; pList is released below

out:
    close(fd);

    if (pList)
        CFRelease(pList);
    if (xmlData)
        CFRelease(xmlData);

    return rawQuery;
}

int
main(int argc, char **argv)
{
    int                     i;
    CFStringRef             rawQuery = NULL;
    MDQueryRef              queryRef;
    Boolean                 result;
    CFNotificationCenterRef localCenter;
    MDQueryBatchingParams   batchingParams;

    while ((i = getopt(argc, argv, "f:q:")) != -1) {
        switch (i) {
        case 'f':
            rawQuery = ExtractRawQueryFromSmartFolder(optarg);
            break;

        case 'q':
            rawQuery = CFStringCreateWithCString(kCFAllocatorDefault, optarg,
                                                 CFStringGetSystemEncoding());
            break;

        default:
            exit_usage();
            break;
        }
    }

    if (!rawQuery)
        exit_usage();

    queryRef = MDQueryCreate(kCFAllocatorDefault, rawQuery, NULL, NULL);
    if (queryRef == NULL)
        goto out;

    if (!(localCenter = CFNotificationCenterGetLocalCenter())) {
        fprintf(stderr, "failed to access local notification center\n");
        goto out;
    }

    CFNotificationCenterAddObserver(
        localCenter,          // process-local center
        NULL,                 // observer
        notificationCallback, // to process query finish/update notifications
        NULL,                 // observe all notifications
        (void *)queryRef,     // observe notifications for this object
        CFNotificationSuspensionBehaviorDeliverImmediately);

    // maximum number of results that can accumulate and the maximum number
    // of milliseconds that can pass before various notifications are sent
    batchingParams.first_max_num    = 1000; // first progress notification
    batchingParams.first_max_ms     = 1000;
    batchingParams.progress_max_num = 1000; // additional progress notifications
    batchingParams.progress_max_ms  = 1000;
    batchingParams.update_max_num   = 1;    // update notification
    batchingParams.update_max_ms    = 1000;
    MDQuerySetBatchingParameters(queryRef, batchingParams);

    // go execute the query
    if ((result = MDQueryExecute(queryRef, kMDQueryWantsUpdates)) == TRUE)
        CFRunLoopRun();

out:
    CFRelease(rawQuery);
    if (queryRef)
        CFRelease(queryRef);

    exit(0);
}

$ gcc -Wall -o lsmdquery lsmdquery.c -framework CoreServices
$ ./lsmdquery -f ~/Desktop/AllPDFs.savedSearch # assuming this smart folder exists
...
        <key>kMDItemFSName</key>
        <string>gimpprint.pdf</string>
        <key>kMDItemFSNodeCount</key>
        <integer>0</integer>
        <key>kMDItemFSOwnerGroupID</key>
        <integer>501</integer>
...

Technically, the program in Figure 11-28 does not necessarily list the contents of a smart folder; it only executes the raw query corresponding to the smart folder. The folder's contents will differ from the query's results if the XML file contains additional search criteria, say, for limiting search results to the user's home directory. We can extend the program to apply such criteria if they exist.

11.8.5. Spotlight Command-Line Tools

Mac OS X provides a set of command-line programs for accessing Spotlight's functionality. Let us look at a summary of these tools.

mdutil is used to manage the Spotlight metadata store for a given volume. In particular, it can enable or disable Spotlight indexing on a volume, including volumes corresponding to disk images and external disks.

mdimport can be used to explicitly trigger importing of file hierarchies into the metadata store. It is also useful for displaying information about the Spotlight system.

The -A option lists all metadata attributes, along with their localized names and descriptions, known to Spotlight.
The -X option prints the metadata schema for the built-in UTI types.
The -L option displays a list of installed metadata importers.

mdcheckschema is used to validate the given schema file, typically one belonging to a metadata importer.

mdfind searches the metadata store given a query string, which can be either a plain string or a raw query expression. Moreover, mdfind can be instructed through its -onlyin option to limit the search to a given directory. If the -live option is specified, mdfind continues running in live-update mode, printing the updated number of files that match the query.

mdls retrieves and displays all metadata attributes for the given file.

11.8.6. Overcoming Granularity Limitations

An important aspect of Spotlight is that it works at the file level; that is, the results of Spotlight queries are files, not locations or records within files. For example, even if a database has a Spotlight importer that can extract per-record information from the database's on-disk files, all queries that refer to records in a given file will result in a reference to that file. This is problematic for applications that do not store their searchable information as individual files. The Safari web browser, the Address Book application, and the iCal application are good examples.

Safari stores its bookmarks in a single property list file (~/Library/Safari/Bookmarks.plist).
Address Book stores its data in a single data file (~/Library/Application Support/AddressBook/AddressBook.data). It also uses two Search Kit index files (ABPerson.skIndexInverted and ABSubscribedPerson.skIndexInverted).
iCal maintains a directory for each calendar (~/Library/Application Support/iCal/Sources/<UUID>.calendar). Within each such directory, it maintains an index file for calendar events.

Nevertheless, Safari bookmarks, Address Book contacts, and iCal events appear in Spotlight search results as clickable entities. This is made possible by storing individual files for each of these entities and indexing these files instead of the monolithic index or data files.
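The general technique of exporting one small file per record can be sketched as follows. This is only an illustration of the idea: the function name export_record() is hypothetical, and the file content written here (a single line holding a display name) is a placeholder, since the actual formats of the per-entity files used by these applications are not described in this section.

```c
#include <stdio.h>

/*
 * Hypothetical helper: create <dir>/<id>.<ext> for a single record so
 * that a file-level indexer can return the record as its own result.
 * Returns 0 on success, -1 on failure.
 */
int
export_record(const char *dir, const char *id, const char *ext,
              const char *display_name)
{
    char  path[1024];
    FILE *fp;
    int   n;

    n = snprintf(path, sizeof(path), "%s/%s.%s", dir, id, ext);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    if ((fp = fopen(path, "w")) == NULL)
        return -1;

    // placeholder content; a real application would write whatever
    // its own metadata importer expects to parse
    fprintf(fp, "%s\n", display_name);

    return (fclose(fp) == 0) ? 0 : -1;
}
```

An application following this approach would rewrite or delete the corresponding file whenever a record changes, leaving its monolithic data file as the authoritative copy.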

$ ls ~/Library/Caches/Metadata/Safari/
...
A182FB56-AE27-11D9-A9B1-000D932C9040.webbookmark
A182FC00-AE27-11D9-A9B1-000D932C9040.webbookmark
...
$ ls ~/Library/Caches/com.apple.AddressBook/MetaData/
...
6F67C0E4-F19B-4D81-82F2-F527F45D6C74:ABPerson.abcdp
80C4CD5C-F9AE-4667-85D2-999461B8E0B4:ABPerson.abcdp
...
$ ls ~/Library/Caches/Metadata/iCal/<UUID>/
...
49C9A25D-52A3-46A7-BAAC-C33D8DC56C36%2F-.icalevent
940DE117-47DB-495C-84C6-47AF2D68664F%2F-.icalevent
...

The corresponding UTIs for the .webbookmark, .abcdp, and .icalevent files are com.apple.safari.bookmark, com.apple.addressbook.person, and com.apple.ical.bookmark, respectively. The UTIs are claimed by the respective applications. Therefore, when such a file appears in Spotlight results, clicking on the result item launches the appropriate application.

Note, however, that unlike normal search results, we do not see the filename of an Address Book contact in the Spotlight result list. The same holds for Safari bookmarks and iCal events. This is because the files in question have a special metadata attribute named kMDItemDisplayName, which is set by the metadata importers to user-friendly values such as contact names and bookmark titles. You can see the filenames if you search for these entities using the mdfind command-line program.

$ mdfind 'kMDItemContentType == com.apple.addressbook.person && kMDItemDisplayName == "Amit Singh"'
/Users/amit/Library/Caches/com.apple.AddressBook/Metadata/<UUID>:ABPerson.abcdp
$ mdls /Users/amit/Library/Caches/com.apple.AddressBook/Metadata/<UUID>:ABPerson.abcdp
...
kMDItemDisplayName             = "Amit Singh"
...
kMDItemKind                    = "Address Book Person Data"
...
kMDItemTitle                   = "Amit Singh"