11.8. Spotlight

As the capacities of commonly available storage devices continue to grow, we find it possible to store staggering amounts of information on personal computer systems. Besides, new information is continually being generated. Unfortunately, such information is merely "bytes" unless there are powerful and efficient ways to present it to humans. In particular, one must be able to search such information. By the arrival of the twenty-first century, searching had established itself as one of the most pervasive computing technologies in the context of the Internet. In comparison, typical search mechanisms in operating systems remained primitive. Although a single computer system is nowhere near the Internet in terms of the amount of information it contains, it is still a daunting task for users to search for information "manually." There are several reasons why it is difficult.
Mac OS X 10.4 introduced Spotlight, a system for extracting (or harvesting), storing, indexing, and querying metadata. It provides an integrated system-wide service for searching and indexing.

11.8.1. Spotlight's Architecture

Spotlight is a collection of both kernel- and user-level mechanisms. It can be divided into the following primary constituents:
Figure 11-21 shows how various parts of Spotlight interact with each other.

Figure 11-21. Architecture of the Spotlight system

The fsevents mechanism is an in-kernel notification system with a subscription interface for informing user-space subscribers of file system changes as they occur. Spotlight relies on this mechanism to keep its information current: it updates a volume's metadata store and content index if file system objects are added, deleted, or modified. We will discuss fsevents in Section 11.8.2.

On a volume with Spotlight indexing enabled, the /.Spotlight-V100 directory contains the volume's content index (ContentIndex.db), metadata store (store.db), and other related files. The content index is built atop Apple's Search Kit technology, which provides a framework for searching and indexing text in multiple languages. The metadata store uses a specially designed database in which each file, along with its metadata attributes, is represented as an MDItem object, which is a Core Foundation-compliant object that encapsulates the metadata. The MDItemCreate() function from the Metadata framework can be used to instantiate an MDItem object corresponding to a given pathname. Thereafter, one or more attributes can be retrieved or set[13] in the MDItem by calling other Metadata framework functions. Figure 11-22 shows a program that retrieves or sets an individual attribute of the MDItem associated with a given pathname.
Figure 11-22. Retrieving and setting an MDItem attribute
From a programming standpoint, an MDItem is a dictionary containing a unique abstract key, along with a value, for each metadata attribute associated with the file system object. Mac OS X provides a large number of predefined keys encompassing several types of metadata. As we will see in Section 11.8.3, we can enumerate all keys known to Spotlight by using the mdimport command-line program.

The Spotlight server, that is, the metadata server (mds), is the primary daemon in the Spotlight subsystem. Its duties include receiving change notifications through the fsevents interface, managing the metadata store, and serving Spotlight queries.

Spotlight uses a set of specialized plug-in bundles called metadata importers for extracting metadata from different types of documents, with each importer handling one or more specific document types. The mdimport program acts as a harness for running these importers. It can also be used to explicitly import metadata from a set of files. The Spotlight server also uses mdimport (specifically, a symbolic link to it named mdimportserver) for this purpose. An importer returns metadata for a file as a set of key-value pairs, which Spotlight adds to the volume's metadata store.
Custom metadata importers must be careful in defining what constitutes metadata. Although an importer can technically store any type of information in the metadata store by simply providing it to Spotlight, storing information that is unlikely to be useful in searching (for example, thumbnails or arbitrary binary data) will be counterproductive. The Search Kit may be a better alternative for application-specific indexing. Mac OS X applications such as Address Book, Help Viewer, System Preferences, and Xcode use the Search Kit for efficient searching of application-specific information.
11.8.2. The Fsevents Mechanism

The fsevents mechanism provides the basis for Spotlight's live updating. The kernel exports the mechanism to user space through a pseudo-device (/dev/fsevents). A program interested in learning about file system changes (a watcher in fsevents parlance) can subscribe to the mechanism by accessing this device. Specifically, a watcher opens /dev/fsevents and clones the resultant descriptor using a special ioctl operation (FSEVENTS_CLONE).
The Spotlight server is the primary subscriber of the fsevents mechanism.

The ioctl call requires a pointer to an fsevent_clone_args structure as its argument. The event_list field of this structure points to an array containing up to FSE_MAX_EVENTS elements, each of which is an int8_t value indicating the watcher's interest in the event with the corresponding index. If the value is FSE_REPORT, the kernel should report that event type to the watcher. If the value is FSE_IGNORE, the watcher is not interested in that event type. Table 11-2 lists the various event types. If the array has fewer elements than the maximum number of event types, the watcher is implicitly not interested in the remaining types. The event_queue_depth field of the fsevent_clone_args structure specifies the size of the per-watcher event queue (expressed as the number of events) that the kernel should allocate. This size is limited by MAX_KFS_EVENTS (2048).
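The interest array described above can be prepared with a small helper. The following is only a sketch: the numeric values of FSE_IGNORE, FSE_REPORT, and FSE_MAX_EVENTS are assumptions that should be checked against the fsevents.h header you build with, and build_event_list() and clamp_queue_depth() are hypothetical helper names, not part of any Apple interface. Only the MAX_KFS_EVENTS limit of 2048 comes from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values; verify against bsd/sys/fsevents.h in your xnu source. */
#define FSE_IGNORE      0
#define FSE_REPORT      1
#define FSE_MAX_EVENTS  9      /* assumption: number of event types */
#define MAX_KFS_EVENTS  2048   /* queue-depth limit quoted in the text */

/*
 * Hypothetical helper: fill `list` with FSE_IGNORE, then mark each event
 * type named in `wanted` as FSE_REPORT. Returns the number of entries in
 * the list, or -1 if an argument is out of range.
 */
static int build_event_list(int8_t *list, size_t list_len,
                            const int *wanted, size_t n_wanted)
{
    size_t i;

    assert(list != NULL);
    if (list_len == 0 || list_len > FSE_MAX_EVENTS)
        return -1;

    memset(list, FSE_IGNORE, list_len);
    for (i = 0; i < n_wanted; i++) {
        if (wanted[i] < 0 || (size_t)wanted[i] >= list_len)
            return -1;               /* no such event index */
        list[wanted[i]] = FSE_REPORT;
    }
    return (int)list_len;
}

/*
 * Hypothetical helper: keep a requested queue depth within 1..MAX_KFS_EVENTS
 * before handing it to the kernel.
 */
static int32_t clamp_queue_depth(int32_t requested)
{
    if (requested < 1)
        return 1;
    if (requested > MAX_KFS_EVENTS)
        return MAX_KFS_EVENTS;
    return requested;
}
```

A watcher would pass the filled array as event_list, the return value of build_event_list() as num_events, and the clamped depth as event_queue_depth. Clamping in user space is a defensive choice made for this sketch; it does not describe how the kernel itself treats an oversized request.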
As we will shortly see, the elements of a per-watcher event queue are not the events themselves but pointers to kfs_event structures, which are reference-counted structures that contain the actual event data. In other words, all watchers share a single event buffer in the kernel.

    int ret, fd, clonefd;
    int8_t event_list[] = { /* FSE_REPORT or FSE_IGNORE for each event type */ };
    struct fsevent_clone_args fca;
    ...
    fd = open("/dev/fsevents", O_RDONLY);
    ...
    fca.event_list = event_list;
    fca.num_events = sizeof(event_list)/sizeof(int8_t);
    fca.event_queue_depth = /* desired size of event queue in the kernel */;
    fca.fd = &clonefd;
    ret = ioctl(fd, FSEVENTS_CLONE, (char *)&fca);
    ...

Once the FSEVENTS_CLONE ioctl returns successfully, the program can close the original descriptor and read from the cloned descriptor. Note that if a watcher is interested in knowing about file system changes only on one or more specific devices, it can specify its devices of interest by using the FSEVENTS_DEVICE_FILTER ioctl on the cloned /dev/fsevents descriptor. By default, fsevents assumes that a watcher is interested in all devices.

A read call on the cloned descriptor will block until the kernel has file system changes to report. When such a read call returns successfully, the data read contains one or more events, each encapsulated in a kfs_event structure. The latter contains an array of event arguments, each of which is a structure of type kfs_event_arg_t, containing variable-size argument data. Table 11-3 shows the various possible argument types. The argument array is always terminated by the special argument type FSE_ARG_DONE.
    typedef struct kfs_event_arg {
        u_int16_t type; // argument type
        u_int16_t len;  // size of argument data that follows this field
        ...             // argument data
    } kfs_event_arg_t;

    typedef struct kfs_event {
        int32_t         type;               // event type
        pid_t           pid;                // pid of the process that performed the operation
        kfs_event_arg_t args[KFS_NUM_ARGS]; // event arguments
    } kfs_event;

Figure 11-23 shows an overview of the fsevents mechanism's implementation in the kernel. There is an fs_event_watcher structure for each subscribed watcher. The event_list field of this structure points to an array of event types. The array contains values that the watcher specified while cloning the device. The devices_to_watch field, if non-NULL, points to a list of devices the watcher is interested in. Immediately following the fs_event_watcher structure is the watcher's event queue, that is, an array of pointers to kfs_event structures, with the latter residing in the global shared event buffer (fs_event_buf). The fs_event_watcher structure's rd and wr fields act as read and write cursors, respectively. While adding an event to the watcher, if it is found that the write cursor has wrapped around and caught up with the read cursor, it means the watcher has dropped one or more events. The kernel reports dropping of events as a special event of type FSE_EVENTS_DROPPED, which has no arguments (except FSE_ARG_DONE) and contains a fake process ID of zero.

Figure 11-23. An overview of the fsevents mechanism's implementation

Events can also be dropped because the fs_event_buf global shared buffer is full, which can happen because of a slow watcher. This is a more serious condition from Spotlight's standpoint. In this case, the kernel must discard an existing event to make space for the new event being added, which means at least one watcher will not see the discarded event. To simplify implementation, the kernel delivers an FSE_EVENTS_DROPPED event to all watchers.
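The interplay of the rd and wr cursors can be illustrated with a tiny user-space model. This is a sketch of the general ring-buffer technique, not the kernel's code: struct watcher_queue, queue_push(), queue_pop(), and the deliberately small QUEUE_DEPTH are all hypothetical, and the model keeps one slot unused so that a full queue can be told apart from an empty one.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_DEPTH 4   /* hypothetical depth, kept small for illustration */

/* A simplified stand-in for a per-watcher queue of event pointers. */
struct watcher_queue {
    const void *slots[QUEUE_DEPTH]; /* pointers into a shared event buffer */
    unsigned    rd;                 /* read cursor */
    unsigned    wr;                 /* write cursor */
    int         dropped;            /* set when an event could not be queued */
};

/*
 * Queue an event pointer. If advancing the write cursor would make it
 * catch up with the read cursor, the event is not queued and the queue is
 * flagged as having dropped events. Returns 0 on success, -1 on a drop.
 */
static int queue_push(struct watcher_queue *q, const void *ev)
{
    unsigned next;

    assert(q != NULL);
    next = (q->wr + 1) % QUEUE_DEPTH;
    if (next == q->rd) {
        q->dropped = 1;
        return -1;
    }
    q->slots[q->wr] = ev;
    q->wr = next;
    return 0;
}

/* Remove and return the oldest queued pointer, or NULL if the queue is empty. */
static const void *queue_pop(struct watcher_queue *q)
{
    const void *ev;

    assert(q != NULL);
    if (q->rd == q->wr)
        return NULL;
    ev = q->slots[q->rd];
    q->rd = (q->rd + 1) % QUEUE_DEPTH;
    return ev;
}
```

In this model a reader that finds the dropped flag set would treat it the way a watcher treats FSE_EVENTS_DROPPED: some changes were missed, so cached state should be revalidated rather than trusted.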
Figure 11-24 shows how events are added to the global shared event buffer. Various functions in the VFS layer call add_fsevent() [bsd/vfs/vfs_fsevents.c] to generate events based on the return value from need_fsevent(type, vp) [bsd/vfs/vfs_fsevents.c], which takes an event type and a vnode and determines whether the event needs to be generated. need_fsevent() first checks the fs_event_type_watchers global array (see Figure 11-23), each of whose elements maintains a count of the number of watchers interested in that event type. If fs_event_type_watchers[type] is zero, an event of that type need not be generated, since no watchers are interested. Fsevents uses this array as a quick check mechanism to bail out early. Next, need_fsevent() checks each watcher to see if at least one watcher wants the event type to be reported and is interested in the device the vnode belongs to. If there is no such watcher, the event need not be generated.

Figure 11-24. Event generation in the fsevents mechanism

add_fsevent() expands certain kernel-internal event arguments into multiple user-visible arguments. For example, both FSE_ARG_VNODE and the kernel-only argument FSE_ARG_FINFO cause FSE_ARG_DEV, FSE_ARG_INO, FSE_ARG_MODE, FSE_ARG_UID, and FSE_ARG_GID to be appended to the event's argument list.

We will now write a program (let us call it fslogger) that subscribes to the fsevents mechanism and displays the change notifications as they arrive from the kernel. The program will process the argument list of each event, enhance it in certain cases (e.g., by determining human-friendly names corresponding to process, user, and group identifiers), and display the result. Figure 11-25 shows the source for fslogger.

Figure 11-25. A file system change logger based on the fsevents mechanism
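As a rough illustration of the argument-list processing such a logger performs, the sketch below walks one event in a buffer of bytes as it might be returned by read(). It is not the fslogger source. The layout it assumes (a 32-bit event type, a 32-bit pid, then arguments made of a 16-bit type, a 16-bit length, and that many data bytes, ending with a bare 16-bit FSE_ARG_DONE) is inferred from the structure description above and is consistent with the 76-byte sample event in the fslogger transcript. struct event_summary and parse_event() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FSE_ARG_DONE 0xb33f  /* terminator value, as printed by fslogger */

/* Hypothetical summary of one parsed event. */
struct event_summary {
    int32_t type;      /* event type */
    int32_t pid;       /* assumes a 32-bit pid_t */
    int     nargs;     /* arguments seen before FSE_ARG_DONE */
    size_t  consumed;  /* bytes occupied by this event */
};

/*
 * Walk a single event starting at buf. Fields are read in host byte order
 * with memcpy() to avoid unaligned accesses. Returns 0 on success, or -1
 * if the buffer ends before FSE_ARG_DONE is reached.
 */
static int parse_event(const unsigned char *buf, size_t len,
                       struct event_summary *out)
{
    size_t   off;
    uint16_t atype, alen;

    assert(buf != NULL && out != NULL);
    if (len < 8)
        return -1;
    memcpy(&out->type, buf, 4);
    memcpy(&out->pid, buf + 4, 4);
    off = 8;
    out->nargs = 0;

    for (;;) {
        if (len - off < 2)
            return -1;
        memcpy(&atype, buf + off, 2);
        off += 2;
        if (atype == FSE_ARG_DONE)
            break;                  /* terminator carries no length or data */
        if (len - off < 2)
            return -1;
        memcpy(&alen, buf + off, 2);
        off += 2;
        if (len - off < alen)
            return -1;              /* argument data is truncated */
        off += alen;                /* skip the argument data */
        out->nargs++;
    }
    out->consumed = off;
    return 0;
}
```

Because a single read can return several events, a caller would invoke parse_event() repeatedly, advancing the buffer pointer by consumed each time until the bytes returned by read() are exhausted.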
Since fslogger.c includes bsd/sys/fsevents.h, a kernel-only header file, you need the kernel source to compile fslogger.

    $ gcc -Wall -I /path/to/xnu/bsd/ -o fslogger fslogger.c
    $ sudo ./fslogger
    fsevents device cloned (fd 5)
    fslogger ready
    ...
            # another shell
            $ touch /tmp/file.txt
    => received 76 bytes
    # Event
      type          = FSE_CREATE_FILE
      pid           = 5838 (touch)
      # Details
        # type           len  data
        FSE_ARG_VNODE     22  path   = /private/tmp/file.txt
        FSE_ARG_DEV        4  fsid   = 0xe000005
        FSE_ARG_INO        4  ino    = 3431141
        FSE_ARG_MODE       4  mode   = -rw-r--r-- (0x0081a4, vnode type VREG)
        FSE_ARG_UID        4  uid    = 501 (amit)
        FSE_ARG_GID        4  gid    = 0 (wheel)
        FSE_ARG_DONE (0xb33f)
            $ chmod 600 /tmp/file.txt
    => received 76 bytes
    # Event
      type          = FSE_STAT_CHANGED
      pid           = 5840 (chmod)
      # Details
        # type           len  data
        FSE_ARG_VNODE     22  path   = /private/tmp/file.txt
        FSE_ARG_DEV        4  fsid   = 0xe000005
        FSE_ARG_INO        4  ino    = 3431141
        FSE_ARG_MODE       4  mode   = -rw------- (0x008180, vnode type VREG)
        FSE_ARG_UID        4  uid    = 501 (amit)
        FSE_ARG_GID        4  gid    = 0 (wheel)
        FSE_ARG_DONE (0xb33f)
    ...

11.8.3. Importing Metadata

Spotlight metadata includes both conventional file system metadata and other metadata that resides within files. The latter must be explicitly extracted (or harvested) from files. The extraction process must deal with different file formats and must choose what to use as metadata. For example, a metadata extractor for text files may first have to deal with multiple text encodings. Next, it may construct a list of textual keywords, perhaps even a full content index, based on the file's content. Given that there are simply too many file formats, Spotlight uses a suite of metadata importers for metadata extraction, distributing work among individual plug-ins, each of which handles one or more specific types of documents. Mac OS X includes importer plug-ins for several common document types. The mdimport command-line program can be used to display the list of installed Spotlight importers.

    $ mdimport -L
    ...
        "/System/Library/Spotlight/Image.mdimporter",
        "/System/Library/Spotlight/Audio.mdimporter",
        "/System/Library/Spotlight/Font.mdimporter",
        "/System/Library/Spotlight/PS.mdimporter",
        ...
        "/System/Library/Spotlight/Chat.mdimporter",
        "/System/Library/Spotlight/SystemPrefs.mdimporter",
        "/System/Library/Spotlight/iCal.mdimporter"
    )
In a given Mac OS X file system domain, Spotlight plug-ins reside in the Library/Spotlight/ directory. An application bundle can also contain importer plug-ins for the application's document types.

An importer plug-in claims document types it wishes to handle by specifying their content types in its bundle's Info.plist file.

    $ cat /System/Library/Spotlight/Image.mdimporter/Contents/Info.plist
    ...
        <key>LSItemContentTypes</key>
        <array>
            <string>public.jpeg</string>
            <string>public.tiff</string>
            <string>public.png</string>
            ...
            <string>com.adobe.raw-image</string>
            <string>com.adobe.photoshop-image</string>
        </array>
    ...

You can also use the lsregister support tool from the Launch Services framework to dump the contents of the global Launch Services database and therefore view the document types claimed by a metadata importer.

Mac OS X provides a simple interface for implementing metadata importer plug-ins. An importer plug-in bundle must implement the GetMetadataForFile() function, which should read the given file, extract metadata from it, and populate the provided dictionary with the appropriate attribute key-value pairs.

If multiple importer plug-ins claim a document type, Spotlight will choose the one that matches a given document's UTI most closely. In any case, Spotlight will run only one metadata importer for a given file.

    Boolean
    GetMetadataForFile(
        void                   *thisInterface,  // the CFPlugin object that is called
        CFMutableDictionaryRef  attributes,     // to be populated with metadata
        CFStringRef             contentTypeUTI, // the file's content type
        CFStringRef             pathToFile);    // the full path to the file

It is possible for an importer to be called to harvest metadata from a large number of files, say, if a volume's metadata store is being regenerated or being created for the first time. Therefore, importers should use minimal computing resources. It is also a good idea for an importer to perform file I/O that bypasses the buffer cache; this way, the buffer cache will not be polluted because of the one-time reads generated by the importer.
Unbuffered I/O can be enabled on a per-file level using the F_NOCACHE file control operation with the fcntl() system call. The Carbon File Manager API provides the noCacheMask constant to request that the data in a given read or write request not be cached.

Once the metadata store is populated for a volume, file system changes will typically be incorporated practically immediately by Spotlight, courtesy of the fsevents mechanism. However, it is possible for Spotlight to miss change notifications. The metadata store can become out of date in other situations as well, for example, if the volume is written by an older version of Mac OS X or by another operating system. In such cases, Spotlight will need to run the indexing process to bring the store up to date. Note that Spotlight does not serve queries while the indexing process is running, although the volume can be written normally during this time, and the resultant file system changes will be captured by the indexing process as it runs.
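Returning to the advice that importers bypass the buffer cache, the F_NOCACHE operation mentioned above might be applied as in the following sketch. open_uncached() is a hypothetical helper name. The fcntl() call is compiled in only where F_NOCACHE is defined, and the sketch treats a failure to disable caching as non-fatal, which is a choice made here for illustration rather than a documented requirement.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a file for reading and ask the kernel not to
 * cache the data read through this descriptor. Returns the descriptor,
 * or -1 if the file cannot be opened.
 */
static int open_uncached(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

#ifdef F_NOCACHE
    /* Best effort: if this fails, reads simply go through the cache. */
    (void)fcntl(fd, F_NOCACHE, 1);
#endif

    return fd;
}
```

An importer would read the file through the returned descriptor and close() it when done, so that one-time scans of many files leave the cache contents of other programs largely undisturbed.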
The Spotlight server does not index temporary files residing in the /tmp directory. It also does not index any directory whose name ends with the .noindex or .build suffix; Xcode uses the latter type for storing files (other than targets) generated during a project build.

11.8.4. Querying Spotlight

Spotlight provides several ways for end users and programmers to query files and folders based on several types of metadata: importer-harvested metadata, conventional file system metadata, and file content (in the case of files whose content has been indexed by Spotlight).

The Mac OS X user interface integrates Spotlight querying in the menu bar and the Finder. For example, a Spotlight search can be initiated by clicking on the Spotlight icon in the menu bar and typing a search string. Clicking on Show All in the list of search results, if any, brings up the dedicated Spotlight search window. Programs can also launch the search window to display results of searching for a given string. Figure 11-26 shows an example.

Figure 11-26. Programmatically launching the Spotlight search window
The MDQuery API is the primary interface for programmatically querying the Spotlight metadata store. It is a low-level procedural interface based on the MDQuery object, which is a Core Foundation-compliant object.
A single query expression is of the following form:

    metadata_attribute_name operator "value"[modifier]

metadata_attribute_name is the name of an attribute known to Spotlight; it can be a built-in attribute or one defined by a third-party metadata importer. The mdimport command can be used to enumerate all attributes available in the user's context.[16]
    $ mdimport -A
    ...
    'kMDItemAuthors'        'Authors'          'Authors of this item'
    'kMDItemBitsPerSample'  'Bits per sample'  'Number of bits per sample'
    'kMDItemCity'           'City'             'City of the item'
    ...
    'kMDItemCopyright'      'Copyright'        'Copyright information about this item'
    ...
    'kMDItemURL'            'Url'              'Url of this item'
    'kMDItemVersion'        'Version'          'Version number of this item'
    'kMDItemVideoBitRate'   'Video bit rate'   'Bit rate of the video in the media'
    ...
Note that the predefined metadata attributes include both generic (such as kMDItemVersion) and format-specific (such as kMDItemVideoBitRate) attributes. operator can be one of the standard comparison operators, namely, ==, !=, <, >, <=, and >=. value is the attribute's value, with any single- or double-quote characters escaped using the backslash character. An asterisk in a value string is treated as a wildcard character. value can be optionally followed by a modifier consisting of one or more of the following characters.
Multiple query expressions can be combined using the && and || logical operators. Moreover, parentheses can be used for grouping.

Figure 11-27 shows an overview of a representative use of the MDQuery API. Note that a query normally runs in two phases. The initial phase is a results-gathering phase, wherein the metadata store is searched for files that match the given query. During this phase, progress notifications are sent to the caller depending on the values of the query's batching parameters, which can be configured using MDQuerySetBatchingParameters(). Once the initial phase has finished, another notification is sent to the caller. Thereafter, the query continues to run if it has been configured for live updates, in which case the caller will be notified if the query's results change because of files being created, deleted, or modified.

Figure 11-27. Pseudocode for creating and running a Spotlight query using the MDQuery interface
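The query-expression syntax described earlier (attribute name, comparison operator, quoted value with escaped quote characters, and an optional modifier) lends itself to a small string-building helper. The sketch below is only illustrative: build_query_expr() is a hypothetical function, not part of the MDQuery API, and it passes the modifier string through without validating it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write  attr op "value"modifier  into out, placing a
 * backslash before any single- or double-quote character in value.
 * Returns 0 on success, or -1 if out is too small.
 */
static int build_query_expr(char *out, size_t outlen,
                            const char *attr, const char *op,
                            const char *value, const char *modifier)
{
    const char *p;
    size_t n;
    int w;

    assert(out && attr && op && value && modifier);

    w = snprintf(out, outlen, "%s %s \"", attr, op);
    if (w < 0 || (size_t)w >= outlen)
        return -1;
    n = (size_t)w;

    for (p = value; *p != '\0'; p++) {
        if (*p == '"' || *p == '\'') {
            if (n + 1 >= outlen)
                return -1;
            out[n++] = '\\';        /* escape the quote character */
        }
        if (n + 1 >= outlen)
            return -1;
        out[n++] = *p;
    }

    /* Closing quote followed by the optional modifier characters. */
    w = snprintf(out + n, outlen - n, "\"%s", modifier);
    if (w < 0 || (size_t)w >= outlen - n)
        return -1;
    return 0;
}
```

For instance, calling the helper with kMDItemAuthors, ==, the value O'Brien*, and the modifier c (assumed here to request case-insensitive matching) yields the string kMDItemAuthors == "O\'Brien*"c. Two such strings can then be joined with snprintf() using a format like "(%s) && (%s)" to form a compound query.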
Let us write a program that uses the MDQuery API to execute a raw query and displays the results. The Finder's Smart Folders feature works by saving the corresponding search specification as a raw query in an XML file with a .savedSearch extension. When such a file is opened, the Finder displays the results of the query within. We will include support in our program for listing the contents of a smart folder; that is, we will parse the XML file to retrieve the raw query. Figure 11-28 shows the program, which is based on the template from Figure 11-27.

Figure 11-28. A program for executing raw Spotlight queries
Technically, the program in Figure 11-28 does not necessarily list the contents of a smart folder; it only executes the raw query corresponding to the smart folder. The folder's contents will be different from the query's result if the XML file contains additional search criteria, say, for limiting search results to the user's home directory. We can extend the program to apply such criteria if they exist.

11.8.5. Spotlight Command-Line Tools

Mac OS X provides a set of command-line programs for accessing Spotlight's functionality. Let us look at a summary of these tools.

mdutil is used to manage the Spotlight metadata store for a given volume. In particular, it can enable or disable Spotlight indexing on a volume, including volumes corresponding to disk images and external disks.

mdimport can be used to explicitly trigger importing of file hierarchies into the metadata store. It is also useful for displaying information about the Spotlight system.
mdcheckschema is used to validate the given schema file, typically one belonging to a metadata importer.

mdfind searches the metadata store given a query string, which can be either a plain string or a raw query expression. Moreover, mdfind can be instructed through its -onlyin option to limit the search to a given directory. If the -live option is specified, mdfind continues running in live-update mode, printing the updated number of files that match the query.

mdls retrieves and displays all metadata attributes for the given file.

11.8.6. Overcoming Granularity Limitations

An important aspect of Spotlight is that it works at the file level; that is, the results of Spotlight queries are files, not locations or records within files. For example, even if a database has a Spotlight importer that can extract per-record information from the database's on-disk files, all queries that refer to records in a given file will result in a reference to that file. This is problematic for applications that do not store their searchable information as individual files. The Safari web browser, the Address Book application, and the iCal application are good examples.
Nevertheless, Safari bookmarks, Address Book contacts, and iCal events appear in Spotlight search results as clickable entities. This is made possible by storing individual files for each of these entities and indexing these files instead of the monolithic index or data files.

    $ ls ~/Library/Caches/Metadata/Safari/
    ...
    A182FB56-AE27-11D9-A9B1-000D932C9040.webbookmark
    A182FC00-AE27-11D9-A9B1-000D932C9040.webbookmark
    ...
    $ ls ~/Library/Caches/com.apple.AddressBook/MetaData/
    ...
    6F67C0E4-F19B-4D81-82F2-F527F45D6C74:ABPerson.abcdp
    80C4CD5C-F9AE-4667-85D2-999461B8E0B4:ABPerson.abcdp
    ...
    $ ls ~/Library/Caches/Metadata/iCal/<UUID>/
    ...
    49C9A25D-52A3-46A7-BAAC-C33D8DC56C36%2F-.icalevent
    940DE117-47DB-495C-84C6-47AF2D68664F%2F-.icalevent
    ...

The corresponding UTIs for the .webbookmark, .abcdp, and .icalevent files are com.apple.safari.bookmark, com.apple.addressbook.person, and com.apple.ical.bookmark, respectively. The UTIs are claimed by the respective applications. Therefore, when such a file appears in Spotlight results, clicking on the result item launches the appropriate application.

Note, however, that unlike normal search results, we do not see the filename of an Address Book contact in the Spotlight result list. The same holds for Safari bookmarks and iCal events. This is because the files in question have a special metadata attribute named kMDItemDisplayName, which is set by the metadata importers to user-friendly values such as contact names and bookmark titles. You can see the filenames if you search for these entities using the mdfind command-line program.