THE USER S INFERENCE MODEL | Multidimensional Databases: Problems and Solutions

THE USER'S INFERENCE MODEL

Suppose that our user is interested in the value of a sensitive query Q, which henceforth will be referred to as the target-query. If he asks Q, the query-answering system of the statistical database will refuse to answer Q. But, as shown in the previous examples, the user could exactly compute or estimate the value of Q from the values of his (non-sensitive) queries that have been previously answered. We now introduce a formal method which allows the user to decide if the target-query is indirectly answered in an exact or approximate way given the set of previously answered queries. In what follows, we limit our considerations to the case that all queries posed by the user are either count queries or sum queries sharing the summary attribute, so that queries are fully specified by their category predicates; moreover, we assume that the values of each descriptive attribute are exhaustive and mutually exclusive. By b = (b′, b′′,…) we denote the multidimensional attribute whose components are the descriptive attributes of the records in the database; we shall refer to b as the category variable. Let B be the domain of b; that is, B is the Cartesian product of the domains of b′, b′′,…. A query is called disjunctive if its category predicate is given by

where B is a nonempty subset of B; such a query will be denoted by Q(B). Note that the number of possible distinct disjunctive queries is 2^|B|−1. Owing to the exhaustive and mutually exclusive nature of values of the descriptive attributes, every nonnull query is equivalent to a disjunctive query (Malvestuto & Moscarini, 1990) since its category predicate is logically equivalent to a formula such as (1). Therefore, without loss of generality, we shall consider disjunctive queries only, and in what follows the adjective "disjunctive" is omitted.

Suppose that, before asking the target-query, the user posed n queries Q₁,…, Q_n which were all answered. Let Q = {Q₁,…, Q_n}, let Q_v = Q(B_v) where B_v is a nonempty subset of B, and let q_v be the value of Q_v, 1 ≤ v ≤ n. The set

will be referred to as the active domain of the category variable (with respect to Q). By the characteristic partition of the active domain of b, we mean its coarsest partition γ = {C₁,…, C_m} such that the sets B_v belong to the set field generated by γ, that is, each B_v can be obtained as the union of one or more classes of γ. Note that m is not greater than the size of the active domain of b. Let V = {1,…, n} and E = {e₁,…, e_m} where

The sets V and E will be respectively viewed as being the vertex set and the edge set of a hypergraph H = (V, E), which will be referred to as the hypergraph associated with Q. Note that, if each edge of H contains less than three vertices, then H can be pictured as an ordinary graph and the query set Q is called graphical.

Example 2 (continued): Consider again the three sum queries Q₁, Q₂, and Q₃, and let Q = {Q₁, Q₂, Q₃}. Then, with our notation we have

B₁ = {Direction,Administration},

B₂ = {Administration,Services},

B₃ = {Direction,Services}.

The active domain of department is {Direction, Administration, Services}, and its characteristic partition is the point partition; that is, γ = {C₁, C₂, C₃} where

C₁ = {Direction}

C₁ = {Administration}

C₁ = {Services}.

Thus, V = {1, 2, 3} and E = {e₁, e₂, e₃} with e₁ = {1, 3}, e₂ = {1, 2}, and e₃ = {2, 3}. The hypergraph associated with Q is the graph shown in Figure 1 where, for each vertex v, the value of Q_v is also reported.

Figure 1: The Graph Associated with Three Queries

Let D be the data type of the values q_v. Thus, if the queries Q_v are count queries, then D is the set Z₀⁺ of nonnegative integers, and if the queries Q_v are sum queries, then D is the domain of the summary attribute (typically, D is the set of real numbers, or the set of nonnegative real numbers, or the set Z₀⁺ of nonnegative integers). The constraint system associated with Q is the equation system

where E(v) is the set of edges of H containing vertex v, with the addition of the m type constraints

Note that, if H = (h_v)_v ∈ V is the incidence matrix of the hypergraph H and q = (q_v)_v ∈ V, then the equation system above can be written in matrix notation as

The constraint system associated with Q can be summarized in the vertex-weighted hypergraph (H, q), which will be referred to as the answer map of Q. We shall show how it allows the user to decide if the target-query Q is indirectly answered given Q, by which we mean that either the value of Q can be exactly computed from Q or the range of the feasible values of Q given Q is bounded. Before giving the decision algorithms for these two cases, we need to introduce some further definitions.

Given a nonempty set F of edges of H, let

By the query corresponding to F, we mean the query Q(C[F]), that is, the query with category predicate

Note that, if F = E(v) for some vertex v of H, then C[F] = B_v and the query corresponding to F is exactly Q_v. Moreover, if F is a singleton (that is, if F contains exactly one edge), then the query corresponding to F is called an elementary query.

The closure of Q is the set of the queries of the type Q(C[F]) such that the sum expression ∑_e ∊ F x(e) is a D-invariant in the answer map (H, q), by which we mean that, for every two solutions x₁ and x₂ of the constraint system associated with Q, one has ∑_e ∊ F x₁(e) = ∼_e ∊ F x₂(e). Thus, if the target-query Q belongs to the closure of Q, then the value of Q can be exactly computed from Q, that is, Q is implicitly answered given Q.

Example 2 (continued): Assume that the domain of salary is. Then, the system of linear constraints associated with Q reads

Since there is exactly one solution with

every sum query on salary whose category predicate contains only values of department from {Direction, Administration, Services} belongs to the closure of Q. It follows that the sensitive query Q4 is implicitly answered given Q.

Consider now the case that the target-query Q does not belong to the closure of Q but is covered by Q in the sense that there exists a nonempty set F of edges of H such that Q = Q(C[F]). Then the user can determine the tightest lower bound, say α₁, and the tightest upper bound, say α₂, on the value of Q by solving the following two problems:

If α₁ ≠ −∞ and α₂ ≠ +∞, then the target-query Q is indirectly answered in an approximate way and the intersection of D with the real interval [α₁, α₂] is called the feasibility range for the value of Q.

Example 3: Let Q = {Q₁, Q₂, Q₃} be a set of additive queries with values from D whose answer map is the vertex-weighted graph shown in Figure 2.

Figure 2

Thus, the equation system in the constraint system associated with Q contains the following three equations

and the parametric expression of the general solution of this equation system is

Consider now the type constraints

We now separately discuss the three cases and D = Z₀⁺.

If , then the parameter ρ ranges from −∞ to +∞. In this case, it is easy to see that the closure of Q only contains the queries corresponding to the edge sets E(1), E(2), and E(3) and, hence, it coincides with Q. Moreover, for any other nonempty edge set, the tightest lower and upper bounds on the value of the corresponding query are −∞ to +∞, respectively.

If then the parameter ρ ranges from 0 to 1/2. Again, the closure of Q coincides with Q. But, for each nonempty edge set F, the feasibility range for the value of the query corresponding to F is bounded. For example, the feasibility range for the value of the query corresponding to F = {e₁, e₃} is the real interval [1/2, 2]. To sum up, every query covered by Q and not in Q is indirectly answered in an approximate way.

If D = Z₀⁺, then ρ = 0 and, hence, the closure of Q coincides with the set of queries covered by Q.

In the rest of this section, we discuss the computational aspects of the problem of testing the membership of the target-query Q in the closure of Q. Let Q = Q(B) where B is a nonempty subset of the domain B of b. The following procedure allows the user to decide if Q is covered by Q:

COVERING TEST

Step 1. Set F := Ø.

Step 2. For k = 1,…, m do

if Ck is a subset of B then set F := F ∪ {ek} and B := B − Ck.

Step 3. If B = Ø then and only then conclude that Q is covered by Q.

Assume now that the target-query Q is covered by Q and that B = C(F) where F is the output of Covering Test. Then Q belongs to the closure of Q if and only if the sum expression Σe ∈ F x(e) is a D-invariant of the answer map (H, q). We now examine the three cases , and D =Z₀⁺.

The Case

To decide if the sum expression Σe ∈ F x(e) is an -invariant of (H, q), it is sufficient to check that the incidence vector f of F in H is a linear combination of the rows h_v of the incidence matrix H of H, that is, that the equation system

is consistent (Malvestuto & Moscarini, 1990). Note that if the sum expression Σe ∈ F x(e) is an -invariant of (H, q) and y is a solution of equation system (3), then the target-query belongs to the closure of Q and the value of Q is given by

It should be noted that if the sum expression Σe ∈ F x(e) is not an -invariant of (H, q), then the tightest lower and upper bounds on the value of the target-query are −∞ to +∞, respectively.

The Case

One can apply linear programming methods to compute the tightest lower bound α₁ and the tightest upper bound α₂ on the value of the sum expression Σe ∈ F x(e), and conclude that it is an -invariant of (H, q) if and only if α₁ = α₂. We now present an alternative method. Let kernel(H) denote the set of edges e of H such that x(e) = 0 for every solution x of the constraint system associated with Q. The set kernel(H) can be determined by solving for each edge e of H the following linear programming problem:

find the maximum value α(e) of the variable x(e) subject to the constraint system associated with Q.

Then, kernel(H) is given by the set of edges e of H for which α(e) = 0. Let K be the incidence matrix of the hypergraph K obtained from H by deleting all the edges in kernel(H), and let g be the incidence vector of the edge set F−kernel(H) in K. Then, the sum expression Σe ∈ F x(e) is an -invariant of (H, q) if and only if the equation system

is consistent (Malvestuto, 1993).

The Case D = Z₀⁺

One can apply integer linear programming methods to compute the tightest lower bound α₁ and the tightest upper bound α₂ on the value of the sum expression Σe ∈ F x(e), and conclude that it is a Z₀⁺-invariant of (H, q) if and only if α₁ = α₂. Note that if β₁ and β₂ are respectively the tightest lower and upper bounds on the value of the sum expression Σe ∈ F x(e) under the assumption D = Z₀⁺, then

It is also worth mentioning that if the incidence matrix of H is "totally unimodular" (Garfinkel & Nemhauser, 1972)—and this is the case if H is a bipartite graph—then relaxing the integrality constraints does not affect the values of the tightest lower and upper bounds on the value of the sum expression Σe ∈ F x(e), that is, α₁ = β₁ and α₂ = β₂.