

Suppose that our user is interested in the value of a sensitive query Q, which henceforth will be referred to as the targetquery. If he asks Q, the queryanswering system of the statistical database will refuse to answer Q. But, as shown in the previous examples, the user could exactly compute or estimate the value of Q from the values of his (nonsensitive) queries that have been previously answered. We now introduce a formal method which allows the user to decide if the targetquery is indirectly answered in an exact or approximate way given the set of previously answered queries. In what follows, we limit our considerations to the case that all queries posed by the user are either count queries or sum queries sharing the summary attribute, so that queries are fully specified by their category predicates; moreover, we assume that the values of each descriptive attribute are exhaustive and mutually exclusive. By b = (b′, b′′,…) we denote the multidimensional attribute whose components are the descriptive attributes of the records in the database; we shall refer to b as the category variable. Let B be the domain of b; that is, B is the Cartesian product of the domains of b′, b′′,…. A query is called disjunctive if its category predicate is given by
where B is a nonempty subset of B; such a query will be denoted by Q(B). Note that the number of possible distinct disjunctive queries is 2^{B}−1. Owing to the exhaustive and mutually exclusive nature of values of the descriptive attributes, every nonnull query is equivalent to a disjunctive query (Malvestuto & Moscarini, 1990) since its category predicate is logically equivalent to a formula such as (1). Therefore, without loss of generality, we shall consider disjunctive queries only, and in what follows the adjective "disjunctive" is omitted.
Suppose that, before asking the targetquery, the user posed n queries Q_{1},…, Q_{n} which were all answered. Let Q = {Q_{1},…, Q_{n}}, let Q_{v} = Q(B_{v}) where B_{v} is a nonempty subset of B, and let q_{v} be the value of Q_{v}, 1 ≤ v ≤ n. The set
will be referred to as the active domain of the category variable (with respect to Q). By the characteristic partition of the active domain of b, we mean its coarsest partition γ = {C_{1},…, C_{m}} such that the sets B_{v} belong to the set field generated by γ, that is, each B_{v} can be obtained as the union of one or more classes of γ. Note that m is not greater than the size of the active domain of b. Let V = {1,…, n} and E = {e_{1},…, e_{m}} where
The sets V and E will be respectively viewed as being the vertex set and the edge set of a hypergraph H = (V, E), which will be referred to as the hypergraph associated with Q. Note that, if each edge of H contains less than three vertices, then H can be pictured as an ordinary graph and the query set Q is called graphical.
Example 2 (continued): Consider again the three sum queries Q_{1}, Q_{2}, and Q_{3}, and let Q = {Q_{1}, Q_{2}, Q_{3}}. Then, with our notation we have
B_{1} = {Direction,Administration},
B_{2} = {Administration,Services},
B_{3} = {Direction,Services}.
The active domain of department is {Direction, Administration, Services}, and its characteristic partition is the point partition; that is, γ = {C_{1}, C_{2}, C_{3}} where
C_{1} = {Direction}
C_{1} = {Administration}
C_{1} = {Services}.
Thus, V = {1, 2, 3} and E = {e_{1}, e_{2}, e_{3}} with e_{1} = {1, 3}, e_{2} = {1, 2}, and e_{3} = {2, 3}. The hypergraph associated with Q is the graph shown in Figure 1 where, for each vertex v, the value of Q_{v} is also reported.
Figure 1: The Graph Associated with Three Queries
Let D be the data type of the values q_{v}. Thus, if the queries Q_{v} are count queries, then D is the set Z_{0}^{+} of nonnegative integers, and if the queries Q_{v} are sum queries, then D is the domain of the summary attribute (typically, D is the set of real numbers, or the set of nonnegative real numbers, or the set Z_{0}^{+} of nonnegative integers). The constraint system associated with Q is the equation system
where E(v) is the set of edges of H containing vertex v, with the addition of the m type constraints
Note that, if H = (h_{v})_{v} ∈ V is the incidence matrix of the hypergraph H and q = (q_{v})_{v} ∈ V, then the equation system above can be written in matrix notation as
The constraint system associated with Q can be summarized in the vertexweighted hypergraph (H, q), which will be referred to as the answer map of Q. We shall show how it allows the user to decide if the targetquery Q is indirectly answered given Q, by which we mean that either the value of Q can be exactly computed from Q or the range of the feasible values of Q given Q is bounded. Before giving the decision algorithms for these two cases, we need to introduce some further definitions.
Given a nonempty set F of edges of H, let
By the query corresponding to F, we mean the query Q(C[F]), that is, the query with category predicate
Note that, if F = E(v) for some vertex v of H, then C[F] = B_{v} and the query corresponding to F is exactly Q_{v}. Moreover, if F is a singleton (that is, if F contains exactly one edge), then the query corresponding to F is called an elementary query.
The closure of Q is the set of the queries of the type Q(C[F]) such that the sum expression ∑_{e} ∊ F x(e) is a Dinvariant in the answer map (H, q), by which we mean that, for every two solutions x_{1} and x_{2} of the constraint system associated with Q, one has ∑_{e} ∊ F x_{1}(e) = ∼_{e} ∊ F x_{2}(e). Thus, if the targetquery Q belongs to the closure of Q, then the value of Q can be exactly computed from Q, that is, Q is implicitly answered given Q.
Example 2 (continued): Assume that the domain of salary is. Then, the system of linear constraints associated with Q reads
Since there is exactly one solution with
every sum query on salary whose category predicate contains only values of department from {Direction, Administration, Services} belongs to the closure of Q. It follows that the sensitive query Q4 is implicitly answered given Q.
Consider now the case that the targetquery Q does not belong to the closure of Q but is covered by Q in the sense that there exists a nonempty set F of edges of H such that Q = Q(C[F]). Then the user can determine the tightest lower bound, say α_{1}, and the tightest upper bound, say α_{2}, on the value of Q by solving the following two problems:
If α_{1} ≠ −∞ and α_{2} ≠ +∞, then the targetquery Q is indirectly answered in an approximate way and the intersection of D with the real interval [α_{1}, α_{2}] is called the feasibility range for the value of Q.
Example 3: Let Q = {Q_{1}, Q_{2}, Q_{3}} be a set of additive queries with values from D whose answer map is the vertexweighted graph shown in Figure 2.
Figure 2
Thus, the equation system in the constraint system associated with Q contains the following three equations
and the parametric expression of the general solution of this equation system is
Consider now the type constraints
We now separately discuss the three cases and D = Z_{0}^{+}.
If , then the parameter ρ ranges from −∞ to +∞. In this case, it is easy to see that the closure of Q only contains the queries corresponding to the edge sets E(1), E(2), and E(3) and, hence, it coincides with Q. Moreover, for any other nonempty edge set, the tightest lower and upper bounds on the value of the corresponding query are −∞ to +∞, respectively.
If then the parameter ρ ranges from 0 to 1/2. Again, the closure of Q coincides with Q. But, for each nonempty edge set F, the feasibility range for the value of the query corresponding to F is bounded. For example, the feasibility range for the value of the query corresponding to F = {e_{1}, e_{3}} is the real interval [1/2, 2]. To sum up, every query covered by Q and not in Q is indirectly answered in an approximate way.
If D = Z_{0}^{+}, then ρ = 0 and, hence, the closure of Q coincides with the set of queries covered by Q.
In the rest of this section, we discuss the computational aspects of the problem of testing the membership of the targetquery Q in the closure of Q. Let Q = Q(B) where B is a nonempty subset of the domain B of b. The following procedure allows the user to decide if Q is covered by Q:
COVERING TEST
Step 1. Set F := Ø.
Step 2. For k = 1,…, m do
if Ck is a subset of B then set F := F ∪ {ek} and B := B − Ck.
Step 3. If B = Ø then and only then conclude that Q is covered by Q.
Assume now that the targetquery Q is covered by Q and that B = C(F) where F is the output of Covering Test. Then Q belongs to the closure of Q if and only if the sum expression Σe ∈ F x(e) is a Dinvariant of the answer map (H, q). We now examine the three cases , and D =Z_{0}^{+}.
The Case
To decide if the sum expression Σe ∈ F x(e) is an invariant of (H, q), it is sufficient to check that the incidence vector f of F in H is a linear combination of the rows h_{v} of the incidence matrix H of H, that is, that the equation system
is consistent (Malvestuto & Moscarini, 1990). Note that if the sum expression Σe ∈ F x(e) is an invariant of (H, q) and y is a solution of equation system (3), then the targetquery belongs to the closure of Q and the value of Q is given by
It should be noted that if the sum expression Σe ∈ F x(e) is not an invariant of (H, q), then the tightest lower and upper bounds on the value of the targetquery are −∞ to +∞, respectively.
The Case
One can apply linear programming methods to compute the tightest lower bound α_{1} and the tightest upper bound α_{2} on the value of the sum expression Σe ∈ F x(e), and conclude that it is an invariant of (H, q) if and only if α_{1} = α_{2}. We now present an alternative method. Let kernel(H) denote the set of edges e of H such that x(e) = 0 for every solution x of the constraint system associated with Q. The set kernel(H) can be determined by solving for each edge e of H the following linear programming problem:
find the maximum value α(e) of the variable x(e) subject to the constraint system associated with Q.
Then, kernel(H) is given by the set of edges e of H for which α(e) = 0. Let K be the incidence matrix of the hypergraph K obtained from H by deleting all the edges in kernel(H), and let g be the incidence vector of the edge set F−kernel(H) in K. Then, the sum expression Σe ∈ F x(e) is an invariant of (H, q) if and only if the equation system
is consistent (Malvestuto, 1993).
The Case D = Z_{0}^{+}
One can apply integer linear programming methods to compute the tightest lower bound α_{1} and the tightest upper bound α_{2} on the value of the sum expression Σe ∈ F x(e), and conclude that it is a Z_{0}^{+}invariant of (H, q) if and only if α_{1} = α_{2}. Note that if β_{1} and β_{2} are respectively the tightest lower and upper bounds on the value of the sum expression Σe ∈ F x(e) under the assumption D = Z_{0}^{+}, then
It is also worth mentioning that if the incidence matrix of H is "totally unimodular" (Garfinkel & Nemhauser, 1972)—and this is the case if H is a bipartite graph—then relaxing the integrality constraints does not affect the values of the tightest lower and upper bounds on the value of the sum expression Σe ∈ F x(e), that is, α_{1} = β_{1} and α_{2} = β_{2}.

