DATA MINING


Initial interrogation and analysis of the cross-cultural dilemma database has followed convention using statistical confidence tests.

However, the classical approach does not extract all the potential information waiting to be discovered. Are there hidden patterns and exciting concepts deep within the database that would only be revealed if we knew in advance what they were and could test for them? Ongoing research with the database is concerned with mining for such information using more recent techniques of analysis, including neural networks and data mining.

Because our primary data on cross-cultural issues of marketing contain much categorical and ordinal data, classical parametric methods could not be used; many such problems are overcome by this new approach. The technique enables the relative contribution of different factors to be determined and is easily transferable to other problems, e.g., the relative importance of attributes of products (such as price, color, style, availability) or services. It thereby plays the same role for our categorical data as traditional conjoint analysis, so loved by marketers, plays for their parametric market research data.

Trompenaars' model of cross culture is based on seven scales (dimensions), and this serves as the basic framework for conceiving a new approach to marketing across cultures. These scales are composed of combinations of a number of smaller components. We need a method of analysis to probe the relative influence of the different items that were collected, such as age, gender, and religion. In this respect, even "country" can be considered as simply another categorical item.

In this discussion, our model can be considered in the following form (for each dimension):

$$\text{dimension score} = c_1 \cdot \text{country} + c_2 \cdot \text{age} + c_3 \cdot \text{religion} + c_4 \cdot \text{gender} + c_5 \cdot \ldots$$

It is tempting to "throw" established statistical techniques at the data to identify possible coefficients (c1, c2, c3, etc.) using correlation and partial-correlation analysis or factor analysis. Other authors have often done just that, either with their own more limited data sets or with incomplete or extracted sets of our earlier published data. This has been especially true of researchers whose skills are limited to classical parametric methods, and these are not valid for our data.

On examination of the data, we note that these parametric methods are not appropriate. Many of the data items are simply categories (nominal data) such as gender, religion, or management function. Classical non-parametric statistical methods are not readily available for our particular problem, and certainly none are included in industry-standard statistical software. Whilst analysis of variance and (categorical) conjoint analysis can help with questionnaire design and testing, they cannot produce the analysis we require here.

In order to explore our data set we therefore need to apply a different body of mathematics which is appropriate for our cause. Recent developments in relational database technology, database mining methods, and knowledge elicitation (Expert Systems) came to our rescue. The following treatment is based on the ID3 induction algorithm. Because these new techniques may not be familiar to the reader and because of their importance in our debate, we will give a short explanation rather than simply quote the results.

For the purpose of discussion, consider a very small but typical portion of our database based on ten cases. (Note: these are for the illustration of these new methods of analysis and these cases are not intended to imply or categorize any stereotypes through these examples.)

| Case | Purchasing decision | Country | Function | Gender |
|------|---------------------|---------|----------------|--------|
| 1 | universalist | US | senior manager | male |
| 2 | universalist | UK | junior manager | male |
| 3 | particularist | UK | senior manager | female |
| 4 | universalist | US | senior manager | female |
| 5 | particularist | VEN | senior manager | female |
| 6 | particularist | VEN | senior manager | male |
| 7 | particularist | UK | senior manager | male |
| 8 | particularist | VEN | junior manager | male |
| 9 | universalist | UK | junior manager | female |
| 10 | universalist | US | junior manager | male |

In the domain of data mining, the various items are called "attributes" rather than factors or variables. This helps to distinguish them from the terminology of parametric factor analysis methods. For simplification at this stage, the first attribute, the "dimension score" (here the purchasing decision), has been given only two values, namely whether a respondent is likely to adopt a "universalist" or "particularist" purchasing decision. This is called the goal attribute.

We shall see later how we can use data mining where the goal attribute is not restricted in this way to two extreme values. Indeed, any of the attributes can be multistate.

The basic principle is to find the relative importance of the various attributes in determining the goal attribute. If we normalize (arrange) the data to the so-called third normal form in separate tables (as we would for representation in a relational database), we obtain:

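Table 1: Cases Sorted by Country

| Case | Purchasing decision | Country |
|------|---------------------|---------|
| 5 | particularist | VEN |
| 6 | particularist | VEN |
| 8 | particularist | VEN |
| 2 | universalist | UK |
| 3 | particularist | UK |
| 7 | particularist | UK |
| 9 | universalist | UK |
| 1 | universalist | US |
| 4 | universalist | US |
| 10 | universalist | US |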

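Table 2: Cases Sorted by Manager Function

| Case | Purchasing decision | Function |
|------|---------------------|----------|
| 3 | particularist | senior |
| 1 | universalist | senior |
| 5 | particularist | senior |
| 6 | particularist | senior |
| 7 | particularist | senior |
| 4 | universalist | senior |
| 2 | universalist | junior |
| 8 | particularist | junior |
| 9 | universalist | junior |
| 10 | universalist | junior |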

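Table 3: Cases Sorted by Gender

| Case | Purchasing decision | Gender |
|------|---------------------|--------|
| 1 | universalist | male |
| 2 | universalist | male |
| 6 | particularist | male |
| 7 | particularist | male |
| 8 | particularist | male |
| 10 | universalist | male |
| 3 | particularist | female |
| 4 | universalist | female |
| 5 | particularist | female |
| 9 | universalist | female |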

When we look at the attribute gender in Table 3, we see that we can't determine the goal attribute (i.e., whether males or females are universalistic or particularistic in their purchasing decisions) from a given gender.

Similarly, for either a junior or senior manager function, the goal attribute can't be uniquely determined from Table 2. When we look at the attribute country in Table 1, we find that in all cases where, for example, country = US, we can correctly determine that the goal is universalistic. If we know "country," we can correctly classify six of the ten examples in our data set (the three US cases and the three VEN cases). In data mining terminology, the attribute "country" is therefore said to have the highest information content.
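The inspection of Tables 1 to 3 described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used on the full database: the tuple layout, the shortened function labels ("senior", "junior") and the helper name `uniquely_classified` are assumptions made for the example. It counts, for each attribute, how many of the ten cases have an attribute value that always leads to the same goal.

```python
from collections import defaultdict

# The ten illustrative cases: (purchasing decision, country, function, gender).
CASES = [
    ("universalist", "US", "senior", "male"),      # case 1
    ("universalist", "UK", "junior", "male"),      # case 2
    ("particularist", "UK", "senior", "female"),   # case 3
    ("universalist", "US", "senior", "female"),    # case 4
    ("particularist", "VEN", "senior", "female"),  # case 5
    ("particularist", "VEN", "senior", "male"),    # case 6
    ("particularist", "UK", "senior", "male"),     # case 7
    ("particularist", "VEN", "junior", "male"),    # case 8
    ("universalist", "UK", "junior", "female"),    # case 9
    ("universalist", "US", "junior", "male"),      # case 10
]

# Column index of each candidate attribute; index 0 holds the goal attribute.
ATTRIBUTES = {"country": 1, "function": 2, "gender": 3}


def uniquely_classified(cases, column):
    """Count the cases whose attribute value always leads to the same goal."""
    goals_by_value = defaultdict(set)
    for case in cases:
        goals_by_value[case[column]].add(case[0])
    return sum(1 for case in cases if len(goals_by_value[case[column]]) == 1)


counts = {name: uniquely_classified(CASES, col) for name, col in ATTRIBUTES.items()}
print(counts)  # {'country': 6, 'function': 0, 'gender': 0}
```

Only "country" classifies any cases outright, which matches the six-out-of-ten figure in the text.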

For the full database, we can compute the entropy for each attribute. This gives us a measure of the uncertainty of classification of our goal by each attribute. The higher the entropy, the greater the uncertainty that remains about the goal once the value of that attribute is known. What we really want to know, however, is how much information we have when we know the value(s) of any particular attribute.

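If $H_C(a = v)$ is the entropy of classification when attribute $a$ takes the value $v$, and $f(c \mid a = v)$ is the fraction of those cases that belong to goal class $c$, then this is given by:

$$H_C(a = v) = -\sum_{c} f(c \mid a = v)\,\log_2 f(c \mid a = v)$$

Thus, the entropy of classification when the management function is "senior manager" is:

$$\begin{aligned}
H_C(\text{function is senior}) &= -f(\text{particularist} \mid \text{function is senior})\,\log_2 f(\text{particularist} \mid \text{function is senior}) \\
&\quad - f(\text{universalist} \mid \text{function is senior})\,\log_2 f(\text{universalist} \mid \text{function is senior}) \\
&= -\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6} \\
&= 0.918
\end{aligned}$$

Similarly,

$$\begin{aligned}
H_C(\text{function is junior}) &= -f(\text{particularist} \mid \text{function is junior})\,\log_2 f(\text{particularist} \mid \text{function is junior}) \\
&\quad - f(\text{universalist} \mid \text{function is junior})\,\log_2 f(\text{universalist} \mid \text{function is junior}) \\
&= -\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{3}{4}\log_2\tfrac{3}{4} \\
&= 0.811
\end{aligned}$$

Hence, for the overall value of $H_C(\text{manager function})$, we simply weight these by each value's share of the ten cases:

$$H_C(\text{manager function}) = \tfrac{6}{10} \times 0.918 + \tfrac{4}{10} \times 0.811 = 0.8752$$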

Repeating this procedure for the other attributes, we obtain:

$$H_C(\text{gender}) = 1.0$$

$$H_C(\text{country}) = 0.4$$

Since $H_C(\text{gender}) = 1.0$, i.e., maximum uncertainty, this tells us that there is no information about the goal contained in the attribute "gender." This is consistent with Table 3, which shows that half the males and half the females fall into each goal class.

Because $H_C(\text{country})$ is the lowest entropy of classification, "country" corresponds to the least uncertainty. In other words, "country" has the highest information content and is thus the major contributor in explaining the cultural orientation of a consumer on this dimension. Manager function makes a smaller contribution.
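The three entropy values quoted above can be checked with a short Python sketch. The data layout and the function names `entropy` and `classification_entropy` are assumptions made for illustration; logarithms are taken to base 2, which is the base that reproduces the figures in the text. With unrounded intermediate values the result for manager function comes out at roughly 0.8755 rather than 0.8752, because the text rounds the two partial entropies before weighting them.

```python
from collections import Counter
from math import log2

# The ten illustrative cases: (purchasing decision, country, function, gender).
CASES = [
    ("universalist", "US", "senior", "male"),
    ("universalist", "UK", "junior", "male"),
    ("particularist", "UK", "senior", "female"),
    ("universalist", "US", "senior", "female"),
    ("particularist", "VEN", "senior", "female"),
    ("particularist", "VEN", "senior", "male"),
    ("particularist", "UK", "senior", "male"),
    ("particularist", "VEN", "junior", "male"),
    ("universalist", "UK", "junior", "female"),
    ("universalist", "US", "junior", "male"),
]
ATTRIBUTES = {"country": 1, "function": 2, "gender": 3}


def entropy(goals):
    """Entropy (base 2) of a list of goal labels."""
    total = len(goals)
    return -sum((n / total) * log2(n / total) for n in Counter(goals).values())


def classification_entropy(cases, column):
    """Entropy of the goal within each attribute value, weighted by group size."""
    groups = {}
    for case in cases:
        groups.setdefault(case[column], []).append(case[0])
    return sum(len(g) / len(cases) * entropy(g) for g in groups.values())


entropies = {name: classification_entropy(CASES, col) for name, col in ATTRIBUTES.items()}
for name, value in sorted(entropies.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.4f}")
```

Sorting by entropy lists "country" first and "gender" last, the same ordering of information content reached in the text.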

Implementing the Induction Algorithm

Although it is computationally intensive, it is desirable to apply the ID3 algorithm directly to the original total database. We can use as the goal attributes the complete range of responses for each dimension scale and not simply "universalistic" or "particularistic." For example, when examining the information content of the database with respect to "individualism-collectivism," we note that the five contributing questions on our scale mean there are 32 (= 2^5) possible states for the goal attribute. This was effected using the well-established computational method of list processing, which has the further advantage of being applicable to string data. For this reason it was not necessary to recode the original database with pseudonumeric codes to represent each categorical item. Furthermore, this type of analysis does not lend itself readily to SPSS recode-like procedures.
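As a small illustration of the multistate goal attribute, the sketch below enumerates the 32 possible goal states for a five-question scale. It assumes, as the figure 2^5 implies, that each question has two possible answers; the answer labels "A" and "B" are hypothetical placeholders, and each state is kept as a plain string so that no numeric recoding is needed.

```python
from itertools import product

# Hypothetical labels for the two possible answers to each question.
ANSWERS = ("A", "B")
QUESTIONS = 5

# Each goal state is the combined pattern of answers, stored as a string.
goal_states = ["".join(combo) for combo in product(ANSWERS, repeat=QUESTIONS)]

print(len(goal_states))  # 32
print(goal_states[:3])   # ['AAAAA', 'AAAAB', 'AAABA']
```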

Because the ID3 algorithm is concerned with the frequency of occurrence of each combination of attributes and not the value of the attributes, it is not necessary to re-scale attributes. Thus age is processed as a category, not a scaled variable. The ID3 algorithm automatically takes care of different types of variable for each attribute and enables us to explore our full database directly.

A recursive procedure, "CATEGORIZE," was constructed to process each iteration. After one iteration on our example set, this produces:

(country (US CATEGORIZE (((universalist US senior male) (universalist US senior female) (universalist US junior male))))

(VEN CATEGORIZE (((particularist VEN senior female) (particularist VEN senior male) (particularist VEN junior male))))

(UK CATEGORIZE (((universalist UK junior male) (particularist UK senior female) (particularist UK senior male) (universalist UK junior female)))))

The above list of lists was split by CATEGORIZE at "country" because application of the ID3 algorithm revealed that "country" had the lowest entropy.

The final list returned by CATEGORIZE is:

(country (US (status (universalist)))
         (VEN (status (particularist)))
         (UK (function (senior (status (particularist)))
                       (junior (status (universalist))))))
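The recursion can be sketched in Python as follows. This is a minimal sketch of the ID3 idea applied to the ten example cases, not a reproduction of the list-processing procedure described in the text: the function name `categorize`, the dictionary representation of cases and the nested-dictionary output format are all assumptions made for illustration. At each step it splits on the attribute with the lowest entropy of classification and recurses until every branch holds a single goal.

```python
from collections import Counter
from math import log2

FIELDS = ("decision", "country", "function", "gender")
ROWS = [
    ("universalist", "US", "senior", "male"),
    ("universalist", "UK", "junior", "male"),
    ("particularist", "UK", "senior", "female"),
    ("universalist", "US", "senior", "female"),
    ("particularist", "VEN", "senior", "female"),
    ("particularist", "VEN", "senior", "male"),
    ("particularist", "UK", "senior", "male"),
    ("particularist", "VEN", "junior", "male"),
    ("universalist", "UK", "junior", "female"),
    ("universalist", "US", "junior", "male"),
]
CASES = [dict(zip(FIELDS, row)) for row in ROWS]


def entropy(goals):
    """Entropy (base 2) of a list of goal labels."""
    total = len(goals)
    return -sum((n / total) * log2(n / total) for n in Counter(goals).values())


def classification_entropy(cases, attribute):
    """Goal entropy within each attribute value, weighted by group size."""
    groups = {}
    for case in cases:
        groups.setdefault(case[attribute], []).append(case["decision"])
    return sum(len(g) / len(cases) * entropy(g) for g in groups.values())


def categorize(cases, attributes):
    """Recursively split on the lowest-entropy attribute (ID3-style sketch)."""
    goals = [case["decision"] for case in cases]
    if len(set(goals)) == 1:
        return goals[0]                             # pure branch: a leaf
    if not attributes:
        return Counter(goals).most_common(1)[0][0]  # conflict: majority goal
    best = min(attributes, key=lambda a: classification_entropy(cases, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for case in cases:
        branches.setdefault(case[best], []).append(case)
    return {best: {value: categorize(subset, remaining)
                   for value, subset in branches.items()}}


tree = categorize(CASES, ["country", "function", "gender"])
print(tree)
```

On the example set the sketch splits first on "country" and then, for the UK branch only, on "function", which mirrors the final list shown above.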

In some situations the same attribute values produce different goals. These are known as data conflicts. For example, not every American (male) senior manager may have responded as a universalist. Such conflicts are accommodated simply by weighting the cases concerned, and the basic ID3 algorithm is then applied accordingly.
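One possible reading of this weighting is sketched below: where identical attribute values lead to different goals, the leaf records the relative frequency of each goal rather than a single answer. The text does not spell out the weighting scheme, so this is an assumption, and both the helper name `leaf_weights` and the four conflicting responses are invented purely for illustration.

```python
from collections import Counter


def leaf_weights(goals):
    """Return each goal's share of the cases that reach the same leaf."""
    counts = Counter(goals)
    total = sum(counts.values())
    return {goal: n / total for goal, n in counts.items()}


# Invented example: four respondents with identical attribute values
# (say, US, senior manager, male) who did not all give the same response.
conflicting_goals = ["universalist", "universalist", "universalist", "particularist"]

weights = leaf_weights(conflicting_goals)
print(weights)  # {'universalist': 0.75, 'particularist': 0.25}
```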

To explain the total variety, it would be necessary to use as much variety as there are cases. This is the same as saying that the 65,000 respondents are all individuals and we could require 65,000 attributes to describe them. Alternatively, we could use one attribute with 65,000 values (such as their name) to uniquely identify them. In the above parlance, their "name" has the highest information content and lowest entropy. However, this is not our aim. We refer to the earlier discussion repeated throughout this book, namely that we are seeking to develop a model based on a number of dimensions (attributes) that help structure managers' experiences. The analysis we are attempting here is intended to support this aim by exploring the relative importance of different attributes rather than capturing the total variety within the data set, as an ideological statistician may prefer.
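The point about a uniquely identifying attribute can be checked on the ten example cases. In this sketch the case number stands in for the respondent's name (an assumption for illustration, as are the variable and function names): because every value of such an attribute occurs exactly once, each group is trivially pure and the entropy of classification falls to zero, lower even than that of "country".

```python
from collections import Counter
from math import log2

# Goal and country for cases 1 to 10, in case order.
DECISIONS = ["universalist", "universalist", "particularist", "universalist",
             "particularist", "particularist", "particularist", "particularist",
             "universalist", "universalist"]
COUNTRIES = ["US", "UK", "UK", "US", "VEN", "VEN", "UK", "VEN", "UK", "US"]
CASE_IDS = list(range(1, 11))  # stand-in for a unique attribute such as a name


def classification_entropy(values, goals):
    """Goal entropy within each attribute value, weighted by group size."""
    groups = {}
    for value, goal in zip(values, goals):
        groups.setdefault(value, []).append(goal)
    result = 0.0
    for group in groups.values():
        shares = [n / len(group) for n in Counter(group).values()]
        result += len(group) / len(goals) * -sum(p * log2(p) for p in shares)
    return result


id_entropy = classification_entropy(CASE_IDS, DECISIONS)
country_entropy = classification_entropy(COUNTRIES, DECISIONS)
print(id_entropy, round(country_entropy, 3))  # 0.0 0.4
```

A zero entropy here says nothing useful about culture; it simply reflects that the attribute has as many values as there are cases.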

Applying this analysis to the whole database reveals the following ranking of attributes by entropy for each dimension. It is to be noted that "country" has the lowest entropy for every dimension, which is very good evidence to support the main thesis of Trompenaars' work.

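| Entropy | unpa | indcol | neaf | spdi | achasc | intex | time |
|---------|------|--------|------|------|--------|-------|------|
| lowest | country | country | country | country | country | country | country |
| | industry | religion | industry | industry | industry | industry | industry |
| | religion | industry | job | religion | religion | job | religion |
| | job | education | religion | age | job | religion | education |
| | age | age | corporate | gender | age | gender | job |
| | corporate | gender | age | education | education | age | age |
| | education | job | gender | job | corporate | education | gender |
| highest | gender | corporate | education | corporate | gender | corporate | corporate |

Column abbreviations: unpa = universalism-particularism; indcol = individualism-collectivism; neaf = neutral-affective; spdi = specific-diffuse; achasc = achievement-ascription; intex = internal-external; time = time orientation.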

Whilst this discussion might be viewed as an exercise in statisticulation, it is consistent with the face validity of the dimensions and with Ashby's Law of Requisite Variety: too few dimensions would not account for the richness of cultural diversity we see in the world.

If we apply other methods such as factor analysis, image factoring, and Kohonen neural networks, then the conclusions are identical.




Marketing Across Cultures (Culture for Business Series)
ISBN: 1841124710
Year: 2004
Pages: 82
