Unique Features of STATISTICA Data Miner
- The most comprehensive and effective system of user-friendly
tools for the entire data mining process, from querying
databases to generating final reports;
- To the best of our knowledge, the most comprehensive selection
of data mining methods available on the market (e.g., by far
the most comprehensive selection of neural network
architectures, classification/regression trees, multivariate
modeling, and many other predictive techniques; the largest
selection of graphics and visualization procedures of any
competing product);
- A selection of comprehensive, complete data mining projects
(solutions), ready to run, and set up to competitively evaluate
alternative models (using bagging (voting, averaging),
boosting, stacking, etc.), and to produce presentation-quality
summary reports;
- An extremely easy to use, drag-and-drop based user interface
that can be used even by novices, but is still highly
flexible, customizable, and provides one-click access to the
underlying scripts;
- Powerful, interactive data exploration (drilling, slicing,
dicing) tools, including the most comprehensive selection of
interactive, exploratory graphics-visualization tools
available in any product;
- Optimized for processing extremely large data sets (including
options to pre-screen even millions of variables and/or draw
truly random samples of records using DIEHARD-certified random
sampling procedures; see Comparative performance benchmarks
using large data sets);
- Highly optimized access to large databases, including the IDP
(In-Place Database Processing) technology that reads data
asynchronously directly from remote database servers (using
distributed processing if supported by the server), bypassing
the need to “import” data and create a local copy;
- Flexible deployment engine (included) can automatically
create deployable solutions (for advanced users, the engine
integrates with custom development environments, allowing you to
manage automatically generated C++ or VB code, create custom
deployment nodes, e.g., using VB built into the system, etc.);
- Multiple data streams can be simultaneously processed
through the same selection of predictive models;
- Support for auto-updating systems: easily build systems
that will automatically update all analyses and results when
data change;
- Open, COM-based architecture, unlimited automation options,
and support for custom extensions (using industry standard
VB (built in), Java, or C++);
- Client-Server or Desktop options; the enterprise
Client-Server version supports multithreading, distributed
processing and scales to multiple server computers;
- Complete web-enablement options (via WebSTATISTICA,
offering support for all data mining operations, including
interactive model building, via an Internet browser on any
computer connected to the Web); this ultimate enterprise data
analysis/mining system features workgroup/access management
options and allows you to manage projects over the Web and work
collaboratively "across the hall or across
continents."
The desktop version of STATISTICA Data Miner is
designed for the Windows environment. The Client-Server version
of STATISTICA Data Miner is platform independent on the
Client side and features an Internet browser based user
interface; the Server side works with all major Web server
operating systems (e.g., UNIX Apache) and Wintel server
computers.
STATISTICA Data Miner is designed for two general
categories of users, those who need:
1. A complete, deployed, and ready to use solution,
designed to solve a specific type of problem (e.g.,
customer credit scoring, predicting specific aspects of customer
behavior or providing answers to specific CRM questions,
managing the risk of an equipment failure using a model based on
the mining of a very complex set of historical data).
For these customers, StatSoft offers a complete installation
and deployment of data mining solutions that will draw data from
an existing corporate database or data warehouse and generate
predictions or ratings using a specific model that StatSoft
consultants will deploy on-site (services to develop a data
warehouse solution or restructure the existing one are also
available).
These specialized data mining solutions can later be modified
(by StatSoft or other consultants) as the needs of the company
change. The modification of such already deployed systems is
very easy because all STATISTICA solutions are stored in
the form of industry standard code (VB, C++).
2. A general, powerful data mining solution development
system, to be used to design and deploy custom systems
(in-house) by the corporate analysts and IS/IT personnel. These
customers will license the same set of tools, following the same
price structure as the customers from the previous category (see
above) except that they will not order the deployment and
consulting services.
Advanced Software Technology = Efficient and Elegant User
Interface
STATISTICA Data Miner is based on a technology that
offers both (a) the full advantages of the interactive,
"point and click" user interface and (b) complete
programmability and customizability.
STATISTICA
analysis "objects" and nodes. At the heart of STATISTICA
Data Miner is a set of over 260 highly optimized,
efficient, and extremely fast STATISTICA procedures,
called from Visual Basic scripts (available to you in
source-code format) which are used to specify the relations
between the procedures (objects) and control the logic of the
project (and the "flow" of data). This flexible,
customizable architecture delivers the full functionality of all
statistical and analytic procedures to the data mining
environment as self-contained analysis objects. These scripts
(analysis objects) serve as the "wrappers" or glue for
defining the flow of data through the project, while the actual
numerical analyses are performed via the extremely fast analytic
procedures of STATISTICA. The objects, which can be used
as the nodes for data cleaning and/or filtering, and for
analyzing the data, are organized in the Node Browser.
The nodes available in the node browser (and, hence, available
to the data mining project) are:
- Nodes for data input and data acquisition. Here you
can create and store the scripts necessary to connect to
remote (protected) data sources on a server. Of course, you
can also analyze STATISTICA data files or placeholders
for in-place processing of remote databases (see
IDP), in which case no special nodes (scripts) have to be
created.
- Nodes for data filtering, cleaning, verification,
feature selection, and sub-sampling. These options are
essential to data mining to detect and correct erroneous
information that may bias final conclusions. The
sub-sampling facilities are useful for analyzing very large
data sets to extract random samples for further
analyses. The feature selection options allow you to
automatically select informative variables (predictors) from
among, for example, hundreds of thousands of possible
predictors.
- Nodes for data analyses. These nodes contain the
full functionality of all STATISTICA analyses and
graphics capabilities; hundreds of procedures are available
to address essentially all analytic needs that can possibly
arise in your data mining project.

Creating the data mining project. These nodes can simply
be connected in the data mining workspace.
The data mining workspace is a structured, highly efficient,
user-friendly data analysis environment, where you can move
around and interconnect data, analyses, and results by simply
dragging icons and connecting arrows. You can simultaneously
open, modify, and run as many data mining workspaces as you like
and drag nodes (objects) between workspaces and node browsers.
The workspace area is pre-divided to make room for:
- Data acquisition. Here is where the data sources can be
specified (e.g., STATISTICA data files, placeholders for
in-place processing of data on remote servers, programs that
generate data programmatically, for use in advanced modeling).
- Data preparation, cleaning, transformation. The nodes
in this area will accept one or more data sources for input, and
create one or more (filtered, cleaned, transformed) data sources
for further "downstream" analyses.
- Data analysis, modeling, classification, forecasting.
The nodes in this area will perform the numeric analyses.
- Reports. This area will show the results of the
analyses.

Creating a Data Mining project is easy: First select a data
source; second, apply any data preparation, cleaning, or
transformation; third, connect the desired analyses to the
cleaned data, and, fourth, review and/or publish the results.
Many users of STATISTICA Data Miner will never need to go
beyond this simple interactive, "point and click" user
interface.
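The four-step flow described above (data source, preparation, analysis, report) can be pictured as a chain of functions. The following is only a minimal sketch in Python, not STATISTICA's own node mechanism; every name in it (acquire, clean, analyze, report, run_workspace) and the sample records are hypothetical.

```python
from statistics import mean

def acquire():
    """Data acquisition node: return raw records (hypothetical sample data)."""
    return [
        {"age": 34, "income": 52000.0},
        {"age": 51, "income": None},   # record with a missing value
        {"age": 45, "income": 61000.0},
    ]

def clean(records):
    """Data preparation node: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def analyze(records):
    """Analysis node: compute the mean of every variable."""
    return {key: mean(r[key] for r in records) for key in records[0]}

def report(results):
    """Report node: format the results as text lines."""
    return [f"mean {name} = {value:.1f}" for name, value in sorted(results.items())]

def run_workspace(source, *nodes):
    """Pass the output of each node to the next, like arrows in a workspace."""
    data = source()
    for node in nodes:
        data = node(data)
    return data

summary = run_workspace(acquire, clean, analyze, report)
for line in summary:
    print(line)
```

Swapping or inserting a node changes the project without touching the other steps, which is the property the drag-and-drop workspace relies on.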
Specifying
complex models. The simple user interface -- based on
point-and-click selections from menus and browsers -- will allow
you to apply even very advanced methods. Several comprehensive
and flexible project "templates" can be selected to
address common data mining tasks. For example, in order to find
a good model for predicting credit risk of new clients based on
historical data that includes various potentially useful
predictors, you could simply select the template for the Advanced
Comprehensive Regression Models project.
All you need to do next is connect your historical data, specify
the variables of interest, and "train" the project;
thus, in just a few seconds (select data file, select variables,
select the arrow tool to connect the data), the program will
automatically:
- Create two samples for training and for cross-validation,
to avoid over-fitting;
- Apply best subset linear regression, standard regression
trees algorithms, CHAID and exhaustive CHAID, a 3-layer
multilayer perceptron neural network, and a radial basis
function neural network to find a good model for predicting
credit risk;
- Combine all responses into a meta-learner that picks the
best model, or combines the predictions from multiple
models.
After applying these cutting-edge techniques for modeling
linear, nonlinear, or even chaotic relationships, you are ready
for deployment: Simply connect the data source for the new data
(new customers) to the Compute Best Prediction From All
Models node, and the program will automatically apply the
fully trained models to derive the best prediction possible.
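The idea behind such a template (hold out a validation sample, fit several competing models, then pick or combine them) can be sketched as follows. This is an illustration in plain Python under stated assumptions, not the template's actual implementation: the three toy models (a mean baseline, a straight line, and a one-split tree), the 50/50 split, and the synthetic data are all chosen here for brevity.

```python
import random

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Ordinary least squares for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_stump(xs, ys):
    """One-split regression tree, split at the median of x."""
    cut = sorted(xs)[len(xs) // 2]
    left = [y for x, y in zip(xs, ys) if x < cut]
    right = [y for x, y in zip(xs, ys) if x >= cut]
    lo, hi = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lo if x < cut else hi

def mse(model, xs, ys):
    """Mean squared prediction error of a model on a sample."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def compete(xs, ys, seed=0):
    """Train on one half, score on the other, return the winner and an average."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, valid = idx[:half], idx[half:]
    tx, ty = [xs[i] for i in train], [ys[i] for i in train]
    vx, vy = [xs[i] for i in valid], [ys[i] for i in valid]
    fitters = {"mean": fit_mean, "linear": fit_linear, "stump": fit_stump}
    models = {name: fit(tx, ty) for name, fit in fitters.items()}
    errors = {name: mse(model, vx, vy) for name, model in models.items()}
    best = min(errors, key=errors.get)

    def averaged(x):
        # Simple "averaging" combiner over all trained models
        return sum(model(x) for model in models.values()) / len(models)

    return best, errors, models[best], averaged

# Hypothetical data: a noisy straight line
rng = random.Random(1)
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]
best_name, errors, best_model, averaged = compete(xs, ys)
print(best_name, round(errors[best_name], 4))
```

Scoring on records that were not used for fitting is what guards against over-fitting; applying best_model (or averaged) to new records corresponds to the deployment step.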
Speed. The analysis nodes (objects) contain the full
functionality of STATISTICA, encapsulated into calls made
from the standardized STATISTICA Visual Basic node
scripts. However, the actual analyses are performed via the
highly optimized STATISTICA analysis modules, which have
been refined for almost two decades to deliver maximum speed,
capacity, and accuracy.
Large data sets. STATISTICA Data Miner includes
designated analytic facilities specifically optimized for
selecting subsets of variables from among hundreds of thousands
or even over one million variables on input; for example, data
filtering nodes are available for selecting the best k
predictors (features) for classification from a huge number of
available predictors (see also Feature
Selection and Variable Filtering ).
Customizing analyses. The analyses or data
cleaning/filtering operations implemented by the nodes of STATISTICA
Data Miner can further be customized by simply
double-clicking on the respective icons: Every icon contains the
options to fully customize the respective operations; for
example, clicking on a neural network node will bring up a
dialog (and dialog help) for customizing the specific analysis
(to change the number of iterations, number of layers in the
network, the detail of reported results, etc.).
Saving the project. The entire project (workspace) can
be saved, along with all customization, intermediate data
sources, comments, etc. Routine analyses (e.g., for regular
updating of a trained complex set of models for voted
classification based on various methods) can be saved and later
applied by clicking on a single button ("update").
Technical Note: STATISTICA Data Miner
Node Scripts. Each node in STATISTICA Data Miner
consists of a standardized STATISTICA Visual Basic script
(that calls the respective STATISTICA procedures), with
access to additional functions to provide the user interface to
further customize analyses. It may never be necessary to modify
or customize these scripts; however, if your in-house IT
department or consultants want to insert proprietary algorithms
into STATISTICA Data Miner, this can very easily be
accomplished. A simple node script may look like this:
Private Sub SubsetNode( _
    DataIn() As InputDescriptor, _
    DataOut() As InputDescriptor)

    ' Size the output array to match the input array
    ReDim DataOut(LBound(DataIn()) To UBound(DataIn())) _
        As InputDescriptor

    ' Pass a copy of each input data source descriptor downstream
    For i = LBound(DataIn()) To UBound(DataIn())
        Set DataOut(i) = DataIn(i).Clone()
    Next i

End Sub
This program will simply copy the data source information
(element in input array DataIn(i)) from each data source,
and pass it on for further processing in DataOut(i). Any
number of proprietary or highly customized numeric operations
could be performed inside the script, to change practically all
aspects of the data, or to apply any of the thousands of
analytic functions available in STATISTICA Visual Basic.
This general open architecture of STATISTICA Data Miner
provides numerous advantages that are unique among data mining
software (further elaborated in the section on Unique
Features).
- Each node can handle multiple data sources on input, and
multiple data sources on output; identical operations can be
applied to multiple data sources via a single node.
- A data source can be a mapping into a database that does
not need to actually (physically) reside on the machine
running STATISTICA Data Miner, nor does it have to be
copied; this is extremely important for the processing of
large data sets, as they commonly occur in data mining.
- You can perform operations within and between data
sources; for example, you could merge data in different
remote databases into a single data file, for further
processing with STATISTICA Data Miner analytic nodes.
- Visual Basic itself is a simple, object-oriented language,
available for most industry-standard application programs;
there is a virtually limitless supply of programming
resources, talented and experienced programmers, and
ready-to-use third-party applications that can be integrated
with STATISTICA Data Miner. Likewise, STATISTICA
Data Miner can be integrated with other applications,
for example, to automatically deliver results to the Web or
by email, or to export results into other applications. Also, a
fully Web-based version of STATISTICA Data Miner,
powered by WebSTATISTICA, is available.
- STATISTICA's macro recording facilities will
automatically record interactive analyses; these recordings
can easily be converted into scripts for custom nodes.
- Where applicable, STATISTICA's analyses contain
options for generating STATISTICA Visual Basic code
for deployment (e.g., of trained neural networks); those
scripts can be directly used in scripts for custom
deployment nodes.
Deploying solutions. The results of analyses via STATISTICA
Data Miner can be deployed (applied to new data or inside
other automated data processing systems) in several ways.
- Automatic deployment of models. Data mining templates with
deployment for standard types of analyses can be chosen as
options from pulldown menus: Select a template, connect
training data to estimate models, and you are ready to apply
the best solution (average solution, voted solution, etc.)
to new data; the end user only needs to connect new data to
the deployment node to compute predictions, classifications,
forecasts, etc.
- C, C++, Visual Basic code generator options.
Code-generator options are available for regression
(prediction of continuous variables) and classification
(prediction of categorical variables) types of problems; for
example, you can save C++ code or Visual Basic code that
implements the prediction from tree-classification
algorithms, linear discriminant function analysis,
generalized linear models, neural networks, etc. The code
generated by these options can quickly be integrated into
custom programs for deployment.
- Deployment via STATISTICA Visual Basic. The Visual
Basic code generated from STATISTICA analysis modules
will seamlessly integrate into the STATISTICA Data Miner
architecture; based on the Visual Basic code generated by STATISTICA,
custom deployment nodes can be programmed in minutes, even
by inexperienced programmers.
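To make the code-generator idea concrete, the sketch below turns a small, hand-written classification tree into standalone source code that no longer depends on the modeling tool. It is an assumption-laden illustration: STATISTICA emits C, C++, or Visual Basic, whereas this example emits Python, and the tree layout (keys var, cut, left, right, label) and the variable names are invented for the example.

```python
def generate_tree_code(tree, func_name="predict"):
    """Emit Python source for a function that applies a decision tree."""
    lines = [f"def {func_name}(row):"]

    def emit(node, depth):
        pad = "    " * depth
        if "label" in node:
            # Terminal node: return the predicted class
            lines.append(f"{pad}return {node['label']!r}")
        else:
            # Split node: branch on one variable and one cut point
            lines.append(f"{pad}if row[{node['var']!r}] <= {node['cut']!r}:")
            emit(node["left"], depth + 1)
            lines.append(f"{pad}else:")
            emit(node["right"], depth + 1)

    emit(tree, 1)
    return "\n".join(lines) + "\n"

# Hypothetical "trained" tree for a credit-risk classification
tree = {
    "var": "Income", "cut": 40000,
    "left": {"label": "high risk"},
    "right": {
        "var": "Age", "cut": 25,
        "left": {"label": "medium risk"},
        "right": {"label": "low risk"},
    },
}

source = generate_tree_code(tree)
print(source)

# The generated text is ordinary code and can be loaded on its own
namespace = {}
exec(source, namespace)
predict = namespace["predict"]
```

The generated function contains only comparisons and return statements, which is why code of this kind is easy to embed in custom deployment programs.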
Using STATISTICA Data Miner with Extremely Large Data
Sets
The entire STATISTICA family of products and STATISTICA
Data Miner in particular are specifically optimized to
efficiently process extremely large data sets, with millions of
observations (records) and millions of variables (fields).
Processing databases that are larger than the local
storage device. STATISTICA Data Miner (and optionally
other STATISTICA products) can process data in (remote)
databases "in-place" via its highly optimized In-place
Database Processing (IDP) technology, which combines the
processing resources of the database server and the local
computer to (a) perform the queries (using the database server
CPU) while simultaneously (b) processing the fetched records
"on-the-fly" on the local machine (using the local
computer (client) CPU). This way, databases that are larger than
what could fit on the local machine can be processed, and
significant performance gains can be achieved by saving the time
that would normally be required to first import the data to the
local device and only then process them locally. Practically all
common database formats are supported, and powerful tools are
provided for defining the database connection (query).
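The division of labor described above (the server evaluates the query, the client processes records as they arrive) can be sketched with a database cursor and a one-pass statistic. This is only an approximation of the idea: an in-memory SQLite database stands in for a remote server, the fetch loop is synchronous rather than asynchronous, and the table and column names are hypothetical.

```python
import sqlite3

def streaming_mean_variance(cursor, batch_size=1000):
    """One-pass mean and sample variance (Welford) over fetched batches."""
    n, mean, m2 = 0, 0.0, 0.0
    while True:
        rows = cursor.fetchmany(batch_size)   # only one batch is held in memory
        if not rows:
            break
        for (value,) in rows:
            n += 1
            delta = value - mean
            mean += delta / n
            m2 += delta * (value - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance

# Stand-in for a remote database: 100 purchase amounts, 1.0 through 100.0
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?)",
                 [(float(i),) for i in range(1, 101)])

# The WHERE clause is evaluated by the database engine, not by the client
cur = conn.execute("SELECT amount FROM purchases WHERE amount > 50")
count, avg, var = streaming_mean_variance(cur, batch_size=10)
conn.close()
print(count, avg, var)
```

Because no local copy of the table is ever built, memory use depends on the batch size rather than on the size of the database.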
Processing databases with extremely large numbers of
variables (fields): The unique feature
selection and variable screening facilities. When the
number of variables in the input data file is extremely large, STATISTICA
Data Miner can automatically select subsets of variables
from among even millions of variables (candidates) for
predictive data mining. The extremely fast and efficient
algorithm will select variables (features) that are likely to be
the most relevant predictors in the current data set, without
introducing biases into subsequent model building for predictive
data mining.
Processing data files with extremely large numbers of
cases (records): Flexible and efficient random sampling. STATISTICA
products (including STATISTICA Data Miner) can process
data files with practically unlimited numbers of cases
(records), and STATISTICA's data access procedures are
highly optimized. However, including all records in the analyses
when the number of records is extremely large is (a) entirely
unnecessary, (b) time consuming, and (c) often impractical or
impossible (in extreme cases it could take hours merely to read
all records). In order to speed up the analytic process, STATISTICA
Data Miner includes sophisticated tools for drawing
representative, perfectly random samples from huge data sets
(databases). The user can quickly extract simple or systematic
random samples of appropriate sizes, with or without
replacement, from huge data sets (e.g., with many millions of
records) for further analyses with sophisticated modeling tools
that may require multiple passes through the data (e.g., neural
networks, generalized linear models, etc.). The random
sub-sampling is based on STATISTICA's validated random
number generator. Note that STATISTICA is one of only a few
commercially available software products that have passed the
most advanced and most recognized tests for randomness.
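The sampling schemes named above can be sketched in a few lines. The following Python illustration is not STATISTICA's sampling code: it uses Python's built-in Mersenne Twister generator as a stand-in for the validated generator mentioned in the text, and the function names are hypothetical. The reservoir method is shown because it draws a simple random sample in a single pass, without knowing the number of records in advance.

```python
import random

def reservoir_sample(records, k, rng):
    """Simple random sample of k records without replacement, in one pass."""
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)        # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = record       # replace with probability k / (i + 1)
    return sample

def systematic_sample(records, step, rng):
    """Every step-th record, starting at a random offset below step."""
    start = rng.randrange(step)
    return [record for i, record in enumerate(records) if i % step == start]

def sample_with_replacement(records, k, rng):
    """k independent draws; the same record may be drawn more than once."""
    return [rng.choice(records) for _ in range(k)]

rng = random.Random(42)
simple = reservoir_sample(range(10000), 100, rng)
systematic = systematic_sample(range(1000), 50, rng)
boot = sample_with_replacement(list(range(10)), 25, rng)
print(len(simple), len(systematic), len(boot))
```

A multi-pass method such as a neural network can then be trained on the extracted sample instead of re-reading the full database on every pass.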
General Categories of Data Mining Techniques
STATISTICA Data Miner offers the most comprehensive
selection of statistical, exploratory, and visualization
techniques available on the market, including leading edge and
highly efficient neural network/machine learning and
classification procedures. Also, the complete analytic
functionality of STATISTICA is available for data mining,
encapsulated in over 260 nodes that can be selected in a
structured and customizable Node Browser, and dragged
into the data mining workspace.
The specialized tools for data mining are optimized for speed
and efficiency and can be classified into the following five
general "areas" (each comprising a set of STATISTICA
modules, some of them offered only in the STATISTICA Data
Miner environment):
General Slicer/Dicer and Drill-Down Explorer.
A large number of analysis nodes are available for creating
exploratory graphs, to compute descriptive statistics,
tabulations, etc. These nodes can be connected to input data
sources, or to all intermediate results. A specialized STATISTICA
application module is available (STATISTICA
Drill-Down Explorer) for interactively exploring your
data by drilling down on selected variables, and categories or
ranges of values in those variables. For example, you can
drill-down on Gender, to display the distribution for a variable
Income for females only; next you could drill down on a specific
income group, to explore (e.g., create graphical summaries for)
selected variables, for females in the selected income group
only. A unique feature of STATISTICA Drill-Down Explorer
is the ability to select and deselect drill-down variables and
categories in any order; so you could next deselect variable
Gender and thus display selected graphs and statistics for the
selected Income group, but now for both males and females.
Another unique feature of the Drill-Down Explorer is its variety
of categorization ("slicing") methods. Hence, the STATISTICA
Drill-Down Explorer offers tremendous flexibility for
"slicing-and-dicing" the data. The STATISTICA
Drill-Down Explorer can be applied to raw data, database
connections for in-place processing of data in remote databases,
or to any intermediate result computed in a STATISTICA Data
Miner project. (A fully integrated OLAP application is also
available as an optional add-on module for enterprise
installations; please contact StatSoft for details.)
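The ability to select and deselect drill-down variables in any order amounts to keeping a set of independent selection conditions and re-applying whichever ones are currently active. A minimal sketch in Python, with a hypothetical DrillDown class and made-up records that follow the Gender/Income example above:

```python
class DrillDown:
    """Keeps named selection conditions that can be added or removed freely."""

    def __init__(self, records):
        self.records = records
        self.conditions = {}   # variable name -> predicate on that variable

    def drill_down(self, variable, predicate):
        """Add (or replace) the selection condition for one variable."""
        self.conditions[variable] = predicate
        return self

    def drill_up(self, variable):
        """Remove the condition for one variable, whenever it was added."""
        self.conditions.pop(variable, None)
        return self

    def current(self):
        """Records that satisfy all currently active conditions."""
        return [r for r in self.records
                if all(pred(r[var]) for var, pred in self.conditions.items())]

customers = [
    {"Gender": "F", "Income": 30000},
    {"Gender": "F", "Income": 72000},
    {"Gender": "M", "Income": 68000},
    {"Gender": "M", "Income": 25000},
    {"Gender": "F", "Income": 81000},
]

explorer = DrillDown(customers)
females = explorer.drill_down("Gender", lambda g: g == "F").current()
high_income_females = explorer.drill_down("Income", lambda v: v >= 60000).current()
high_income_all = explorer.drill_up("Gender").current()   # deselect Gender only
print(len(females), len(high_income_females), len(high_income_all))
```

Because the conditions are stored by variable rather than as a stack, Gender can be deselected while the Income condition stays in force.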
General Classifier. STATISTICA Data Miner
offers the widest selection of tools to perform data mining
classification techniques (and build related deployable models)
available on the market, including generalized linear models
(for binomial and multinomial responses), classification trees,
general classification and regression tree modeling (GTrees),
general CHAID models, cluster analysis techniques (including
"large capacity" implementations of tree-clustering
and k-means clustering methods), and general discriminant
analysis models (including best-subset selection of predictors).
Also, the numerous advanced neural network classifiers available
in STATISTICA Neural Networks are available in STATISTICA
Data Miner, and can be used in conjunction or competition
with other classification techniques.
- Deployment. Where applicable, the program includes
options for generating C, C++, or STATISTICA Visual
Basic code for deployment of final solutions in your custom
programs; models are also automatically available for
deployment after training, so all you need to do is connect
new data to the special deployment node, to compute
predicted classifications.
General Modeler/Multivariate Explorer. STATISTICA
Data Miner offers the widest selection of tools to build
deployable data mining models, based on linear, nonlinear, or
neural network techniques and tools to explore data; the user
can also build predictive models based on general multivariate
techniques. In summary, STATISTICA offers the full range
of techniques, from linear and nonlinear regression models,
advanced generalized linear and generalized
additive models, to advanced neural network methods. STATISTICA
Data Miner also includes techniques that are not usually
found in data mining software, such as partial least squares
methods (for reducing large numbers of variables), survival
analysis (for analyzing data containing censored observations;
e.g., for medical research data and data from industrial
reliability and quality control studies), structural equation
modeling techniques (to build and evaluate confirmatory linear
models), correspondence analysis (for analyzing the structure of
complex tables), factor analysis and multidimensional scaling
(for exploring structure in large numbers of variables), and
many others.
- Deployment. Where applicable, the program includes
options for generating C, C++, or STATISTICA Visual
Basic code for deployment of final solutions in your custom
programs. Models are also automatically available for
deployment after training, so all you need to do is connect
new data to the special deployment node to compute predicted
values.
General Forecaster. STATISTICA Data Miner
includes a broad selection of traditional (i.e., non-neural
networks-based) forecasting techniques (including ARIMA,
exponential smoothing with seasonal components, Fourier spectral
decomposition, seasonal decomposition, regression- and
polynomial lags analysis, etc.), as well as neural network
methods for time series data.
- Deployment. Forecasts can automatically be computed
for multiple models in a data mining project, and plotted in a
single graph for comparative evaluation. For example, you
can compute and compare predictions from multiple ARIMA
models, different methods for seasonal and non-seasonal
exponential smoothing, and the best time-series neural
network architectures (after searching over 100 different
architectures).
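Comparing forecasts from several candidate models reduces to scoring each one on the same series. As a small illustration, the sketch below compares simple (non-seasonal) exponential smoothing under different smoothing constants by their one-step-ahead squared error; the series, the candidate constants, and the function names are assumptions made for this example, and ARIMA or neural network models would be scored in the same way.

```python
def ses_forecasts(series, alpha):
    """One-step-ahead forecasts from simple exponential smoothing."""
    level = series[0]
    forecasts = []
    for value in series[1:]:
        forecasts.append(level)                      # forecast made before seeing value
        level = alpha * value + (1 - alpha) * level  # update the smoothed level
    return forecasts, level

def compare_alphas(series, alphas):
    """Mean squared one-step-ahead error for each smoothing constant."""
    scores = {}
    for alpha in alphas:
        forecasts, _ = ses_forecasts(series, alpha)
        errors = [(actual - predicted) ** 2
                  for actual, predicted in zip(series[1:], forecasts)]
        scores[alpha] = sum(errors) / len(errors)
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical series with a steady upward trend
series = [10, 12, 14, 16, 18, 20]
best_alpha, scores = compare_alphas(series, [0.1, 0.5, 0.9])
print(best_alpha, scores)
```

On a trending series the largest constant tracks the data most closely, so it wins here; on a noisy, level series a smaller constant would typically do better.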
General Neural Networks
Explorer. This tool contains the most comprehensive
selection of neural network methods available on the market.
This powerful component of STATISTICA Data Miner offers
tools to approach virtually any data mining problem (including
classification, hidden structure detection, and powerful
forecasting). One of the unique features of the NN Explorer
is the selection of intelligent problem solvers and automatic
wizards that use Artificial Intelligence methods to help you
solve the most demanding problems involved in advanced NN
analysis (such as selecting the best network architecture and
the best subset of variables). The Explorer offers the widest
selection of cutting-edge NN architectures and procedures and
highly optimized algorithms that include: Multilayer
perceptrons, radial basis function networks, probabilistic
neural networks, generalized regression neural networks,
self-organizing feature maps, linear models, principal
components network, and cluster networks. Network ensembles of
these architectures can also be evaluated. Estimation methods
include back propagation, conjugate gradient descent,
quasi-Newton, Levenberg-Marquardt, quick propagation,
delta-bar-delta, LVQ, pruning algorithms, and more; options are
available for cross validation, bootstrapping, subsampling,
sensitivity analysis, etc.
- Deployment. STATISTICA
Neural Networks includes code generator options to
produce C, C++, and STATISTICA Visual Basic code for
one or more trained networks as well as ensembles of
networks. This code can be quickly incorporated into your
own custom deployment programs. In addition, fully trained
neural networks and ensembles of neural networks can be
saved, to be applied later for computing predicted responses
or classifications for new data. A deployment node can be
dragged into the data miner workspace to perform prediction
and predictive classification based on trained neural
networks automatically; all you have to do (after the
participating network architectures are trained) is connect
the data for deployment.
Specialized Data Mining Modules
A large portion of analytic functionality used by STATISTICA
Data Miner is driven by the computational engines of modules
that are included in various other STATISTICA products
(refer to STATISTICA
Products for detailed information about those modules):
- Neural Networks techniques (the largest selection of
architectures available, automatic problem solver tools,
advanced feature selection techniques).
- All STATISTICA Graphics Tools and interactive
exploration/visualization tools; Descriptive statistics,
breakdowns, and exploratory data analysis; Frequency Tables,
Crosstabulations, Tables and Stub-and-Banner Tables,
Multiple Response Analysis; Nonparametric Statistics;
Distribution Fitting; Power Analysis Techniques.
- General Linear Models (GLM); General Regression Models
(GRM); Generalized Linear Models (GLZ); General Partial
Least Squares Models (PLS); Variance Components and Mixed
Model ANOVA/ANCOVA; Survival/Failure Time Analysis; General
Nonlinear Estimation with Logit and Probit Regression;
Log-Linear Analysis of Frequency Tables; Time Series
Analysis/Forecasting; Structural Equation Modeling/Path
Analysis (SEPATH).
- Cluster Analysis Techniques; Factor Analysis; Principal
Components & Classification Analysis; Canonical
Correlation Analysis; Reliability/Item Analysis;
Classification Trees; Correspondence Analysis;
Multidimensional Scaling; Discriminant Analysis; General
Discriminant Analysis Models (GDA).
- Quality Control Charts techniques, Process Analysis, and
Experimental Design (DOE) procedures.
However, several modules include selections of highly
specialized data mining and data mining modeling techniques that
are offered only as part of STATISTICA Data Miner. The
following sections include technical information about these
modules.
FEATURE SELECTION AND VARIABLE FILTERING.
This module will automatically select subsets of variables from
extremely large data files or databases connected for in-place
processing. The module can handle a practically unlimited
number of variables: Literally millions (!) of input variables
can be scanned to select predictors for regression or
classification. Specifically, the program includes several
options for selecting variables ("features") that are
likely to be useful or informative in specific subsequent
analyses. The unique algorithms implemented in the Feature
Selection and Variable Filtering module will select
continuous and categorical predictor variables which show a
relationship to the continuous or categorical dependent
variables of interest, regardless of whether that relationship
is simple (e.g., linear) or complex (nonlinear, non-monotone).
Hence, the program does not bias the selection in favor of any
particular model that you may use to find a final best rule,
equation, etc. for prediction or classification. Various
advanced feature selection options are also available. This
module is particularly useful in conjunction with the in-place
processing of databases (without the need to copy or import the
input data to the local machine), when it can be used to scan
huge lists of input variables, select likely candidates that
contain information relevant to the analyses of interest, and
automatically select those variables for further analyses with
other nodes in the data miner project. For example, a subset of
variables based on an initial scan via this module can be
submitted to the STATISTICA Neural
Networks feature selection options for further review.
These options allow STATISTICA Data Miner to handle data
sets in the giga- and terabyte range.
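One way to screen predictors without favoring a particular model form is to bin each predictor and measure how much of the variation in the dependent variable lies between the bins. The sketch below uses that statistic (the correlation ratio, eta squared); it is an illustrative stand-in, not the module's actual algorithm, and the function names, bin count, and synthetic data are all assumptions.

```python
import random

def eta_squared(x, y, bins=10):
    """Share of the variance of y explained by equal-count bins of x."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    grand = sum(y) / n
    total = sum((v - grand) ** 2 for v in y)
    if total == 0:
        return 0.0
    between = 0.0
    for b in range(bins):
        members = order[b * n // bins:(b + 1) * n // bins]
        if not members:
            continue
        bin_mean = sum(y[i] for i in members) / len(members)
        between += len(members) * (bin_mean - grand) ** 2
    return between / total

def select_features(columns, target, k, bins=10):
    """Rank predictors by eta squared and keep the k strongest."""
    scores = {name: eta_squared(values, target, bins)
              for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical data: one non-monotone predictor, one linear, two irrelevant
rng = random.Random(1)
n = 500
columns = {name: [rng.uniform(-1, 1) for _ in range(n)]
           for name in ("x_quad", "x_lin", "noise_a", "noise_b")}
target = [q * q + 0.5 * l + rng.gauss(0, 0.05)
          for q, l in zip(columns["x_quad"], columns["x_lin"])]

top, scores = select_features(columns, target, k=2)
print(top, {name: round(s, 3) for name, s in scores.items()})
```

The predictor x_quad has essentially zero linear correlation with the target, yet it is retained, which is the sense in which a screening statistic of this kind does not bias the selection toward linear models.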

ASSOCIATION RULES. This module
contains a complete implementation of the so-called A-priori
algorithm for detecting ("mining for") association
rules such as "customers who order product A, often also
order product B or C" or "employees who said positive
things about initiative X, also frequently complain about issue
Y but are happy with issue Z" (see Agrawal and Swami, 1993;
Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also
Witten and Frank, 2000). The STATISTICA Association Rules
module allows you to rapidly process huge data sets for
associations (relationships), based on pre-defined
"threshold" values for detection. Specifically, the
program will detect relationships or associations between
specific values of categorical variables in large data sets.
This is a common task in many data mining projects applied to
databases containing records of customer transactions (e.g.,
items purchased by each customer), and also in the area of text
mining. As with all modules of STATISTICA, data in external
databases can be processed in-place by the STATISTICA Association
Rules module, so the program can efficiently handle
extremely large analysis tasks. The
results can be displayed in tables, and also in unique 2D and 3D
graphs where strong associations are highlighted by thick lines
connecting the respective items.
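As an illustration of how threshold-based rule mining of this kind works, here is a minimal sketch of the level-wise frequent-itemset search and rule extraction, assuming transactions are given as lists of item names. The function names and the two thresholds (min_support, min_confidence) are illustrative choices and do not reflect the module's actual options or implementation.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset whose support >= min_support."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in sets for item in t}
    result = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in sets if c <= t) for c in current}
        frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-candidates and keep only those
        # whose every k-subset is itself frequent (the pruning step).
        k += 1
        current = set()
        for a, b in combinations(list(frequent), 2):
            cand = a | b
            if len(cand) == k and all(
                frozenset(s) in frequent for s in combinations(cand, k - 1)
            ):
                current.add(cand)
    return result


def association_rules(itemsets, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, support in itemsets.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = support / itemsets[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), support, confidence))
    return rules


transactions = [["A", "B"], ["A", "B", "C"], ["A", "C"], ["B"], ["A", "B"]]
itemsets = frequent_itemsets(transactions, min_support=0.4)
rules = association_rules(itemsets, min_confidence=0.7)
for lhs, rhs, support, confidence in rules:
    print(sorted(lhs), "->", sorted(rhs), round(support, 2), round(confidence, 2))
```

Raising either threshold shrinks the rule list, which is why such "threshold" values are the main control over how many associations are reported.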
INTERACTIVE DRILL-DOWN EXPLORER. A first step of many
data mining projects is to explore the data interactively, to
gain a first "impression" of the types of variables in
the analyses, and their possible relationships. The purpose of
the Interactive Drill-Down Explorer is to provide a
combined graphical, exploratory data analysis, and tabulation
tool that will allow you to quickly review the distributions of
variables in the analyses, their relationships to other
variables, and to identify the actual observations belonging to
specific subgroups in the data.

How the Drill-Down Explorer Works. The
"drill-down" metaphor within the data mining context
summarizes the basic operation of this analytic process quite
well: The program allows you to select observations from larger
data sets by selecting subgroups based on specific values or
ranges of values of particular variables of interest (e.g., Gender
and Average Purchase in the example above); in a sense
you can expose the "deeper layers" or
"strata" in the data by reviewing smaller and smaller
subsets of observations selected by increasingly complex logical
selection conditions.
Drilling "up." The interactive nature of the
Drill-Down Explorer allows you not only to drill down
into the data or database (select groups of observations with
increasingly specific logical selection conditions), but also to
"drill up": At any time, you can select one of the
previously specified variable (category) groups
and de-select it from the list of drill-down conditions; while
processing the data the program will then only select those
observations that fit the remaining logical (case) selection
conditions, and update the results accordingly.
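The drill-down and drill-up operations described above can be sketched as a set of named selection conditions that is re-applied to the data whenever a condition is added or removed. This is only an illustrative model: the class DrillDownExplorer, its method names, and the sample records are hypothetical and are not the STATISTICA interface.

```python
class DrillDownExplorer:
    """Keeps a set of named selection conditions over a list of records."""

    def __init__(self, records):
        self.records = records
        self.conditions = {}  # condition name -> predicate on one record

    def drill_down(self, name, predicate):
        """Add a condition and return the observations that still qualify."""
        self.conditions[name] = predicate
        return self.selected()

    def drill_up(self, name):
        """Remove a previously added condition and return the updated selection."""
        self.conditions.pop(name, None)
        return self.selected()

    def selected(self):
        """Observations that satisfy all remaining conditions."""
        return [r for r in self.records if all(p(r) for p in self.conditions.values())]


records = [
    {"Gender": "F", "Average Purchase": 120.0},
    {"Gender": "M", "Average Purchase": 45.0},
    {"Gender": "F", "Average Purchase": 30.0},
    {"Gender": "M", "Average Purchase": 150.0},
]
explorer = DrillDownExplorer(records)
females = explorer.drill_down("Gender", lambda r: r["Gender"] == "F")
big_female = explorer.drill_down("Average Purchase", lambda r: r["Average Purchase"] >= 100)
big_any = explorer.drill_up("Gender")  # de-select Gender, keep the purchase condition
print(len(females), len(big_female), len(big_any))
```

Because conditions are stored by name rather than as a stack, any earlier condition can be de-selected, not only the most recent one.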
Applications of the Interactive Drill-Down Explorer.
The example shown earlier is very simple, exposing only the
basic functionality of the program. The real power of the STATISTICA
Interactive Drill-Down Explorer lies in the various
auxiliary results which can automatically be updated during the
interactive drill-down/up exploration: You can select a list of
variables for review, and compute for the selected cases:
- Descriptive statistics and frequency tables;
- Box-and-whiskers plots summarizing the distributions of
continuous variables;
- Scatterplot matrices summarizing the relationships between
continuous variables;
- All of the other statistical and graphical analyses available
in STATISTICA by extracting the observations belonging to
the current subset.
So for example, you could review the types of purchases that
customers made with different demographic characteristics, study
the effectiveness of certain drugs within different treatment
groups, ages, etc., or extract likely customers for a new
product from a database of previous customers based on careful
study of apparent (market) segments exposed by the drill-down
analysis.
Interactive Drill-Down Explorer and OLAP
(On-Line Analytic Processing). On the surface, the operation
of the simplest aspect of the Interactive Drill-Down Explorer
(exploration of multidimensional tables) is very similar to the
functionality offered by designated OLAP tools (such as those
offered in the optional OLAP add-on module for STATISTICA
Data Miner). OLAP tools allow users to quickly query a
database to extract observations and summary information about
those observations taking advantage of the optimized OLAP Server
facilities offered for a specific database platform (e.g.,
Oracle, or MS SQL Server), and often providing significant
performance advantages over tools based on traditional (non-OLAP
driven) query tools. However, the main advantages of the STATISTICA
Interactive Drill-Down Explorer over OLAP are:
(a) its tight integration with STATISTICA's flexible
categorization tools and exploratory environment (the analytic
capabilities provided in the STATISTICA Interactive
Drill-Down Explorer are much more comprehensive and also more
general than those of typical OLAP tools, supporting flexible "drill
up" operations, and allowing you to quickly review custom,
complex summary graphs, detailed descriptive statistics, etc.),
and
(b) the fact that the STATISTICA Interactive Drill-Down
Explorer is not limited to any particular database platform
and does not require a designated OLAP Server to be present
(e.g., it can operate directly on STATISTICA data files).
At the same time, by connecting the STATISTICA
application to a (remote) database for in-place processing, you can
efficiently perform drill-down operations on any data source,
regardless of whether or not designated OLAP tools are available
on the server.
GENERALIZED ADDITIVE MODELS (GAM). The STATISTICA Generalized
Additive Models facilities are an implementation of methods
developed and popularized by Hastie and Tibshirani (1990);
additional detailed discussion of these methods can also be
found in Schimek (2000). The program will handle continuous and
categorical predictor variables. Note that STATISTICA
includes a comprehensive selection of methods for fitting
non-linear models to data, such as the Nonlinear
Estimation module, Generalized Linear Models, General
Classification and Regression Trees (below), etc.
Distributions and link functions. The program allows
the user to choose from a wide variety of distributions for the
dependent variable, and link functions for the effects of the
predictor variables on the dependent variable:
Normal, Gamma, and Poisson distributions:
| Log link: | f(z) = log(z) |
| Inverse link: | f(z) = 1/z |
| Identity link: | f(z) = z |
Binomial distribution:
| Logit link: | f(z) = log(z/(1-z)) |
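The four link functions listed above can be written out directly. The sketch below pairs each link with its mathematical inverse (the inverses are standard results, added here for illustration); the dictionary name LINKS is an assumption and not a STATISTICA identifier.

```python
import math

# Each entry maps a link name to (link f(z), inverse of f).
LINKS = {
    "log": (math.log, math.exp),
    "inverse": (lambda z: 1.0 / z, lambda eta: 1.0 / eta),
    "identity": (lambda z: z, lambda eta: eta),
    "logit": (
        lambda z: math.log(z / (1.0 - z)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
}

z = 0.3  # a value valid for all four links (positive, and inside (0, 1) for logit)
for name, (link, inverse) in LINKS.items():
    eta = link(z)
    print(f"{name:8s} f(z) = {eta:.4f}  inverse(f(z)) = {inverse(eta):.4f}")
```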
Scatterplot smoother. The program uses the cubic
spline smoother with user-defined degrees of freedom to find an
optimum transformation (function) of the predictor variables.
Results statistics. The program will report a
comprehensive set of results statistics to aid in the evaluation
of model adequacy, model fit, and interpretation of results;
specifically, results include: the iteration history for the
model fitting computations, summary statistics including the
overall R-square value (computed from the deviance statistic),
model degrees of freedom, and detailed observational statistics
pertaining to the predicted response, residuals, and the
smoothing of the predictor variables. Results graphs include
plots of observed responses vs. residual responses, predicted
values vs. residuals, histograms of observed and residual
values, normal probability plots of residual values, and partial
residual plots for each predictor, indicating the cubic spline
smoothing fit for the final solution; for binary responses
(e.g., logit-models) lift charts can also be computed.
GENERAL CLASSIFICATION AND REGRESSION TREES (GTrees). This module is
a comprehensive implementation of the methods described as
C&RT by Breiman, Friedman, Olshen, and Stone (1984).
However, the GTrees module contains various extensions
and options that are typically not found in implementations of
this algorithm, and that are particularly useful for data mining
applications.
User interface; specifying "models." In
addition to standard analyses (as described by Breiman, et al.),
the implementation of these methods in STATISTICA allows
you to specify ANOVA/ANCOVA-like designs with continuous and/or
categorical predictor variables, and their interactions. Three
alternative user interfaces are provided to allow you to specify
such designs; these are analogous to the methods provided in GLM
(General Linear Models), GLZ (Generalized Linear Models),
GRM (General Regression Models), GDA (General
Discriminant Analysis Models), and PLS (General Partial
Least Squares Models), and are described in detail in the
respective sections. In short, ANOVA/ANCOVA-like predictor
designs can be specified via dialogs, Wizards, or (design)
command syntax; moreover, the command syntax is compatible
across modules, so you can quickly apply identical designs to
very different analyses (e.g., compare the quality of
classification using GDA vs. GTrees).
Tree pruning, selection, validation. The program provides a large
number of options for controlling the building of the tree(s),
the pruning of the tree(s), and the selection of the
best-fitting solution. For continuous dependent (criterion)
variables, pruning of the tree can be based on the variance, or
on FACT-style pruning. For categorical dependent (criterion)
variables, pruning of the tree can be based on misclassification
errors, variance, or FACT-style pruning. You can specify the
maximum number of nodes for the tree or the minimum n per node.
Options are provided for validating the best decision tree,
using V-fold cross validation, or by applying the decision tree
to new observations in a validation sample. For categorical
dependent (criterion) variables, i.e., for classification
problems, various measures can be chosen to modify the algorithm
and to evaluate the quality of the final classification tree:
Options are provided to specify user-defined prior
classification probabilities and misclassification costs;
goodness-of-fit measures include the Gini measure, Chi-square,
and G-Square.
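To make the Gini measure mentioned above concrete, the following sketch computes the Gini impurity of a node and the improvement obtained by a candidate split, using the textbook definition from the C&RT literature. It assumes equal priors and unit misclassification costs; the function names are illustrative and the module's own computations may differ in detail.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_improvement(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


parent = ["a", "a", "b", "b"]
print(gini(parent))                                          # impurity of a 50/50 node
print(gini_improvement(parent, ["a", "a"], ["b", "b"]))      # split that separates the classes
print(gini_improvement(parent, ["a", "b"], ["a", "b"]))      # split that separates nothing
```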
Missing data and surrogate splits. Missing data values in the
predictors can be handled by allowing the program to determine
splits for surrogate variables, i.e., variables that are similar
to the respective variable used for a particular split (node).
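One common way to make "similar" precise is to pick, for a candidate surrogate variable, the cut point whose left/right assignment agrees most often with the primary split. The sketch below illustrates that idea under this assumption; the variable names (income, age), the function names, and the agreement criterion are illustrative and are not taken from the STATISTICA implementation.

```python
def agreement(goes_left, surrogate_values, threshold):
    """Share of cases that the surrogate cut sends the same way as the primary split."""
    pairs = [(side, v) for side, v in zip(goes_left, surrogate_values) if v is not None]
    return sum((v <= threshold) == side for side, v in pairs) / len(pairs)


def best_surrogate_threshold(goes_left, surrogate_values):
    """Cut point on the surrogate variable with the highest agreement."""
    candidates = sorted({v for v in surrogate_values if v is not None})
    return max(candidates, key=lambda t: agreement(goes_left, surrogate_values, t))


def route_left(primary_value, primary_threshold, surrogate_value, surrogate_threshold):
    """Use the primary split when its value is present, else fall back to the surrogate."""
    if primary_value is not None:
        return primary_value <= primary_threshold
    return surrogate_value <= surrogate_threshold


income = [20, 30, 40, 60, 70, 80]      # primary split variable (hypothetical)
age = [25, 28, 33, 45, 50, 61]         # candidate surrogate (hypothetical)
goes_left = [v <= 50 for v in income]  # primary split: income <= 50
age_cut = best_surrogate_threshold(goes_left, age)
print(age_cut, route_left(None, 50, 30, age_cut))
```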
ANOVA/ANCOVA-like designs. In addition to the
traditional C&RT-style analysis, you can combine categorical
and continuous predictor variables into ANOVA/ANCOVA-like
designs and perform the analysis using a design matrix for the
predictors. This allows you to evaluate and compare complex
predictor models, and their efficacy for prediction and
classification using various analytic techniques (e.g., General
Linear Models, Generalized Linear Models, General
Discriminant Analysis Models, etc.).
Tree browser. In addition to simple summary tree graphs, you can
display the results trees in intuitive interactive tree-browsers
that allow you to collapse or expand the nodes of the tree, and
to quickly review the most salient information regarding the
respective tree node or classification. For example, you can
highlight (click on) a particular node in the browser-panel and
immediately see the classification and misclassification rates
for that particular node. The tree-browser provides a very
efficient and intuitive facility for reviewing complex
tree-structures, using methods that are commonly used in
windows-based computer applications to review hierarchically
structured information. Multiple tree-browsers can be displayed
simultaneously, containing the final tree, and different
sub-trees pruned from the larger tree, and by placing multiple
browsers side-by-side it is easy to compare different tree
structures and sub-trees. The STATISTICA Tree Browser is
an important innovation to aid with the interpretation of
complex decision trees.
Interactive trees. Options are also provided to review
trees interactively, either using STATISTICA Graphics
brushing tools or by placing large tree graphs into scrollable
graphics windows where large graphs can be inspected
"behind" a smaller (scrollable) window.
Results statistics. The STATISTICA GTrees
module provides a very large number of results options. Summary
results for each node are accessible, detailed statistics are
computed pertaining to classification, classification costs,
gain, and so on. Unique graphical summaries are also available,
including histograms (for classification problems) for each
node, detailed summary plots for continuous dependent variables
(e.g., normal probability plots, scatterplots), and parallel
coordinate plots for each node, providing an efficient summary
of patterns of responses for large classification problems. As
in all statistical procedures of STATISTICA, all
numerical results can be used as input for further analyses,
allowing you to quickly explore and further analyze observations
classified into particular nodes (e.g., you could use the GTrees
module to produce an initial classification of cases, and then
use best-subset selection of variables in GDA to find
additional variables that may aid in the further
classification).
C, C++, STATISTICA Visual Basic, SQL Code
generators. The information contained in the final tree can
be quickly incorporated into your own custom programs or
database queries via the optional C, C++, STATISTICA
Visual Basic, or SQL query code generator options. The STATISTICA
Visual Basic code will be generated in a form that is particularly well
suited for inclusion in custom nodes for STATISTICA Data
Miner.
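To show what turning a final tree into a database query can look like in principle, here is a hedged sketch that renders a small binary tree as a nested SQL CASE expression. The nested-dictionary tree format, the column names, and the function tree_to_sql are hypothetical; the code actually produced by the STATISTICA generators is not shown in this description and will differ.

```python
def tree_to_sql(node, indent=0):
    """Render a binary decision tree as a nested SQL CASE expression."""
    pad = "  " * indent
    if "label" in node:  # terminal node: emit the predicted class
        return f"{pad}'{node['label']}'"
    return (
        f"{pad}CASE WHEN {node['column']} <= {node['threshold']} THEN\n"
        f"{tree_to_sql(node['left'], indent + 1)}\n"
        f"{pad}ELSE\n"
        f"{tree_to_sql(node['right'], indent + 1)}\n"
        f"{pad}END"
    )


# Hypothetical two-split tree on columns "income" and "age".
tree = {
    "column": "income",
    "threshold": 50,
    "left": {"label": "low"},
    "right": {
        "column": "age",
        "threshold": 40,
        "left": {"label": "mid"},
        "right": {"label": "high"},
    },
}
sql = tree_to_sql(tree)
print(sql)
```

The resulting expression could be placed in the SELECT list of a query so that classification happens inside the database rather than after exporting the data.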
GENERAL CHAID (Chi-square Automatic Interaction Detection) MODELS.
Like the implementation of General
Classification and Regression Trees (GTrees, above) in STATISTICA,
the General CHAID module provides not only a
comprehensive implementation of the original technique, but also
extends these methods to the analysis of ANOVA/ANCOVA-like
designs.
Standard CHAID. The CHAID analysis can be performed for both
continuous and categorical dependent (criterion) variables.
Numerous options are available to control the construction of
hierarchical trees: the user has control over the minimum n per
node, maximum number of nodes, and probabilities for splitting
and for merging categories; the user can also request exhaustive
searches for the best solution (Exhaustive CHAID); V-fold
validation statistics can be computed to evaluate the stability
of the final solution; for classification problems, user-defined
misclassification costs can be specified.
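The role of the merging probability can be illustrated with a simplified sketch of one CHAID merge step for a two-class dependent variable: every pair of predictor categories is compared with a Pearson chi-square test, and the least distinguishable pair is the candidate for merging if its p-value exceeds the merge threshold. The threshold value 0.05, the category names, and the function names are assumptions for illustration; the p-value formula used here is exact only for one degree of freedom.

```python
import math
from itertools import combinations


def chi2_two_categories(a, b):
    """Pearson chi-square for two categories, each given as (n_class0, n_class1)."""
    n = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        for j in range(2):
            expected = sum(row) * (a[j] + b[j]) / n
            stat += (row[j] - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def most_similar_pair(table):
    """Return (p_value, category_a, category_b) for the least distinguishable pair."""
    return max(
        (p_value_df1(chi2_two_categories(table[x], table[y])), x, y)
        for x, y in combinations(sorted(table), 2)
    )


# Hypothetical class counts per predictor category.
table = {"north": (30, 10), "south": (28, 12), "west": (10, 30)}
MERGE_ALPHA = 0.05  # assumed merge threshold
p, cat_a, cat_b = most_similar_pair(table)
if p > MERGE_ALPHA:
    print(f"merge {cat_a} and {cat_b} (p = {p:.3f})")
```

A full implementation would repeat this step until no pair exceeds the threshold and would then test the merged predictor against the splitting probability.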
ANOVA/ANCOVA-like designs. In addition to the
traditional CHAID analysis, you can combine categorical
and continuous predictor variables into ANOVA/ANCOVA-like
designs and perform the analysis using a design matrix for the
predictors. This allows you to evaluate and compare complex
predictor models, and their efficacy for prediction and
classification using various analytic techniques (e.g., General
Linear Models, Generalized Linear Models, General Discriminant
Analysis Models, General Classification and Regression Tree
Models, etc.). Refer also to the description of GLM
(General Linear Models) and General
Classification and Regression Trees (GTrees), above, for
details.
Tree browser. Like the binary results tree used to summarize
binary classification and regression trees (see GTrees),
the results of the CHAID analysis can be reviewed in the STATISTICA
Tree Browser. This unique tree browser provides a very efficient
and intuitive facility for reviewing complex tree-structures and
for comparing multiple tree-solutions side-by-side (in multiple
tree-browsers), using methods that are commonly used in
windows-based computer applications to review hierarchically
structured information. The STATISTICA Tree Browser is an
important innovation to aid with the interpretation of complex
decision trees. For additional details, see also the description
of the tree browser in the context of the General
Classification and Regression Trees (GTrees), above.
Results statistics. The STATISTICA General CHAID
Models module provides a very large number of results
options. Summary results for each node are accessible, detailed
statistics are computed pertaining to classification,
classification costs, and so on. Unique graphical summaries are
also available, including histograms (for classification
problems) for each node, detailed summary plots for continuous
dependent variables (e.g., normal probability plots,
scatterplots), and parallel coordinate plots for each node,
providing an efficient summary of patterns of responses for
large classification problems. As in all statistical procedures
of STATISTICA, all numerical results can be used as input
for further analyses, allowing you to quickly explore and
further analyze observations classified into particular nodes
(e.g., you could use the GTrees
module to produce an initial classification of cases, and then
use best-subset selection of variables in GDA to find
additional variables that may aid in the further
classification).
GOODNESS OF FIT COMPUTATIONS. The STATISTICA
Goodness of Fit module will compute various goodness of fit
statistics for continuous and categorical response variables
(for regression and classification problems). This module is
specifically designed for data mining applications to be
included in "competitive evaluation of models"
projects as a tool to choose the best solution. The program uses
as input the predicted values or classifications as computed
from any of the STATISTICA modules for regression and
classification, and computes a wide selection of fit statistics as
well as graphical summaries for each fitted response or
classification. Goodness of fit statistics for continuous
responses include least squares deviation (LSD), average
deviation, relative squared error, relative absolute error, and
the correlation coefficient. For classification problems (for
categorical response variables), the program will compute
Chi-square, G-square (maximum likelihood chi-square), percent
disagreement (misclassification rate), quadratic loss, and
information loss statistics.
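Several of the statistics named above can be computed from nothing more than observed and predicted values. The sketch below uses common textbook definitions, which is an assumption: the exact formulas used by the module (for example, whether the least squares deviation is averaged over cases) are not given in this description. Function and key names are illustrative.

```python
import math


def regression_fit(actual, predicted):
    """Fit statistics for a continuous response, using textbook definitions."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    ss_a = sum((a - mean_a) ** 2 for a in actual)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    return {
        "mean_squared_deviation": sq_err / n,
        "average_deviation": abs_err / n,
        # Errors relative to always predicting the mean of the observed values.
        "relative_squared_error": sq_err / ss_a,
        "relative_absolute_error": abs_err / sum(abs(a - mean_a) for a in actual),
        "correlation": cov / math.sqrt(ss_a * ss_p),
    }


def misclassification_rate(actual, predicted):
    """Percent disagreement expressed as a fraction of all cases."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


fit = regression_fit([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
rate = misclassification_rate(["a", "b", "a", "b"], ["a", "b", "b", "b"])
print(fit, rate)
```

Because the functions take only predictions as input, the same comparison can be run on the output of any model, which is what a competitive evaluation of models requires.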
|