Extended Isolation Forest for Distributed Spark/Scala Anomaly Detection

March 18, 2026

Extended Isolation Forest support for LinkedIn's open-source Spark/Scala isolation-forest library, including sparse random hyperplane splits, benchmark parity checks, synthetic score maps, and validation evidence. The work also became a useful case study in how to validate AI-assisted production code with evidence rather than trust.

I added Extended Isolation Forest (EIF) to LinkedIn’s open-source Spark/Scala isolation-forest library. EIF keeps the same isolation-score idea as standard Isolation Forest, but changes the split geometry: instead of partitioning one feature at a time, it partitions with random hyperplanes.

I originally created and open-sourced this Spark/Scala implementation in 2019. It is used widely in production and supports distributed training and scoring, Spark ML pipeline integration, model persistence, and ONNX export for standard Isolation Forest. EIF landed in PR #79 on March 18, 2026, was introduced in v4.1.0, and is available in the isolation-forest repository.

The change is additive. Existing standard Isolation Forest APIs, Spark ML pipelines, saved-model loading, and standard-IF ONNX export behavior remain backward-compatible. The release also tightens validation for edge cases such as empty ensembles, too-small maxSamples values, and feature vectors whose dimension does not match the model’s training dimension.

The Scoring Model

Standard Isolation Forest, introduced by Liu, Ting, and Zhou in 2008, scores points by how quickly random trees isolate them. Points isolated by shorter paths receive higher anomaly scores; points that require longer paths receive lower scores.

The usual score is:

s(x, psi) = 2^(-E[h(x)] / c(psi))

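where h(x) is the path length of x in a single tree, E[h(x)] is the path length averaged over the ensemble, psi is the subsample size, and c(psi) is the average path length of an unsuccessful search in a binary search tree built on psi points, used for normalization. A point whose average path length equals c(psi) scores exactly 0.5.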
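As a minimal sketch of the formula above, the following Python computes c(psi) and the score from an ensemble-average path length. It uses the harmonic-number approximation H(i) ≈ ln(i) + 0.5772 from the Liu et al. paper. The function names and the small-psi special cases are assumptions made for illustration, not the library's Scala code.

```python
import math

EULER_GAMMA = 0.5772156649015329


def harmonic(n: int) -> float:
    """Approximate the n-th harmonic number as ln(n) + Euler's constant."""
    return math.log(n) + EULER_GAMMA


def c(psi: int) -> float:
    """Average path length of an unsuccessful BST search over psi points."""
    if psi <= 1:
        return 0.0  # assumption: nothing left to isolate
    if psi == 2:
        return 1.0  # assumption: common convention for two points
    return 2.0 * harmonic(psi - 1) - 2.0 * (psi - 1) / psi


def anomaly_score(mean_path_length: float, psi: int) -> float:
    """s(x, psi) = 2^(-E[h(x)] / c(psi)); requires psi >= 2."""
    return 2.0 ** (-mean_path_length / c(psi))
```

With psi = 256, c(psi) is a little over 10, so a point isolated after about 5 splits on average scores well above 0.5, while a point that needs about 10 splits scores close to 0.5.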

EIF does not change that interpretation. It changes the tree geometry used to produce the path lengths.

The Axis-Aligned Bias Problem

Standard Isolation Forest builds each tree with axis-aligned splits: choose one feature, choose a split value inside that feature’s observed range, and send the point left or right based on that coordinate.

That works well in many settings, but it gives the score map a directional bias. In two dimensions, the artifacts are visible as rectangular bands and ghost-like regions where similarly unusual points receive inconsistent scores. The problem is most obvious when features are correlated or when the data distribution is rotated relative to the coordinate axes.

How EIF Changes the Split

Extended Isolation Forest, proposed by Hariri, Carrasco Kind, and Brunner, replaces axis-aligned splits with random hyperplane splits. Each split samples a normal vector and a point in the node’s bounding box. A point is routed by the sign of:

(x - p) · n

where x is the scored point, p is the sampled point on the split plane, and n is the random normal vector.
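The routing rule can be sketched in a few lines of Python. The function names are hypothetical, and sending a point that lies exactly on the plane to the left child is an assumption; the library's tie handling may differ.

```python
from typing import Sequence


def hyperplane_side(x: Sequence[float], p: Sequence[float], n: Sequence[float]) -> float:
    """Signed value of (x - p) . n for a dense normal vector n."""
    return sum((xi - pi) * ni for xi, pi, ni in zip(x, p, n))


def goes_left(x: Sequence[float], p: Sequence[float], n: Sequence[float]) -> bool:
    """Route left when the point is on the non-positive side of the plane."""
    return hyperplane_side(x, p, n) <= 0.0
```

An axis-aligned split is the special case where n has a single non-zero coordinate. With p at the origin, the point (1, -2) goes right under n = (1, 0) but left under the diagonal normal n = (1, 1).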

The main parameter is extensionLevel, which controls how many coordinates participate in each hyperplane:

extensionLevel = 0: one coordinate is non-zero, giving axis-aligned EIF behavior.
extensionLevel = numFeatures - 1: all coordinates can be non-zero, giving fully extended hyperplanes.
Intermediate values: provide a continuum between the two.

Concretely, on a 10-feature dataset, extensionLevel = 3 means each split uses 4 non-zero coordinates in its hyperplane normal vector. extensionLevel = 9 means each split can use all 10 features.
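To make the counting concrete, here is a hedged Python sketch that samples a sparse normal vector with extensionLevel + 1 active coordinates. Drawing the weights from a standard normal distribution follows the EIF paper; the function name, the return shape, and the absence of normalization are assumptions, and the library's own sampling code may differ.

```python
import random
from typing import List, Tuple


def sample_sparse_normal(
    num_features: int, extension_level: int, rng: random.Random
) -> Tuple[List[int], List[float]]:
    """Pick extension_level + 1 active coordinates and a Gaussian weight for each."""
    if not 0 <= extension_level <= num_features - 1:
        raise ValueError(
            f"extension_level must be in [0, {num_features - 1}], got {extension_level}"
        )
    # Active coordinates are distinct feature indices, kept sorted for readability.
    active = sorted(rng.sample(range(num_features), extension_level + 1))
    weights = [rng.gauss(0.0, 1.0) for _ in active]
    return active, weights
```

For example, `sample_sparse_normal(10, 3, random.Random(7))` returns four indices and four weights, and `extension_level = 9` activates all ten features.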

In the implementation, the valid range is based on the resolved feature subspace for each tree. If maxFeatures restricts each tree to a subset of features, extensionLevel is interpreted relative to that subspace rather than the original input dimensionality.

One detail is worth making explicit: extensionLevel = 0 is intentionally close to standard Isolation Forest, but it is not identical. Standard IF retries when it samples a constant feature; EIF does not retry and instead follows the reference EIF split semantics, including zero-size leaves. That difference matters for parity with the original Python and C++ implementations.

Seeing the Difference

The effect is especially easy to see in two dimensions. The following heatmaps, generated with the library, show outlier scores across the feature space for three synthetic datasets.

In each case, Standard Isolation Forest on the left shows cross-shaped artifacts along the feature axes. Extended Isolation Forest on the right produces smoother, less axis-biased contours.

Single blob heatmap: Standard Isolation Forest vs Extended Isolation Forest

Single blob: EIF produces more radial score contours where standard IF shows axis-aligned artifacts.

Two blobs heatmap: Standard Isolation Forest vs Extended Isolation Forest

Two blobs: EIF produces fewer ghost-like score artifacts between and around the clusters.

Sinusoid heatmap: Standard Isolation Forest vs Extended Isolation Forest

Sinusoid: EIF better tracks the non-axis-aligned data distribution.

These artifacts are not just cosmetic. They correspond to regions where the model assigns inconsistent anomaly scores, including under-scoring unusual points that fall in ghost-like low-score regions.

The plots also served as validation artifacts. If the EIF implementation had produced the same cross-shaped artifacts as standard IF, or if the score contours failed to track the known synthetic structure, that would have been evidence that the implementation was wrong.

Benchmark Results

I benchmarked three configurations across 13 standard outlier-detection datasets:

Standard Isolation Forest
EIF with extensionLevel = 0
Fully extended EIF

I compared the results against the original Liu et al. Isolation Forest paper and the reference Python EIF implementation from Hariri et al. All experiments used 100 trees, 256 samples per tree, and 10 trials with distinct random seeds.

These benchmarks are endpoint comparisons, not an exhaustive extensionLevel sweep. They validate the standard IF implementation, the axis-aligned EIF endpoint, and the fully extended EIF endpoint against published and reference implementations.

I have not yet systematically benchmarked intermediate extension levels across all 13 datasets, but I did run a targeted sweep on Ionosphere. In that sweep, AUROC increased from about 0.86 at extensionLevel = 0 to about 0.91 at full extension, with intermediate values improving along the way.

The result is not that EIF is universally better. The result is more specific: fully extended EIF helps most when axis-aligned bias is actually a limitation.

The higher-dimensional datasets showed the strongest case for fully extended EIF. In the representative rounded results below, fully extended EIF improved Ionosphere and Satellite, was comparable on Arrhythmia and Cardio, and was worse on the lower-dimensional Mulcross and HTTP datasets.

A few representative results:

Dataset	Dim	Standard IF AUROC	Standard IF AUPRC	Fully extended EIF AUROC	Fully extended EIF AUPRC	Takeaway
Ionosphere	33	0.84	0.80	0.91	0.88	EIF higher
Satellite	36	0.72	0.67	0.73	0.70	EIF higher
Arrhythmia	274	0.81	0.49	0.81	0.50	Comparable
Cardio	21	0.93	0.57	0.93	0.54	Comparable
Mulcross	4	0.99	0.85	0.94	0.44	Standard IF higher
HTTP (KDDCUP99)	3	0.9997	0.93	0.994	0.38	Standard IF higher

The parity checks were as important as the headline comparisons:

My standard IF results closely match the original Liu et al. paper.
My fully extended EIF results closely match the reference Python EIF implementation.
EIF with extensionLevel = 0 closely matches the reference EIF implementation at the same extension level.

That gives me confidence that the implementation is behaving as intended, rather than merely producing plausible-looking scores.

The full benchmark table, including AUROC/AUPRC results, standard errors, Liu et al. comparisons, and reference Python EIF comparisons, is available in the repository README.
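For readers who want to reproduce this kind of summary, the sketch below shows one way to compute AUROC for a single trial and then a mean and standard error across trials. The benchmark harness itself is not shown in this post, so the function names and the rank-based AUROC formulation are assumptions rather than the code behind the table.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUROC (Mann-Whitney U), using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks: List[float] = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of per-trial metrics (needs at least two trials)."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem
```

In a setup like the one described, each of the 10 seeded trials would yield one AUROC value, and `mean_and_standard_error` would summarize them into a single table entry.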

Validating AI-Produced Code With Evidence

Because much of this implementation was AI-assisted, I treated generated code as untrusted until it produced independent evidence that matched the algorithm, reference implementation, and expected edge-case behavior. Validation artifacts became first-class outputs.

Instead of relying on line-by-line review alone, I asked for evidence that would be easy to inspect and hard for a broken implementation to satisfy:

score heatmaps on synthetic datasets where the expected behavior is visually obvious
benchmark comparisons against the original Isolation Forest paper and the reference Python EIF implementation
edge-case tests for degenerate splits, persistence, constant features, tiny datasets, seed reproducibility, and invalid parameter values

The plots were especially useful. On simple two-dimensional datasets, standard Isolation Forest should show axis-aligned artifacts, and EIF should reduce them. That gives a fast visual check that the implementation has the expected qualitative behavior before reviewing lower-level code paths.

The validation artifacts also caught real issues.

One example was degenerate split handling. An early implementation attempted to avoid empty partitions and retry degenerate splits. Benchmark mismatches showed that the correct behavior was to match the EIF reference implementation and allow zero-size leaves.
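A small Python sketch shows why zero-size leaves arise naturally with hyperplane splits. The split point p lies inside the bounding box of the data, yet the plane misses every point, so one child is empty. The function name and the tie-handling convention are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def partition(
    points: Sequence[Point], p: Sequence[float], n: Sequence[float]
) -> Tuple[List[Point], List[Point]]:
    """Split points by the sign of (x - p) . n, with no retry on empty sides."""
    left: List[Point] = []
    right: List[Point] = []
    for x in points:
        side = sum((xi - pi) * ni for xi, pi, ni in zip(x, p, n))
        (left if side <= 0.0 else right).append(x)
    return left, right


# Two points whose bounding box is the unit square.
points = [(0.0, 1.0), (1.0, 0.0)]
# p is inside the box, but the diagonal plane through p leaves both points on one side.
left, right = partition(points, p=(0.9, 0.9), n=(1.0, 1.0))
```

Here `right` is empty. An implementation that retries until both children are non-empty changes the distribution of path lengths, which is the kind of difference a benchmark comparison against a reference implementation can surface.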

Another example was persistence. Spark 4.x save/load validation exposed a precision mismatch in the Avro-backed model representation: hyperplane weights did not round-trip at full double precision.

A third example was extensionLevel = 0. It produces axis-aligned EIF splits, but it is not identical to standard Isolation Forest. The split direction is similar, but retry behavior, intercept sampling, and random-number consumption differ. The benchmark and edge-case comparisons made that distinction visible.

The checked-in tests covered the parts that visual inspection cannot: training and scoring, parameter validation, persistence, zero contamination, saved-model structure, sparse hyperplane invariants, zero-size leaves, feature-dimension validation, and constant-feature edge cases.

I also ran a separate edge-case study outside the main benchmark table. It covered hyperparameter sweeps, contamination behavior, seed reproducibility, save/load equality, low-dimensional data, constant-feature data, all-constant data, and tiny datasets. All 61 checks passed.

The point was not to choose between tests and plots. The tests protected invariants; the plots made model behavior visually verifiable.

The workflow that worked best was:

define what correctness should look like
generate artifacts that would falsify the implementation if it were wrong
inspect those artifacts first
review the code paths behind surprising results

AI did not replace review. It changed the review target from “read every generated line” to “evaluate the evidence, then inspect the code paths behind surprising results.”

Implementation Highlights

The EIF implementation keeps the public Spark ML surface aligned with standard Isolation Forest, while isolating the new behavior to split generation, split representation, and node scoring.

Sparse hyperplane representation. Each EIF split stores only the active coordinates of the random hyperplane: feature indices, weights, and offset. Dense normal vectors are not materialized. Model size and per-node scoring cost therefore scale with extensionLevel + 1, not with the full input dimensionality. With extensionLevel = 3, a node evaluates a four-term dot product.
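The idea can be sketched in Python as follows. The class and field names are hypothetical, and defining the offset as n · p (so that the test (x - p) · n becomes x · n - offset) is an assumption about how the three stored values relate; the library's Scala representation may differ in detail.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SparseHyperplane:
    indices: Tuple[int, ...]    # active feature indices
    weights: Tuple[float, ...]  # one weight per active index
    offset: float               # assumed to equal n . p

    def margin(self, x: Sequence[float]) -> float:
        """x . n - offset, touching only the active coordinates."""
        dot = 0.0
        for i, w in zip(self.indices, self.weights):
            dot += x[i] * w
        return dot - self.offset


def from_normal_and_point(
    indices: Sequence[int], weights: Sequence[float], p: Sequence[float]
) -> SparseHyperplane:
    """Fold the split point p into a scalar offset so p need not be stored."""
    offset = sum(p[i] * w for i, w in zip(indices, weights))
    return SparseHyperplane(tuple(indices), tuple(weights), offset)
```

On a 10-feature vector with four active indices, `margin` performs four multiply-adds regardless of the input dimensionality, and its result matches the dense (x - p) · n computed with zeros in the inactive coordinates.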

Spark ML integration. EIF uses the same Spark ML Estimator / Model contract as standard Isolation Forest. It works in Spark ML Pipelines and follows the same distributed model persistence pattern.

Persistence across Spark versions. As noted above, Spark 4.x save/load validation showed that hyperplane weights did not round-trip at full double precision through the Avro-backed model representation.

The persisted schema now makes the intended precision explicit: hyperplane weights are stored as floats, offsets are stored as doubles, and scoring accumulates the sparse dot product in double precision.

That split matches the role of each value. The weights encode a random direction, where float precision is sufficient; the offset fixes the split location and remains double precision so the decision boundary is stable across save/load.
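The precision argument can be checked with a short Python sketch that round-trips a value through 32-bit and 64-bit IEEE 754 encodings. It uses the standard `struct` module rather than Avro, so it only illustrates the numeric behavior, not the library's persistence code.

```python
import struct


def roundtrip_float32(value: float) -> float:
    """Encode as a 32-bit float and decode back to a Python float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def roundtrip_float64(value: float) -> float:
    """Encode as a 64-bit double and decode back to a Python float."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


weight = 0.1
as_float = roundtrip_float32(weight)   # close to 0.1, but not equal to it
as_double = roundtrip_float64(weight)  # exactly equal to the original value
```

A double pushed through a float field comes back changed, which is the kind of mismatch a save/load equality check detects. A value that is already float-representable survives further float round-trips unchanged, so once weights are defined as floats the saved and loaded models agree.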

Choosing Between IF and EIF

I treat standard Isolation Forest as the default baseline, especially when the data is low-dimensional, axis-aligned splits already work well, or ONNX export is required.

I reach for Extended Isolation Forest when my data has correlated features, when standard Isolation Forest produces suspicious axis-aligned score patterns, or when I want a more orientation-robust anomaly detector.

When I use EIF, I treat extensionLevel as a hyperparameter. extensionLevel = numFeatures - 1 is a useful default and an important reference point, but it is not a guarantee of best performance. Intermediate values can be the right compromise when fully extended splits are not the best empirical fit.

Getting Started

The library is available in the isolation-forest repository, with artifacts published to Maven Central. Full documentation, examples, and benchmark details are in the project README. The synthetic plot scripts are linked from the Resources section below.

Resources

References

F. T. Liu, K. M. Ting, and Z.-H. Zhou. “Isolation Forest.” 2008 Eighth IEEE International Conference on Data Mining, 2008.
S. Hariri, M. Carrasco Kind, and R. J. Brunner. “Extended Isolation Forest.” IEEE Transactions on Knowledge and Data Engineering, 2021. Also available as arXiv:1811.02141.
S. Hariri. “eif: Extended Isolation Forest for Anomaly Detection.”
J. Verbus. “isolation-forest.” Software, 2019. BSD-2-Clause.
