Apache Spark Engineering Intelligence Digest
February 14 -- 21, 2026 (7-day window)
Summary
Apache Spark's review layer is doing engineering work that dashboards cannot see. In a week where 163 PRs touched the repository, the GitHub API reports zero as "merged" because accepted changes land through the rebase-and-close workflow that Apache projects commonly use. The real story is that a small group of 5-6 senior reviewers shaped every significant change while authoring almost no code themselves: cloud-fan reviewed 81 PRs with 76 review comments while opening only 1 PR, and anishshri-db reviewed 44 PRs with 52 comments while opening none. The project's highest-complexity work this week centered on the Structured Streaming state store redesign, where HeartSaVioR authored three PRs with an average complexity score of 0.506 and absorbed 64 review comments, the most in-depth review discussion of any contributor.
Note on data windows: GitHub data covers Feb 14-21, 2026 (7 days). Jira data covers Jan 22 -- Feb 21, 2026 (30 days) with 18,609 total issues across 3,171 contributors. Statistics from each source are window-specific and not directly comparable.
Note on merge workflow: Apache Spark uses a rebase-and-close workflow where maintainers cherry-pick commits and close PRs without GitHub's "merge" button. The GitHub API reports 0 merged PRs, but many PRs in this window were accepted and integrated. PR acceptance is confirmed by maintainer comments such as "Merged to master" or "Merged to branch-4.1."
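Because the API's merged flag is not usable here, acceptance has to be inferred from comment text. The following is a minimal sketch of that inference, assuming sign-offs follow the two phrasings quoted above; the `pr` dict shape, the function names, and the regular expression are illustrative assumptions, not a GitHub API schema or the method used to produce this digest.

```python
import re

# Matches sign-offs such as "Merged to master" or "Merged to branch-4.1".
# The phrasing comes from the note above; real comments may vary, so this
# pattern is an assumption and will miss other wordings.
MERGED_RE = re.compile(
    r"\bmerged\s+(?:in)?to\s+(master|branch-\d+(?:\.\d+)*)",
    re.IGNORECASE,
)


def merged_branches(comments):
    """Return branches named in merge sign-offs, in first-seen order, no duplicates."""
    seen = []
    for body in comments:
        for match in MERGED_RE.finditer(body):
            branch = match.group(1).lower()
            if branch not in seen:
                seen.append(branch)
    return seen


def is_accepted(pr):
    """Treat a closed PR as accepted if any comment carries a merge sign-off.

    `pr` is a hypothetical dict with `state` and `comments` keys.
    """
    return pr["state"] == "closed" and bool(merged_branches(pr["comments"]))
```

A closed PR with no matching comment is left unclassified rather than counted as rejected, since a sign-off may simply use different wording.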
Highlights
1. Structured Streaming State Store Redesign: The Week's Hardest Engineering
The highest-complexity work this week was HeartSaVioR's (Jungtaek Lim) overhaul of the Structured Streaming state store. Three PRs, all medium-to-high complexity, introduced fundamental changes to how Spark handles state in stream-stream joins:
- #53930 (SPARK-55144, complexity 0.570, probing ratio 0.30): Introduced a new state format version for performant stream-stream joins. +1,406/-382 lines across 6 files. This PR accumulated 39 review comments, with anishshri-db pushing back on naming conventions ("Could we make this more generic? Don't think we should call out eventTime as such") and micheal-o probing the abstraction boundary ("We want the APIs at the state store layer to be generic enough for reusability. This implementation is too tied to EventTime which is an operator level detail that shouldn't be coupled into the state store").
- #53911 (SPARK-55129, complexity 0.498, probing ratio 0.13): Introduced new key encoders for timestamp as a first-class type. +725/-2 lines. 42 review comments through 19 review rounds.
- #54278 (SPARK-55494, complexity 0.451): Introduced iterator/prefixScan with multi-values in the StateStore API. +313/-9 lines. 10 review comments over 7 rounds.
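The digest does not record the byte layout these encoders use. As general background on why a timestamp-aware key encoding matters in a sorted key-value store, the sketch below shows one well-known order-preserving encoding: offset a signed 64-bit value by 2^63 and write it big-endian, so that byte-wise key order matches numeric order and a range scan can stop at a cutoff. All names here are hypothetical, and this is not the code from #53911 or #54278.

```python
import bisect
import struct

SIGN_BIT = 1 << 63


def encode_ts(ts_micros):
    """Encode a signed 64-bit timestamp so that byte order equals numeric order."""
    if not -SIGN_BIT <= ts_micros < SIGN_BIT:
        raise ValueError("timestamp out of signed 64-bit range")
    # Adding 2**63 maps [-2**63, 2**63) onto [0, 2**64) monotonically,
    # which is the same as flipping the sign bit of the two's-complement form.
    return struct.pack(">Q", ts_micros + SIGN_BIT)


def decode_ts(encoded):
    """Invert encode_ts."""
    return struct.unpack(">Q", encoded)[0] - SIGN_BIT


def keys_before(sorted_keys, cutoff_micros):
    """Return the encoded keys whose timestamp is strictly below the cutoff.

    `sorted_keys` must already be sorted byte-wise, as a sorted store would
    keep them.
    """
    cut = bisect.bisect_left(sorted_keys, encode_ts(cutoff_micros))
    return sorted_keys[:cut]
```

With a plain little-endian or unshifted encoding, negative timestamps would sort after positive ones and the scan could not stop early.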
anishshri-db was the primary reviewer for all three PRs (18 review interactions with HeartSaVioR), functioning as an architectural gatekeeper for the streaming subsystem. With zero PRs authored, 44 PRs reviewed, and 52 review comments, this is pure review-and-shape work, the kind of contribution that PR-count metrics erase entirely.
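The aggregate figures quoted for these three PRs elsewhere in this digest (an average complexity of 0.506 and +2,444/-393 lines) follow directly from the per-PR numbers listed above. A short sketch of that arithmetic, with the tuple layout and the `summarize` name chosen only for illustration:

```python
from statistics import mean

# (PR number, complexity score, lines added, lines deleted),
# copied from the three PR entries above.
STATE_STORE_PRS = [
    (53930, 0.570, 1406, 382),
    (53911, 0.498, 725, 2),
    (54278, 0.451, 313, 9),
]


def summarize(prs):
    """Return average complexity (3 decimals) and total lines added/deleted."""
    return {
        "avg_complexity": round(mean(score for _, score, _, _ in prs), 3),
        "added": sum(added for _, _, added, _ in prs),
        "deleted": sum(deleted for _, _, _, deleted in prs),
    }


summary = summarize(STATE_STORE_PRS)
```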
2. Custom Metrics Fix: A Collaborative Chain Across Three Contributors
The SPARK-55302 / SPARK-55619 bug fixes demonstrate how Spark's review culture resolves complex multi-part bugs. peter-toth identified that custom metrics were calculated incorrectly in KeyGroupedPartitioning (SPARK-55302) and coalesced partitions (SPARK-55619). The resolution involved a productive design disagreement:
- peter-toth proposed a `ThreadLocal`-based approach in #54396.
- viirya proposed an alternative using `ConcurrentHashMap` in #54399, adding peter-toth as co-author.
- peter-toth accepted: "Thanks @viirya, I'm ok with using ConcurrentHashMap instead of ThreadLocal" (#54396 comment).
- dongjoon-hyun merged the viirya approach, confirming CI failures were unrelated.
The fix was backported across three release branches (4.1, 4.0, 3.5). This is cooperative debugging: one finds the bug, another proposes a cleaner solution, a third facilitates the merge.
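Neither patch is described in detail here, so the following is only a loose Python analogy of the general trade-off between the two ideas, not Spark code. Thread-local state is visible only to the thread that wrote it, so values recorded on one thread must be handed over explicitly before another thread can read them. A shared map keyed by an explicit identifier does not depend on which thread performs the update. The `SharedMetrics` class, the partition ids, and the `run` helper are all hypothetical.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class SharedMetrics:
    """Lock-guarded map keyed by partition id; a rough analogue of a concurrent map."""

    def __init__(self):
        self._lock = threading.Lock()
        self._totals = defaultdict(int)

    def add(self, partition_id, rows):
        # The lock makes the read-modify-write atomic across worker threads.
        with self._lock:
            self._totals[partition_id] += rows

    def snapshot(self):
        with self._lock:
            return dict(self._totals)


def run(tasks, workers=4):
    """Apply (partition_id, rows) updates from a thread pool and return the totals."""
    metrics = SharedMetrics()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(lambda task: metrics.add(*task), tasks))
    return metrics.snapshot()


totals = run([(0, 10), (1, 5), (0, 7), (1, 1), (2, 3)])
```

The totals come out the same regardless of how the pool schedules the updates, which is the property a per-thread accumulator cannot offer without an extra merge step.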
3. Geospatial Type System: New Subsystem Taking Shape
uros-db opened 4 PRs introducing geospatial support. cloud-fan probed the core design in #54040: "why geo type can't support lazy decoding?" uros-db explained the WKBConverterStrategy approach. Additional PRs: #54325 (Hive/Thrift, 19 comments), #54328 (SRID validation), #54333 (catalyst types), #54331 (type casting).
4. PySpark and Pandas 3 Compatibility
Yicong-Huang led a serialization refactor (#54125, 62 review comments, 37 rounds). ueshin handled pandas 3 copy-on-write (CoW) changes (#54375). gaogaotiantian contributed 12 stewardship PRs.
5. cloud-fan: 81 Reviews, 1 PR
Net reviewer ratio +76. Shaped holdenk's SQL optimization (#46143, 31 reviews), uros-db's geo PRs (14 reviews), ksbeyer's DataTypeUtils (#52016, 13 reviews). Jira: 622 assigned, 3,453 comments.
6. dongjoon-hyun: 15 PRs, 68 Reviews
All 15 authored PRs were infra or test changes. He reviewed PRs from 29 unique authors, the widest breadth of any reviewer. Jira: 1,082 assigned, 20,787 status changes.
7. Notable Technical Discussions
MaxBy/MinBy K-element (#54134 by AlSchlo, +1,850/-12). BLAS Configuration (#49986 by pan3793, complexity 0.633). Long-running #46143 (holdenk, 58 review rounds).
8. Stewardship Work
46.6% of PRs (76 of 163) involved maintenance. Top: gaogaotiantian (12), dongjoon-hyun (11), peter-toth (5), pan3793 (5), Yicong-Huang (5).
PR Subsystem Distribution
| Subsystem | PR Count | % |
|---|---|---|
| SQL | 55 | 33.7% |
| Python/PySpark | 49 | 30.1% |
| Streaming | 14 | 8.6% |
| Infra/Build | 12 | 7.4% |
| K8s | 7 | 4.3% |
| Geo | 5 | 3.1% |
| Other | 21 | 12.9% |
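The percentage column can be recomputed from the counts. A minimal sketch, assuming each share is rounded independently to one decimal place (the `shares` helper is illustrative, not the tool that produced this table):

```python
def shares(counts):
    """Return each category's share of the total, as a percentage rounded to 1 decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# PR counts copied from the table above.
SUBSYSTEM_COUNTS = {
    "SQL": 55,
    "Python/PySpark": 49,
    "Streaming": 14,
    "Infra/Build": 12,
    "K8s": 7,
    "Geo": 5,
    "Other": 21,
}

subsystem_shares = shares(SUBSYSTEM_COUNTS)
```

Shares rounded one at a time do not always sum to exactly 100.0, so a small residual in the total is expected rather than a sign of missing PRs.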
Complexity Analysis
70 PRs scored: 2 high (>=0.6), 20 medium, 48 low. Comments: 83 PROBING (10.5%), 629 DIRECTING (79.9%), 75 POLISHING (9.5%). High-complexity: #49986 (pan3793, 0.633) and #54277 (liviazhu, 0.633).
Jira Cross-Reference
30-day window: 18,609 issues, 3,171 contributors. dongjoon-hyun: 20,787 status changes (6x next). 161/163 PRs reference SPARK tickets.
Three-View Ranking
- Complexity: HeartSaVioR (0.506 average across 3 PRs), liviazhu (0.633), pan3793 (0.633), dichlorodiphen (0.550), casgie (0.511).
- Stewardship: gaogaotiantian (12), dongjoon-hyun (11), peter-toth (5), pan3793 (5), Yicong-Huang (5).
- Review Depth: cloud-fan (81, +76 net), anishshri-db (44, +44), dongjoon-hyun (68, +52).
Dashboard vs. Reality
| Dashboard | Reality |
|---|---|
| 0 PRs merged | 163 active via rebase-and-close |
| cloud-fan: 1 PR | 81 reviews, 76 comments, most influential |
| anishshri-db: 0 PRs | 44 reviews, streaming gatekeeper |
| HeartSaVioR: 0 opened | 3 PRs (+2,444/-393), highest complexity |
| dongjoon-hyun: 15 PRs | All infra; 68 reviews, 20,787 Jira changes |
| peter-toth: 6 PRs | Found bug, deferred to viirya's fix, 3 backports |