[LLHD] Run Mem2Reg per slot to fix cubic scaling#10321
Open
fabianschuiki wants to merge 1 commit into main
Reapply commit bcc1685 with an additional fix to make the block-entry merge logic monotone. The previous attempt was reverted in 82ec37f because it caused Mem2Reg to hang on cyclic CFGs: `mergeFlavor` could return either the unique non-null predecessor def (`common`) or a cached merge def, and across iterations the value at an entry would flip between the two as back-edge state propagated. Make the merge sticky by always returning the cached merge def once it has been created, so each block entry moves through null -> common -> merged at most once and the fixpoint terminates.

The lattice used to track sets and maps of all slots in the region at every program point, and every propagation update copied and compared the full state. Composing three O(N) factors yielded roughly O(N^3) total work, with a 1000-signal stress test taking on the order of two minutes.

Run the analysis once per slot instead, creating a separate tiny lattice that focuses only on interactions with that single slot. State in the LatticeValue collapses to a single `needed` flag and a pair of reaching-def pointers (one for the blocking flavor, one for the delayed flavor of assignments), so every propagation update is O(1). Block-entry merge tracking, inserted-probe tracking, and the loops in `insertProbes`, `insertDrives`, and `insertBlockArgs` simplify correspondingly. Cross-slot state shrinks to a small cache of the `llhd.constant_time` ops inserted at block terminators, so we don't end up with one constant op per promoted slot.

After the change the 1000-signal stress test runs in around 700 ms -- roughly two orders of magnitude faster. Add four scaling stress tests of increasing size to guard against regressions.

Fixes #10314.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
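To make the "sticky merge" idea concrete, here is a minimal Python sketch of the null -> common -> merged progression described above. It is not the code from the pass: `EntryMerger`, the string-valued defs, the block name `bb1`, and the predecessor snapshots are all hypothetical stand-ins, and the snapshots are contrived inputs rather than a trace from a real CFG. The sketch only illustrates why returning the cached merge def unconditionally keeps the value at a block entry from moving backwards.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in: a reaching definition is just a string label here.
Def = str


class EntryMerger:
    """Toy model of a block-entry merge for a single slot and flavor."""

    def __init__(self, sticky: bool) -> None:
        self.sticky = sticky
        # Cached merge def per block, created on first disagreement.
        self.merge_defs: Dict[str, Def] = {}

    def merge(self, block: str, pred_defs: List[Optional[Def]]) -> Optional[Def]:
        # Sticky variant: once a merge def exists, always return it.
        if self.sticky and block in self.merge_defs:
            return self.merge_defs[block]
        non_null = [d for d in pred_defs if d is not None]
        if not non_null:
            return None  # "null": nothing reaches this entry yet
        if all(d == non_null[0] for d in non_null):
            return non_null[0]  # "common": unique non-null predecessor def
        # "merged": predecessors disagree, so create or reuse the merge def.
        return self.merge_defs.setdefault(block, f"merge({block})")


def trace(sticky: bool, snapshots: List[List[Optional[Def]]]) -> List[Optional[Def]]:
    """Feed successive predecessor states to one block entry."""
    merger = EntryMerger(sticky)
    return [merger.merge("bb1", preds) for preds in snapshots]


# Contrived predecessor states across fixpoint iterations (an assumption,
# chosen so that the predecessors agree again after having disagreed).
SNAPSHOTS: List[List[Optional[Def]]] = [
    [None, None],
    ["d0", None],
    ["d0", "d1"],
    ["d0", "d0"],
]

flipping = trace(False, SNAPSHOTS)
monotone = trace(True, SNAPSHOTS)
print(flipping)
print(monotone)
```

In the non-sticky run the entry value falls back from the merge def to `d0` on the last snapshot, which is the kind of backward step that can keep a worklist busy on a cyclic CFG. In the sticky run the entry stays at the merge def once it has been created, so it changes at most twice.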
Member
Kind of an extreme nit, but could you structure this PR as two revert commits and a patch commit on top of it?
Member
@jpienaar: Could you test that this fixes the internal hang in your flow?
Contributor
Author
That's what I initially tried. But the second of the two reverted commits fully supersedes the first one. So this is more like a re-implementation that reaches the same end point as the two commits combined.
Member
Sorry for the delay, will try tonight.
Member
Yes, it looks like it resolved the previous issue we ran into.
Assisted-by: Claude Opus 4.7