rustfft-6.2.0/.cargo_vcs_info.json0000644000000001360000000000100125360ustar { "git": { "sha1": "c720147135788798f28e70be7fbfeefed294b54f" }, "path_in_vcs": "" }rustfft-6.2.0/.github/workflows/run_test.yml000064400000000000000000000071150072674642500173450ustar 00000000000000on: [pull_request] name: CI jobs: check: name: Check+Test default features runs-on: ubuntu-latest strategy: matrix: rust: - stable - beta - nightly - 1.61 steps: - name: Checkout sources uses: actions/checkout@v3 - name: Install toolchain uses: dtolnay/rust-toolchain@master with: toolchain: ${{ matrix.rust }} - name: Run cargo check run: cargo check - name: Run cargo test run: cargo test fmt: name: Rustfmt runs-on: ubuntu-latest steps: - name: Checkout sources uses: actions/checkout@v3 - name: Install toolchain uses: dtolnay/rust-toolchain@nightly with: components: rustfmt - name: Print rustfmt version run: cargo fmt -- --version - name: Run cargo fmt run: cargo fmt -- --check check_no_features: name: Check+Test no features runs-on: ubuntu-latest strategy: matrix: rust: - stable - beta - nightly - 1.61 steps: - name: Checkout sources uses: actions/checkout@v3 - name: Install toolchain uses: dtolnay/rust-toolchain@master with: toolchain: ${{ matrix.rust }} - name: Run cargo check run: cargo check --no-default-features - name: Run cargo test run: cargo test --no-default-features check_arm64_neon: name: Check and test Linux arm 64bit with neon runs-on: ubuntu-latest strategy: matrix: rust: - stable - beta - nightly - 1.61 steps: - name: Checkout sources uses: actions/checkout@v3 - name: Install toolchain uses: dtolnay/rust-toolchain@master with: toolchain: ${{ matrix.rust }} targets: aarch64-unknown-linux-gnu - name: Install cross run: cargo install cross --version 0.2.5 --locked - name: Run cargo check run: cross check --features neon --target aarch64-unknown-linux-gnu - name: Run cargo test for arm run: cross test --release --features neon --target aarch64-unknown-linux-gnu check_x86: name: Check and test Linux x86 32bit runs-on: ubuntu-latest steps: - name: Checkout sources uses: actions/checkout@v3 - name: Install stable toolchain uses: dtolnay/rust-toolchain@stable with: targets: i586-unknown-linux-gnu - name: Install cross run: cargo install cross --version 0.2.5 --locked - name: Run cargo check run: cross check --target i586-unknown-linux-gnu - name: Run cargo test for i586 run: cross test --target i586-unknown-linux-gnu check_wasm32: name: Check and test WebAssembly with SIMD runs-on: ubuntu-latest strategy: matrix: rust: - stable - beta - nightly - 1.61 steps: - name: Checkout sources uses: actions/checkout@v3 - name: Install toolchain uses: dtolnay/rust-toolchain@master with: toolchain: ${{ matrix.rust }} targets: wasm32-unknown-unknown - name: Install wasm-pack uses: jetli/wasm-pack-action@v0.4.0 with: version: "latest" - name: Run test suites with wasm-pack run: wasm-pack test --node -- --features wasm_simd rustfft-6.2.0/.gitignore000064400000000000000000000000470072674642500133470ustar 00000000000000/target /Cargo.lock *.swp .vscode/ rustfft-6.2.0/CHANGELOG.md000064400000000000000000000141420072674642500131710ustar 00000000000000## [6.2] Released 22nd January 2024 ### Minimum Rustc Version - The MSRV for RustFFT is now 1.61.0 ### Added - Implemented a code path for SIMD-optimized FFTs on WASM targets (Thanks to @pr1metine) (#120) ### Fixed - Fixed pointer aliasing causing unsoundness and miri check failures (#113) - Fixed computation of size-1 FFTs (#119) - Fixed readme type (#121) ## [6.1] Released 7th 
November 2022 ### Added - Implemented a code path for Neon-optimized FFTs on AArch64 (Thanks to Henrik Enquist!) (#84 and #78) ### Changed - Improved performance of power-of-3 FFTs when not using SIMD-accelerated code paths (#80) - Reduced memory usage for some FFT sizes (#81) ## [6.0.1] Released 10 May 2021 ### Fixed - Fixed a compile-time divide by zero error on nightly Rust in `stdarch\crates\core_arch\src\macros.rs` (#75) - Increased the minimum version of `strength_reduce` to 0.2.3 ## [6.0.0] Released 16 April 2021 ### Breaking Changes - Increased the version of the num-complex dependency to 0.4. - This is a breaking change because we have a public dependency on num-complex. - See the [num-complex changelog](https://github.com/rust-num/num-complex/blob/master/RELEASES.md) for a list of breaking changes in num-complex 0.4 - As a high-level summary, most users will not need to do anything to upgrade to RustFFT 6.0: num-complex 0.4 re-exports a newer version of `rand`, and that's num-complex's only documented breaking change. ## [5.1.1] Released 10 May 2021 ### Fixed - Fixed a compile-time divide by zero error on nightly Rust in `stdarch\crates\core_arch\src\macros.rs` (Backported from v6.0.1) - Increased the minimum version of `strength_reduce` to 0.2.3 (Backported from v6.0.1) ## [5.1.0] Released 16 April 2021 ### Added - Implemented a code path for SSE-optimized FFTs (Thanks to Henrik Enquist!) (#60) - Plan a FFT using the `FftPlanner` (or the new `FftPlannerSse`) on a machine that supports SSE4.1 (but not AVX) and you'll see a 2-3x performance improvement over the default scalar code. ### Fixed - Fixed underflow when planning an AVX FFT of size zero (#56) - Fixed the FFT planner not being Send, due to internal use of Rc<> (#55) - Fixed typo in documentation (#54) - Slightly improved numerical precision of Rader's Algorithm and Bluestein's Algorithm (#66, #68) - Minor optimizations to Rader's Algorithm and Bluestein's Algorithm (#59) - Minor optimizations to MixedRadix setup time (#57) - Optimized performance of Radix4 (#65) ## [5.0.1] Released 8 January 2021 ### Fixed - Fixed the FFT planner not choosing an obviously faster plan in some rare cases (#46) - Documentation fixes and clarifications (#47, #48, #51) ## [5.0.0] Released 4 January 2021 ### Breaking Changes - Several breaking changes. See the [Upgrade Guide](/UpgradeGuide4to5.md) for details. ### Added - Added support for the `Avx` instruction set. Plan a FFT with the `FftPlanner` on a machine that supports AVX, and you'll get a 5x-10x speedup in FFT performance. ### Changed - Even though the main focus of this release is on AVX, most users should see moderate performance improvements due to a new internal architecture that reduces the amount of internal copies required when computing a FFT. ## [4.1.0] Released 24 December 2020 ### Added - Added a blanket impl of `FftNum` to any type that implements the required traits (#7) - Added butterflies for many prime sizes, up to 31, and optimized the size-3, size-5, and size-7 butterflies (#10) - Added an implementation of Bluestein's Algorithm (#6) ### Changed - Improved the performance of GoodThomasAlgorithm re-indexing (#20) ## [4.0.0] Released 8 October 2020 This release moved the home repository of RustFFT from https://github.com/awelkie/RustFFT to https://github.com/ejmahler/RustFFT ### Breaking Changes - Increased the version of the num-complex dependency to 0.3. This is a breaking change because we have a public dependency on num-complex. 
See the [num-complex changelog](https://github.com/rust-num/num-complex/blob/master/RELEASES.md) for a list of breaking changes in num-complex 0.3. - Increased the minimum required Rust version from 1.26 to 1.31. This was required by the upgrade to num-complex 0.3. ## [3.0.1] Released 27 December 2019 ### Fixed - Fixed warnings regarding "dyn trait", and warnings regarding inclusive ranges - Several documentation improvements ## [3.0.0] Released 4 January 2019 ### Changed - Reduced the setup time and memory usage of GoodThomasAlgorithm - Reduced the setup time and memory usage of RadersAlgorithm ### Breaking Changes - Documented the minimum rustc version. Before, none was specified. Now, it's 1.26. Further increases to the minimum version will be a breaking change. - Increased the version of the num-complex dependency to 0.2. This is a breaking change because we have a public dependency on num-complex. See the [num-complex changelog](https://github.com/rust-num/num-complex/blob/master/RELEASES.md) for a list of breaking changes in num-complex 0.2 ## [2.1.0] Released 30 July 2018 ### Added - Added a specialized implementation of Good Thomas Algorithm for when both inner FFTs are butterflies ### Changed - Documentation typo fixes - Increased minimum version of num_traits and num_complex. Notably, Complex is now guaranteed to be repr(C) - Significantly improved the performance of the Radix4 algorithm - Reduced memory usage of prime-sized FFTs - Incorporated the Good-Thomas Double Butterfly algorithm into the planner, improving performance for most composite and prime FFTs ## [2.0.0] Released 22 May 2017 ### Added - Added implementation of Good Thomas algorithm. - Added implementation of Raders algorithm. - Added implementation of Radix algorithm for power-of-two lengths. - Added `FFTPlanner` to choose the fastest algorithm for a given size. ### Changed - Changed API to take the "signal" as mutable and use it for scratch space. ## [1.0.1] Released 15 January 2016 ### Changed - Relicensed to dual MIT/Apache-2.0. ## [1.0.0] Released 4 October 2015 ### Added - Added initial implementation of Cooley-Tukey. rustfft-6.2.0/Cargo.lock0000644000000207610000000000100105170ustar # This file is automatically @generated by Cargo. # It is not intended for manual editing. 
version = 3 [[package]] name = "autocfg" version = "1.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d468802bab17cbc0cc575e9b053f41e72aa36bfa6b7f55e3529ffa43161b97fa" [[package]] name = "bumpalo" version = "3.14.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7f30e7476521f6f8af1a1c4c0b8cc94f0bee37d91763d0ca2665f299b6cd8aec" [[package]] name = "cfg-if" version = "1.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd" [[package]] name = "console_error_panic_hook" version = "0.1.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a06aeb73f470f66dcdbf7223caeebb85984942f22f1adb2a088cf9668146bbbc" dependencies = [ "cfg-if", "wasm-bindgen", ] [[package]] name = "getrandom" version = "0.2.8" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c05aeb6a22b8f62540c194aac980f2115af067bfe15a0734d7277a768d396b31" dependencies = [ "cfg-if", "js-sys", "libc", "wasi", "wasm-bindgen", ] [[package]] name = "js-sys" version = "0.3.67" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9a1d36f1235bc969acba30b7f5990b864423a6068a10f7c90ae8f0112e3a59d1" dependencies = [ "wasm-bindgen", ] [[package]] name = "libc" version = "0.2.140" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "99227334921fae1a979cf0bfdfcc6b3e5ce376ef57e16fb6fb3ea2ed6095f80c" [[package]] name = "log" version = "0.4.20" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b5e6163cb8c49088c2c36f57875e58ccd8c87c7427f7fbd50ea6710b2f3f2e8f" [[package]] name = "num-complex" version = "0.4.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "02e0d21255c828d6f128a1e41534206671e8c3ea0c62f32291e808dc82cff17d" dependencies = [ "num-traits", ] [[package]] name = "num-integer" version = "0.1.45" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "225d3389fb3509a24c93f5c29eb6bde2586b98d9f016636dff58d7c6f7569cd9" dependencies = [ "autocfg", "num-traits", ] [[package]] name = "num-traits" version = "0.2.15" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "578ede34cf02f8924ab9447f50c28075b4d3e5b269972345e7e0372b38c6cdcd" dependencies = [ "autocfg", ] [[package]] name = "once_cell" version = "1.19.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3fdb12b2476b595f9358c5161aa467c2438859caa136dec86c26fdd2efe17b92" [[package]] name = "paste" version = "1.0.12" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9f746c4065a8fa3fe23974dd82f15431cc8d40779821001404d10d2e79ca7d79" [[package]] name = "ppv-lite86" version = "0.2.17" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5b40af805b3121feab8a3c29f04d8ad262fa8e0561883e7653e024ae4479e6de" [[package]] name = "primal-check" version = "0.3.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9df7f93fd637f083201473dab4fee2db4c429d32e55e3299980ab3957ab916a0" dependencies = [ "num-integer", ] [[package]] name = "proc-macro2" version = "1.0.78" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e2422ad645d89c99f8f3e6b88a9fdeca7fabeac836b1002371c4367c8f984aae" dependencies = [ "unicode-ident", ] [[package]] name = "quote" version = "1.0.35" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"291ec9ab5efd934aaf503a6466c5d5251535d108ee747472c3977cc5acc868ef" dependencies = [ "proc-macro2", ] [[package]] name = "rand" version = "0.8.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "34af8d1a0e25924bc5b7c43c079c942339d8f0a8b57c39049bef581b46327404" dependencies = [ "libc", "rand_chacha", "rand_core", ] [[package]] name = "rand_chacha" version = "0.3.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88" dependencies = [ "ppv-lite86", "rand_core", ] [[package]] name = "rand_core" version = "0.6.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ec0be4795e2f6a28069bec0b5ff3e2ac9bafc99e6a9a7dc3547996c5c816922c" dependencies = [ "getrandom", ] [[package]] name = "rustfft" version = "6.2.0" dependencies = [ "getrandom", "num-complex", "num-integer", "num-traits", "paste", "primal-check", "rand", "strength_reduce", "transpose", "version_check", "wasm-bindgen-test", ] [[package]] name = "scoped-tls" version = "1.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e1cf6437eb19a8f4a6cc0f7dca544973b0b78843adbfeb3683d1a94a0024a294" [[package]] name = "strength_reduce" version = "0.2.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "fe895eb47f22e2ddd4dabc02bce419d2e643c8e3b585c78158b349195bc24d82" [[package]] name = "syn" version = "2.0.48" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0f3531638e407dfc0814761abb7c00a5b54992b849452a0646b7f65c9f770f3f" dependencies = [ "proc-macro2", "quote", "unicode-ident", ] [[package]] name = "transpose" version = "0.2.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e6522d49d03727ffb138ae4cbc1283d3774f0d10aa7f9bf52e6784c45daf9b23" dependencies = [ "num-integer", "strength_reduce", ] [[package]] name = "unicode-ident" version = "1.0.12" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3354b9ac3fae1ff6755cb6db53683adb661634f67557942dea4facebec0fee4b" [[package]] name = "version_check" version = "0.9.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "49874b5167b65d7193b8aba1567f5c7d93d001cafc34600cee003eda787e483f" [[package]] name = "wasi" version = "0.11.0+wasi-snapshot-preview1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9c8d87e72b64a3b4db28d11ce29237c246188f4f51057d65a7eab63b7987e423" [[package]] name = "wasm-bindgen" version = "0.2.90" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b1223296a201415c7fad14792dbefaace9bd52b62d33453ade1c5b5f07555406" dependencies = [ "cfg-if", "wasm-bindgen-macro", ] [[package]] name = "wasm-bindgen-backend" version = "0.2.90" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "fcdc935b63408d58a32f8cc9738a0bffd8f05cc7c002086c6ef20b7312ad9dcd" dependencies = [ "bumpalo", "log", "once_cell", "proc-macro2", "quote", "syn", "wasm-bindgen-shared", ] [[package]] name = "wasm-bindgen-futures" version = "0.4.40" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "bde2032aeb86bdfaecc8b261eef3cba735cc426c1f3a3416d1e0791be95fc461" dependencies = [ "cfg-if", "js-sys", "wasm-bindgen", "web-sys", ] [[package]] name = "wasm-bindgen-macro" version = "0.2.90" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3e4c238561b2d428924c49815533a8b9121c664599558a5d9ec51f8a1740a999" 
dependencies = [ "quote", "wasm-bindgen-macro-support", ] [[package]] name = "wasm-bindgen-macro-support" version = "0.2.90" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "bae1abb6806dc1ad9e560ed242107c0f6c84335f1749dd4e8ddb012ebd5e25a7" dependencies = [ "proc-macro2", "quote", "syn", "wasm-bindgen-backend", "wasm-bindgen-shared", ] [[package]] name = "wasm-bindgen-shared" version = "0.2.90" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "4d91413b1c31d7539ba5ef2451af3f0b833a005eb27a631cec32bc0635a8602b" [[package]] name = "wasm-bindgen-test" version = "0.3.40" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "139bd73305d50e1c1c4333210c0db43d989395b64a237bd35c10ef3832a7f70c" dependencies = [ "console_error_panic_hook", "js-sys", "scoped-tls", "wasm-bindgen", "wasm-bindgen-futures", "wasm-bindgen-test-macro", ] [[package]] name = "wasm-bindgen-test-macro" version = "0.3.40" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "70072aebfe5da66d2716002c729a14e4aec4da0e23cc2ea66323dac541c93928" dependencies = [ "proc-macro2", "quote", "syn", ] [[package]] name = "web-sys" version = "0.3.67" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "58cd2333b6e0be7a39605f0e255892fd7418a682d8da8fe042fe25128794d2ed" dependencies = [ "js-sys", "wasm-bindgen", ] rustfft-6.2.0/Cargo.toml0000644000000032010000000000100105300ustar # THIS FILE IS AUTOMATICALLY GENERATED BY CARGO # # When uploading crates to the registry Cargo will automatically # "normalize" Cargo.toml files for maximal compatibility # with all versions of Cargo and also rewrite `path` dependencies # to registry (e.g., crates.io) dependencies. # # If you are reading this file be aware that the original Cargo.toml # will likely look very different (and much more reasonable). # See Cargo.toml.orig for the original contents. [package] edition = "2018" name = "rustfft" version = "6.2.0" authors = [ "Allen Welkie ", "Elliott Mahler ", ] description = "High-performance FFT library written in pure Rust." documentation = "https://docs.rs/rustfft/" readme = "README.md" keywords = [ "fft", "dft", "discrete", "fourier", "transform", ] categories = [ "algorithms", "compression", "multimedia::encoding", "science", ] license = "MIT OR Apache-2.0" repository = "https://github.com/ejmahler/RustFFT" [dependencies.num-complex] version = "0.4" [dependencies.num-integer] version = "^0.1.40" [dependencies.num-traits] version = "0.2" [dependencies.primal-check] version = "0.3.3" [dependencies.strength_reduce] version = "0.2.4" [dependencies.transpose] version = "0.2" [dev-dependencies.getrandom] version = "^0.2" features = ["js"] [dev-dependencies.paste] version = "1.0.9" [dev-dependencies.rand] version = "0.8" [dev-dependencies.wasm-bindgen-test] version = "^0.3.36" [build-dependencies.version_check] version = "0.9" [features] avx = [] default = [ "avx", "sse", "neon", ] neon = [] sse = [] wasm_simd = [] rustfft-6.2.0/Cargo.toml.orig000064400000000000000000000033520072674642500142500ustar 00000000000000[package] name = "rustfft" version = "6.2.0" authors = ["Allen Welkie ", "Elliott Mahler "] edition = "2018" description = "High-performance FFT library written in pure Rust." 
documentation = "https://docs.rs/rustfft/" repository = "https://github.com/ejmahler/RustFFT" keywords = ["fft", "dft", "discrete", "fourier", "transform"] categories = ["algorithms", "compression", "multimedia::encoding", "science"] license = "MIT OR Apache-2.0" [features] default = ["avx", "sse", "neon"] # On x86_64, the "avx" feature enables compilation of AVX-acclerated code. # Similarly, the "sse" feature enables compilation of SSE-accelerated code. # Enabling these improves performance if the client CPU supports AVX or SSE, while disabling them reduces compile time and binary size. # If both are enabled, RustFFT will use AVX if the CPU supports it. If not, it will check for SSE4.1. # If neither instruction set is available, it will fall back to the scalar code. # # On AArch64, the "neon" feature enables compilation of Neon-accelerated code. # # On wasm32, the "wasm_simd" feature enables compilation of Wasm SIMD accelerated code. # # For all of the above features, on every platform other than the intended platform for the feature, these features do nothing, and RustFFT will behave like they are not set. avx = [] sse = [] neon = [] wasm_simd = [] [dependencies] num-complex = "0.4" num-traits = "0.2" num-integer = "^0.1.40" strength_reduce = "0.2.4" transpose = "0.2" primal-check = "0.3.3" [dev-dependencies] rand = "0.8" paste = "1.0.9" getrandom = {version = "^0.2", features = ["js"]} wasm-bindgen-test = "^0.3.36" [build-dependencies] version_check = "0.9" rustfft-6.2.0/LICENSE-APACHE000064400000000000000000000264500072674642500133110ustar 00000000000000 Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. 
For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. 
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "{}" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright {yyyy} {name of copyright owner} Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. rustfft-6.2.0/LICENSE-MIT000064400000000000000000000020750072674642500130160ustar 00000000000000Copyright (c) 2015 The RustFFT Developers Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. rustfft-6.2.0/README.md000064400000000000000000000133440072674642500126420ustar 00000000000000# RustFFT [![CI](https://github.com/ejmahler/RustFFT/workflows/CI/badge.svg)](https://github.com/ejmahler/RustFFT/actions?query=workflow%3ACI) [![](https://img.shields.io/crates/v/rustfft.svg)](https://crates.io/crates/rustfft) [![](https://img.shields.io/crates/l/rustfft.svg)](https://crates.io/crates/rustfft) [![](https://docs.rs/rustfft/badge.svg)](https://docs.rs/rustfft/) ![minimum rustc 1.61](https://img.shields.io/badge/rustc-1.61+-red.svg) RustFFT is a high-performance, SIMD-accelerated FFT library written in pure Rust. It can compute FFTs of any size, including prime-number sizes, in O(nlogn) time. ## Usage ```rust // Perform a forward FFT of size 1234 use rustfft::{FftPlanner, num_complex::Complex}; let mut planner = FftPlanner::<f32>::new(); let fft = planner.plan_fft_forward(1234); let mut buffer = vec![Complex{ re: 0.0, im: 0.0 }; 1234]; fft.process(&mut buffer); ``` ## SIMD acceleration ### x86_64 Targets RustFFT supports the AVX instruction set for increased performance. No special code is needed to activate AVX: Simply plan a FFT using the `FftPlanner` on a machine that supports the `avx` and `fma` CPU features, and RustFFT will automatically switch to faster AVX-accelerated algorithms. For machines that do not have AVX, RustFFT also supports the SSE4.1 instruction set. As for AVX, this is enabled automatically when using the FftPlanner. If both AVX and SSE4.1 support are enabled, the planner will automatically choose the fastest available instruction set. ### AArch64 Targets RustFFT supports the NEON instruction set in 64-bit Arm, AArch64. As with AVX and SSE, no special code is needed to activate NEON-accelerated code paths: Simply plan a FFT using the `FftPlanner` on an AArch64 target, and RustFFT will automatically switch to faster NEON-accelerated algorithms. ### WebAssembly Targets RustFFT supports [the fixed-width SIMD extension for WebAssembly](https://github.com/WebAssembly/spec/blob/main/proposals/simd/SIMD.md). Just like AVX, SSE, and NEON, no special code is needed to take advantage of this code path: All you need to do is plan a FFT using the `FftPlanner`. **Note:** There is an important caveat when compiling WASM SIMD accelerated code: Unlike AVX, SSE, and NEON, WASM does not allow dynamic feature detection. Because of this limitation, RustFFT **cannot** detect CPU features and automatically switch to WASM SIMD accelerated algorithms. Instead, it unconditionally uses the SIMD code path if the `wasm_simd` crate feature is enabled. Read more about this limitation [in the official Rust docs](https://doc.rust-lang.org/1.75.0/core/arch/wasm32/index.html#simd). ## Feature Flags ### x86_64 Targets The features `avx` and `sse` are enabled by default. On x86_64, these features enable compilation of AVX and SSE accelerated code. Disabling them reduces compile time and binary size. On other platforms than x86_64, these features do nothing and RustFFT will behave like they are not set. ### AArch64 Targets The `neon` feature is enabled by default. On AArch64, this feature enables compilation of Neon-accelerated code. Disabling it reduces compile time and binary size.
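For example, a downstream crate that only ships on x86_64 might turn off the default features and re-enable just the code paths it needs. The dependency entry below is a hypothetical sketch (the version and feature names come from this crate's Cargo.toml, but the snippet itself is not part of the README):

```toml
[dependencies]
# Opt out of all SIMD code paths for the smallest build and fastest compile:
# rustfft = { version = "6.2", default-features = false }

# Or keep only the x86_64 code paths, dropping `neon` and leaving `wasm_simd` off:
rustfft = { version = "6.2", default-features = false, features = ["avx", "sse"] }
```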
On other platforms than AArch64, this feature does nothing and RustFFT will behave like it is not set. ### WebAssembly Targets The feature `wasm_simd` is disabled by default. On the WASM platform, this feature enables compilation of WASM SIMD accelerated code. To execute binaries compiled with `wasm_simd`, you need a [target browser or runtime which supports `fixed-width SIMD`](https://webassembly.org/roadmap/). If you run your SIMD accelerated code on an unsupported platform, WebAssembly will specify a [trap](https://webassembly.github.io/spec/core/intro/overview.html#trap) leading to immediate execution cancelation. On other platforms than WASM, this feature does nothing and RustFFT will behave like it is not set. ## Stability/Future Breaking Changes The latest version is 6.2. Version 5.0 was released at the beginning of 2021 and contains several breaking API changes from previous versions. For users on very old versions of RustFFT, check out the [Upgrade Guide](/UpgradeGuide4to5.md) for a walkthrough of the changes RustFFT 5.0 requires to upgrade. In the interest of stability, we're committing to making no more breaking changes for 3 years, aka until 2024. This policy has one exception: We currently re-export pre-1.0 versions of the [num-complex](https://crates.io/crates/num-complex) and [num-traits](https://crates.io/crates/num-traits) crates. In the interest of avoiding ecosystem fragmentation, we will keep up with these crates even if it requires major version bumps. When those crates release new major versions, we will upgrade as soon as possible, which will require a major version change of our own. In these situations, the version increase of num-complex/num-traits will be the only breaking change in the release. ### Supported Rust Version RustFFT requires rustc 1.61 or newer. Minor releases of RustFFT may upgrade the MSRV (minimum supported Rust version) to a newer version of rustc. However, if we need to increase the MSRV, the new Rust version must have been released at least six months ago. ## License Licensed under either of - Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0) - MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT) at your option. ### Contribution Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions. Before submitting a PR, please make sure to run `cargo fmt`. rustfft-6.2.0/UpgradeGuide4to5.md000064400000000000000000000170550072674642500147700ustar 00000000000000# 4.0 to 5.0 Upgrade Guide RustFFT 5.0 has several breaking changes compared to 4.0. This document will guide users through the upgrade process, explaining each breaking change and how to upgrade code to fit the new style. Each section is ordered by how likely they are to impact you: Things at the top are likely to affect every user of RustFFT, while things at the bottom are unlikely to affect most users. ## Renaming Structs Several structs and traits in RustFFT were renamed to follow the [Rust API guidelines](https://rust-lang.github.io/api-guidelines/naming.html) regarding acronyms: > In UpperCamelCase, acronyms and contractions of compound words count as one word: use Uuid rather than UUID, Usize rather than USize or Stdin rather than StdIn.
The following were renamed in RustFFT 5.0 to conform to this style: * The `FFT` trait was renamed to `Fft` * The `FFTnum` trait was renamed to `FftNum` * The `FFTplanner` struct was renamed to `FftPlanner` * The `DFT` struct was renamed to `Dft` ## FFT Direction In RustFFT 4.0, forward FFTs vs inverse FFTs were specified by a boolean. For example, the 4.0 `FFTplanner` constructor expects a boolean parameter for direction: If you pass `false`, the planner will plan forward FFTs. If you pass `true`, the planner will plan inverse FFTs. In 5.0, there is a new `FftDirection` enum with `Forward` and `Inverse` variants. FFT algorithms that took a `bool` for direction now take `FftDirection` instead. For example, if you were constructing a `Radix4` instance to compute a power-of-two FFT, you will have to change the parameters to the constructor: ```rust // RustFFT 4.0 let fft_forward = Radix4::new(4096, false); let fft_inverse = Radix4::new(4096, true); // RustFFT 5.0 let fft_forward = Radix4::new(4096, FftDirection::Forward); let fft_inverse = Radix4::new(4096, FftDirection::Inverse); ``` A few traits and methods were renamed to support the new `FftDirection` enum: * The `IsInverse` trait was renamed to `Direction` * `IsInverse`'s only method, `is_inverse(&self) -> bool`, was renamed to `fft_direction(&self) -> FftDirection` * The `Fft` trait inherited the `IsInverse` trait, and now inherits the `Direction` trait instead, so if you were calling `is_inverse()` on a `Fft` instance you'll have to change that to `fft_direction()` as well. Finally, the way the `FftPlanner` handles forward vs inverse FFTs has changed. In 4.0, the `FFTplanner` took a direction in its constructor -- In 5.0, its constructor is empty, and it takes a direction in its `plan_fft` method. This means a single planner can be used to plan both forward and inverse FFTs. The `FftPlanner` also has `plan_fft_forward` and `plan_fft_inverse` convenience methods so that you don't have to import the FftDirection enum. ```rust // RustFFT 4.0 let mut planner_forward = FFTplanner::new(false); let fft_forward = planner_forward.plan_fft(1234); let mut planner_inverse = FFTplanner::new(true); let fft_inverse = planner_inverse.plan_fft(1234); // RustFFT 5.0 let mut planner = FftPlanner::new(); let fft_forward1 = planner.plan_fft(1234, FftDirection::Forward); let fft_forward2 = planner.plan_fft_forward(1234); let fft_inverse1 = planner.plan_fft(1234, FftDirection::Inverse); let fft_inverse2 = planner.plan_fft_inverse(1234); ``` ## Fft Trait Methods In RustFFT 4.0, the `Fft` trait has two methods: * `FFT::process()` took an input and output buffer, and computed a single FFT, storing the result in the output buffer. * `FFT::process_multi()` took an input and output buffer, divided those buffers into chunks of size `fft.len()`, and computed a FFT on each chunk. RustFFT 5.0 makes a few changes to this setup. First, there is no longer a distinction between "single" and "multi" FFT methods: All `Fft` trait methods compute multiple FFTs if provided a buffer whose length is a multiple of the FFT length. Second, the `Fft` trait now has three methods. Most users will want the first method: 1. `Fft::process()` takes a single buffer instead of two, and computes FFTs in-place. Internally, it allocates scratch space as needed. 1. `Fft::process_with_scratch()` takes two buffers: A data buffer, and a scratch buffer. It computes FFTs in-place on the data buffer, using the provided scratch space as needed. 1. 
`Fft::process_outofplace_with_scratch()` takes three buffers: An input buffer, an output buffer, and a scratch buffer. It computes FFTs from the input buffer and stores the results in the output buffer, using the provided scratch space as needed. Example for users who want to use the new in-place `process()` behavior: ```rust // RustFFT 4.0 let fft = Radix4::new(4096, false); let mut input : Vec> = get_my_input_data(); let mut output = vec![Complex::zero(); fft.len()]; fft.process(&mut input, &mut output); // RustFFT 5.0 let fft = Radix4::new(4096, FftDirection::Forward); let mut buffer : Vec> = get_my_input_data(); fft.process(&mut buffer); ``` Example for users who want to keep the old out-of-place `process()` behavior from RustFFT 4.0: ```rust // RustFFT 4.0 let fft = Radix4::new(4096, false); let mut input : Vec> = get_my_input_data(); let mut output = vec![Complex::zero(); fft.len()]; fft.process(&mut input, &mut output); // RustFFT 5.0 let fft = Radix4::new(4096, FftDirection::Forward); let mut input : Vec> = get_my_input_data(); let mut output = vec![Complex::zero(); fft.len()]; let mut scratch = vec![Complex::zero(); fft.get_outofplace_scratch_len()]; fft.process_outofplace_with_scratch(&mut input, &mut output, &mut scratch); ``` ## Rader's Algorithm Constructor The constructor for `RadersAlgorithm` has changed. In RustFFT 4.0, its signature was `pub fn new(len: usize, inner_fft: Arc>)`, and it asserted that `len == inner_fft.len() + 1` RustFFT 5.0 removes the `len: usize` parameter, and `RadersAlgorithm` derives its FFT length from the inner FFT length instead. ```rust // RustFFT 4.0 let inner_fft : Arc> = ...; let fft = RadersAlgorithm::new(inner_fft.len() + 1, inner_fft); // RustFFT 5.0 let inner_fft : Arc> = ...; let fft = RadersAlgorithm::new(inner_fft); ``` ## Deleted the `FFTButterfly` trait In RustFFT 4.0, there was a trait called `FFTbutterfly`. This trait has been deleted. It had two methods which were merged into the `Fft` trait: * `FFTButterfly::process_inplace` is replaced by `Fft::process_inplace` or `Fft::process_inplace_with_scratch` * `FFTButterfly::process_multi_inplace` is replaced by `Fft::process_inplace_multi` Two FFT algorithms relied on the deleted trait: `MixedRadixDoubleButterfly` and `GoodThomasAlgorithmDoubleButterfly`. They took `FFTbutterfly` trait objects in their constructor. They've been renamed to `MixedRadixSmall` and `GoodThomasAlgorithmSmall` respectively, and take `Fft` trait objects in their constructor. 
```rust // RustFFT 4.0 let butterfly8 : Arc> = Arc::new(Butterfly8::new(false)); let butterfly3 : Arc> = Arc::new(Butterfly3::new(false)); let fft1 = MixedRadixDoubleButterfly::new(Arc::clone(&butterfly8), Arc::clone(&butterfly3)); let fft2 = GoodThomasAlgorithmDoubleButterfly::new(Arc::clone(&butterfly8), Arc::clone(&butterfly3)); // RustFFT 5.0 let butterfly8 : Arc> = Arc::new(Butterfly8::new(FftDirection::Forward)); let butterfly3 : Arc> = Arc::new(Butterfly3::new(FftDirection::Forward)); let fft1 = MixedRadixSmall::new(Arc::clone(&butterfly8), Arc::clone(&butterfly3)); let fft2 = GoodThomasAlgorithmSmall::new(Arc::clone(&butterfly8), Arc::clone(&butterfly3)); ``` rustfft-6.2.0/benches/bench_check_neon_2to1024.rs000064400000000000000000000253500072674642500176660ustar 00000000000000#![feature(test)] extern crate rustfft; extern crate test; use paste::paste; use rustfft::num_complex::Complex; use rustfft::num_traits::Zero; use rustfft::Fft; use std::sync::Arc; use test::Bencher; // Make fft using planner fn bench_planned_32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerNeon::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using planner fn bench_planned_64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerNeon::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Create benches using functions taking one argument macro_rules! make_benches { ($name:ident, $fname:ident, { $($len:literal),* }) => { paste! 
{ $( #[bench] fn [](b: &mut Bencher) { [](b, $len); } #[bench] fn [](b: &mut Bencher) { [](b, $len); } )* } } } make_benches!(from2to1024, planned, {2, 3, 4, 5, 6, 7, 8, 9 }); make_benches!(from2to1024, planned, {10, 11, 12, 13, 14, 15, 16, 17, 18, 19 }); make_benches!(from2to1024, planned, {20, 21, 22, 23, 24, 25, 26, 27, 28, 29 }); make_benches!(from2to1024, planned, {30, 31, 32, 33, 34, 35, 36, 37, 38, 39 }); make_benches!(from2to1024, planned, {40, 41, 42, 43, 44, 45, 46, 47, 48, 49 }); make_benches!(from2to1024, planned, {50, 51, 52, 53, 54, 55, 56, 57, 58, 59 }); make_benches!(from2to1024, planned, {60, 61, 62, 63, 64, 65, 66, 67, 68, 69 }); make_benches!(from2to1024, planned, {70, 71, 72, 73, 74, 75, 76, 77, 78, 79 }); make_benches!(from2to1024, planned, {80, 81, 82, 83, 84, 85, 86, 87, 88, 89 }); make_benches!(from2to1024, planned, {90, 91, 92, 93, 94, 95, 96, 97, 98, 99 }); make_benches!(from2to1024, planned, {100, 101, 102, 103, 104, 105, 106, 107, 108, 109 }); make_benches!(from2to1024, planned, {110, 111, 112, 113, 114, 115, 116, 117, 118, 119 }); make_benches!(from2to1024, planned, {120, 121, 122, 123, 124, 125, 126, 127, 128, 129 }); make_benches!(from2to1024, planned, {130, 131, 132, 133, 134, 135, 136, 137, 138, 139 }); make_benches!(from2to1024, planned, {140, 141, 142, 143, 144, 145, 146, 147, 148, 149 }); make_benches!(from2to1024, planned, {150, 151, 152, 153, 154, 155, 156, 157, 158, 159 }); make_benches!(from2to1024, planned, {160, 161, 162, 163, 164, 165, 166, 167, 168, 169 }); make_benches!(from2to1024, planned, {170, 171, 172, 173, 174, 175, 176, 177, 178, 179 }); make_benches!(from2to1024, planned, {180, 181, 182, 183, 184, 185, 186, 187, 188, 189 }); make_benches!(from2to1024, planned, {190, 191, 192, 193, 194, 195, 196, 197, 198, 199 }); /* make_benches!(from2to1024, planned, {200, 201, 202, 203, 204, 205, 206, 207, 208, 209 }); make_benches!(from2to1024, planned, {210, 211, 212, 213, 214, 215, 216, 217, 218, 219 }); make_benches!(from2to1024, planned, {220, 221, 222, 223, 224, 225, 226, 227, 228, 229 }); make_benches!(from2to1024, planned, {230, 231, 232, 233, 234, 235, 236, 237, 238, 239 }); make_benches!(from2to1024, planned, {240, 241, 242, 243, 244, 245, 246, 247, 248, 249 }); make_benches!(from2to1024, planned, {250, 251, 252, 253, 254, 255, 256, 257, 258, 259 }); make_benches!(from2to1024, planned, {260, 261, 262, 263, 264, 265, 266, 267, 268, 269 }); make_benches!(from2to1024, planned, {270, 271, 272, 273, 274, 275, 276, 277, 278, 279 }); make_benches!(from2to1024, planned, {280, 281, 282, 283, 284, 285, 286, 287, 288, 289 }); make_benches!(from2to1024, planned, {290, 291, 292, 293, 294, 295, 296, 297, 298, 299 }); make_benches!(from2to1024, planned, {300, 301, 302, 303, 304, 305, 306, 307, 308, 309 }); make_benches!(from2to1024, planned, {310, 311, 312, 313, 314, 315, 316, 317, 318, 319 }); make_benches!(from2to1024, planned, {320, 321, 322, 323, 324, 325, 326, 327, 328, 329 }); make_benches!(from2to1024, planned, {330, 331, 332, 333, 334, 335, 336, 337, 338, 339 }); make_benches!(from2to1024, planned, {340, 341, 342, 343, 344, 345, 346, 347, 348, 349 }); make_benches!(from2to1024, planned, {350, 351, 352, 353, 354, 355, 356, 357, 358, 359 }); make_benches!(from2to1024, planned, {360, 361, 362, 363, 364, 365, 366, 367, 368, 369 }); make_benches!(from2to1024, planned, {370, 371, 372, 373, 374, 375, 376, 377, 378, 379 }); make_benches!(from2to1024, planned, {380, 381, 382, 383, 384, 385, 386, 387, 388, 389 }); make_benches!(from2to1024, planned, {390, 
391, 392, 393, 394, 395, 396, 397, 398, 399 }); make_benches!(from2to1024, planned, {400, 401, 402, 403, 404, 405, 406, 407, 408, 409 }); make_benches!(from2to1024, planned, {410, 411, 412, 413, 414, 415, 416, 417, 418, 419 }); make_benches!(from2to1024, planned, {420, 421, 422, 423, 424, 425, 426, 427, 428, 429 }); make_benches!(from2to1024, planned, {430, 431, 432, 433, 434, 435, 436, 437, 438, 439 }); make_benches!(from2to1024, planned, {440, 441, 442, 443, 444, 445, 446, 447, 448, 449 }); make_benches!(from2to1024, planned, {450, 451, 452, 453, 454, 455, 456, 457, 458, 459 }); make_benches!(from2to1024, planned, {460, 461, 462, 463, 464, 465, 466, 467, 468, 469 }); make_benches!(from2to1024, planned, {470, 471, 472, 473, 474, 475, 476, 477, 478, 479 }); make_benches!(from2to1024, planned, {480, 481, 482, 483, 484, 485, 486, 487, 488, 489 }); make_benches!(from2to1024, planned, {490, 491, 492, 493, 494, 495, 496, 497, 498, 499 }); make_benches!(from2to1024, planned, {500, 501, 502, 503, 504, 505, 506, 507, 508, 509 }); make_benches!(from2to1024, planned, {510, 511, 512, 513, 514, 515, 516, 517, 518, 519 }); make_benches!(from2to1024, planned, {520, 521, 522, 523, 524, 525, 526, 527, 528, 529 }); make_benches!(from2to1024, planned, {530, 531, 532, 533, 534, 535, 536, 537, 538, 539 }); make_benches!(from2to1024, planned, {540, 541, 542, 543, 544, 545, 546, 547, 548, 549 }); make_benches!(from2to1024, planned, {550, 551, 552, 553, 554, 555, 556, 557, 558, 559 }); make_benches!(from2to1024, planned, {560, 561, 562, 563, 564, 565, 566, 567, 568, 569 }); make_benches!(from2to1024, planned, {570, 571, 572, 573, 574, 575, 576, 577, 578, 579 }); make_benches!(from2to1024, planned, {580, 581, 582, 583, 584, 585, 586, 587, 588, 589 }); make_benches!(from2to1024, planned, {590, 591, 592, 593, 594, 595, 596, 597, 598, 599 }); make_benches!(from2to1024, planned, {600, 601, 602, 603, 604, 605, 606, 607, 608, 609 }); make_benches!(from2to1024, planned, {610, 611, 612, 613, 614, 615, 616, 617, 618, 619 }); make_benches!(from2to1024, planned, {620, 621, 622, 623, 624, 625, 626, 627, 628, 629 }); make_benches!(from2to1024, planned, {630, 631, 632, 633, 634, 635, 636, 637, 638, 639 }); make_benches!(from2to1024, planned, {640, 641, 642, 643, 644, 645, 646, 647, 648, 649 }); make_benches!(from2to1024, planned, {650, 651, 652, 653, 654, 655, 656, 657, 658, 659 }); make_benches!(from2to1024, planned, {660, 661, 662, 663, 664, 665, 666, 667, 668, 669 }); make_benches!(from2to1024, planned, {670, 671, 672, 673, 674, 675, 676, 677, 678, 679 }); make_benches!(from2to1024, planned, {680, 681, 682, 683, 684, 685, 686, 687, 688, 689 }); make_benches!(from2to1024, planned, {690, 691, 692, 693, 694, 695, 696, 697, 698, 699 }); make_benches!(from2to1024, planned, {700, 701, 702, 703, 704, 705, 706, 707, 708, 709 }); make_benches!(from2to1024, planned, {710, 711, 712, 713, 714, 715, 716, 717, 718, 719 }); make_benches!(from2to1024, planned, {720, 721, 722, 723, 724, 725, 726, 727, 728, 729 }); make_benches!(from2to1024, planned, {730, 731, 732, 733, 734, 735, 736, 737, 738, 739 }); make_benches!(from2to1024, planned, {740, 741, 742, 743, 744, 745, 746, 747, 748, 749 }); make_benches!(from2to1024, planned, {750, 751, 752, 753, 754, 755, 756, 757, 758, 759 }); make_benches!(from2to1024, planned, {760, 761, 762, 763, 764, 765, 766, 767, 768, 769 }); make_benches!(from2to1024, planned, {770, 771, 772, 773, 774, 775, 776, 777, 778, 779 }); make_benches!(from2to1024, planned, {780, 781, 782, 783, 784, 785, 786, 787, 788, 789 
}); make_benches!(from2to1024, planned, {790, 791, 792, 793, 794, 795, 796, 797, 798, 799 }); make_benches!(from2to1024, planned, {800, 801, 802, 803, 804, 805, 806, 807, 808, 809 }); make_benches!(from2to1024, planned, {810, 811, 812, 813, 814, 815, 816, 817, 818, 819 }); make_benches!(from2to1024, planned, {820, 821, 822, 823, 824, 825, 826, 827, 828, 829 }); make_benches!(from2to1024, planned, {830, 831, 832, 833, 834, 835, 836, 837, 838, 839 }); make_benches!(from2to1024, planned, {840, 841, 842, 843, 844, 845, 846, 847, 848, 849 }); make_benches!(from2to1024, planned, {850, 851, 852, 853, 854, 855, 856, 857, 858, 859 }); make_benches!(from2to1024, planned, {860, 861, 862, 863, 864, 865, 866, 867, 868, 869 }); make_benches!(from2to1024, planned, {870, 871, 872, 873, 874, 875, 876, 877, 878, 879 }); make_benches!(from2to1024, planned, {880, 881, 882, 883, 884, 885, 886, 887, 888, 889 }); make_benches!(from2to1024, planned, {890, 891, 892, 893, 894, 895, 896, 897, 898, 899 }); make_benches!(from2to1024, planned, {900, 901, 902, 903, 904, 905, 906, 907, 908, 909 }); make_benches!(from2to1024, planned, {910, 911, 912, 913, 914, 915, 916, 917, 918, 919 }); make_benches!(from2to1024, planned, {920, 921, 922, 923, 924, 925, 926, 927, 928, 929 }); make_benches!(from2to1024, planned, {930, 931, 932, 933, 934, 935, 936, 937, 938, 939 }); make_benches!(from2to1024, planned, {940, 941, 942, 943, 944, 945, 946, 947, 948, 949 }); make_benches!(from2to1024, planned, {950, 951, 952, 953, 954, 955, 956, 957, 958, 959 }); make_benches!(from2to1024, planned, {960, 961, 962, 963, 964, 965, 966, 967, 968, 969 }); make_benches!(from2to1024, planned, {970, 971, 972, 973, 974, 975, 976, 977, 978, 979 }); make_benches!(from2to1024, planned, {980, 981, 982, 983, 984, 985, 986, 987, 988, 989 }); make_benches!(from2to1024, planned, {990, 991, 992, 993, 994, 995, 996, 997, 998, 999 }); make_benches!(from2to1024, planned, {1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009 }); make_benches!(from2to1024, planned, {1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019 }); make_benches!(from2to1024, planned, {1020, 1021, 1022, 1023, 1024 }); */ rustfft-6.2.0/benches/bench_check_scalar_2to1024.rs000064400000000000000000000276360072674642500202050ustar 00000000000000#![feature(test)] extern crate rustfft; extern crate test; use paste::paste; use rustfft::num_complex::Complex; use rustfft::num_traits::Zero; use rustfft::{Fft, FftDirection}; use std::sync::Arc; use test::Bencher; // Make fft using planner fn bench_planned_32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using planner fn bench_planned_64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Create benches using functions taking one argument macro_rules! make_benches { ($name:ident, $fname:ident, { $($len:literal),* }) => { paste! 
{ $( #[bench] fn [](b: &mut Bencher) { [](b, $len); } #[bench] fn [](b: &mut Bencher) { [](b, $len); } )* } } } make_benches!(from2to1024, planned, {2, 3, 4, 5, 6, 7, 8, 9 }); make_benches!(from2to1024, planned, {10, 11, 12, 13, 14, 15, 16, 17, 18, 19 }); make_benches!(from2to1024, planned, {20, 21, 22, 23, 24, 25, 26, 27, 28, 29 }); make_benches!(from2to1024, planned, {30, 31, 32, 33, 34, 35, 36, 37, 38, 39 }); make_benches!(from2to1024, planned, {40, 41, 42, 43, 44, 45, 46, 47, 48, 49 }); make_benches!(from2to1024, planned, {50, 51, 52, 53, 54, 55, 56, 57, 58, 59 }); make_benches!(from2to1024, planned, {60, 61, 62, 63, 64, 65, 66, 67, 68, 69 }); make_benches!(from2to1024, planned, {70, 71, 72, 73, 74, 75, 76, 77, 78, 79 }); make_benches!(from2to1024, planned, {80, 81, 82, 83, 84, 85, 86, 87, 88, 89 }); make_benches!(from2to1024, planned, {90, 91, 92, 93, 94, 95, 96, 97, 98, 99 }); make_benches!(from2to1024, planned, {100, 101, 102, 103, 104, 105, 106, 107, 108, 109 }); make_benches!(from2to1024, planned, {110, 111, 112, 113, 114, 115, 116, 117, 118, 119 }); make_benches!(from2to1024, planned, {120, 121, 122, 123, 124, 125, 126, 127, 128, 129 }); make_benches!(from2to1024, planned, {130, 131, 132, 133, 134, 135, 136, 137, 138, 139 }); make_benches!(from2to1024, planned, {140, 141, 142, 143, 144, 145, 146, 147, 148, 149 }); make_benches!(from2to1024, planned, {150, 151, 152, 153, 154, 155, 156, 157, 158, 159 }); make_benches!(from2to1024, planned, {160, 161, 162, 163, 164, 165, 166, 167, 168, 169 }); make_benches!(from2to1024, planned, {170, 171, 172, 173, 174, 175, 176, 177, 178, 179 }); make_benches!(from2to1024, planned, {180, 181, 182, 183, 184, 185, 186, 187, 188, 189 }); make_benches!(from2to1024, planned, {190, 191, 192, 193, 194, 195, 196, 197, 198, 199 }); /* make_benches!(from2to1024, planned, {200, 201, 202, 203, 204, 205, 206, 207, 208, 209 }); make_benches!(from2to1024, planned, {210, 211, 212, 213, 214, 215, 216, 217, 218, 219 }); make_benches!(from2to1024, planned, {220, 221, 222, 223, 224, 225, 226, 227, 228, 229 }); make_benches!(from2to1024, planned, {230, 231, 232, 233, 234, 235, 236, 237, 238, 239 }); make_benches!(from2to1024, planned, {240, 241, 242, 243, 244, 245, 246, 247, 248, 249 }); make_benches!(from2to1024, planned, {250, 251, 252, 253, 254, 255, 256, 257, 258, 259 }); make_benches!(from2to1024, planned, {260, 261, 262, 263, 264, 265, 266, 267, 268, 269 }); make_benches!(from2to1024, planned, {270, 271, 272, 273, 274, 275, 276, 277, 278, 279 }); make_benches!(from2to1024, planned, {280, 281, 282, 283, 284, 285, 286, 287, 288, 289 }); make_benches!(from2to1024, planned, {290, 291, 292, 293, 294, 295, 296, 297, 298, 299 }); make_benches!(from2to1024, planned, {300, 301, 302, 303, 304, 305, 306, 307, 308, 309 }); make_benches!(from2to1024, planned, {310, 311, 312, 313, 314, 315, 316, 317, 318, 319 }); make_benches!(from2to1024, planned, {320, 321, 322, 323, 324, 325, 326, 327, 328, 329 }); make_benches!(from2to1024, planned, {330, 331, 332, 333, 334, 335, 336, 337, 338, 339 }); make_benches!(from2to1024, planned, {340, 341, 342, 343, 344, 345, 346, 347, 348, 349 }); make_benches!(from2to1024, planned, {350, 351, 352, 353, 354, 355, 356, 357, 358, 359 }); make_benches!(from2to1024, planned, {360, 361, 362, 363, 364, 365, 366, 367, 368, 369 }); make_benches!(from2to1024, planned, {370, 371, 372, 373, 374, 375, 376, 377, 378, 379 }); make_benches!(from2to1024, planned, {380, 381, 382, 383, 384, 385, 386, 387, 388, 389 }); make_benches!(from2to1024, planned, {390, 
391, 392, 393, 394, 395, 396, 397, 398, 399 }); make_benches!(from2to1024, planned, {400, 401, 402, 403, 404, 405, 406, 407, 408, 409 }); make_benches!(from2to1024, planned, {410, 411, 412, 413, 414, 415, 416, 417, 418, 419 }); make_benches!(from2to1024, planned, {420, 421, 422, 423, 424, 425, 426, 427, 428, 429 }); make_benches!(from2to1024, planned, {430, 431, 432, 433, 434, 435, 436, 437, 438, 439 }); make_benches!(from2to1024, planned, {440, 441, 442, 443, 444, 445, 446, 447, 448, 449 }); make_benches!(from2to1024, planned, {450, 451, 452, 453, 454, 455, 456, 457, 458, 459 }); make_benches!(from2to1024, planned, {460, 461, 462, 463, 464, 465, 466, 467, 468, 469 }); make_benches!(from2to1024, planned, {470, 471, 472, 473, 474, 475, 476, 477, 478, 479 }); make_benches!(from2to1024, planned, {480, 481, 482, 483, 484, 485, 486, 487, 488, 489 }); make_benches!(from2to1024, planned, {490, 491, 492, 493, 494, 495, 496, 497, 498, 499 }); make_benches!(from2to1024, planned, {500, 501, 502, 503, 504, 505, 506, 507, 508, 509 }); make_benches!(from2to1024, planned, {510, 511, 512, 513, 514, 515, 516, 517, 518, 519 }); make_benches!(from2to1024, planned, {520, 521, 522, 523, 524, 525, 526, 527, 528, 529 }); make_benches!(from2to1024, planned, {530, 531, 532, 533, 534, 535, 536, 537, 538, 539 }); make_benches!(from2to1024, planned, {540, 541, 542, 543, 544, 545, 546, 547, 548, 549 }); make_benches!(from2to1024, planned, {550, 551, 552, 553, 554, 555, 556, 557, 558, 559 }); make_benches!(from2to1024, planned, {560, 561, 562, 563, 564, 565, 566, 567, 568, 569 }); make_benches!(from2to1024, planned, {570, 571, 572, 573, 574, 575, 576, 577, 578, 579 }); make_benches!(from2to1024, planned, {580, 581, 582, 583, 584, 585, 586, 587, 588, 589 }); make_benches!(from2to1024, planned, {590, 591, 592, 593, 594, 595, 596, 597, 598, 599 }); make_benches!(from2to1024, planned, {600, 601, 602, 603, 604, 605, 606, 607, 608, 609 }); make_benches!(from2to1024, planned, {610, 611, 612, 613, 614, 615, 616, 617, 618, 619 }); make_benches!(from2to1024, planned, {620, 621, 622, 623, 624, 625, 626, 627, 628, 629 }); make_benches!(from2to1024, planned, {630, 631, 632, 633, 634, 635, 636, 637, 638, 639 }); make_benches!(from2to1024, planned, {640, 641, 642, 643, 644, 645, 646, 647, 648, 649 }); make_benches!(from2to1024, planned, {650, 651, 652, 653, 654, 655, 656, 657, 658, 659 }); make_benches!(from2to1024, planned, {660, 661, 662, 663, 664, 665, 666, 667, 668, 669 }); make_benches!(from2to1024, planned, {670, 671, 672, 673, 674, 675, 676, 677, 678, 679 }); make_benches!(from2to1024, planned, {680, 681, 682, 683, 684, 685, 686, 687, 688, 689 }); make_benches!(from2to1024, planned, {690, 691, 692, 693, 694, 695, 696, 697, 698, 699 }); make_benches!(from2to1024, planned, {700, 701, 702, 703, 704, 705, 706, 707, 708, 709 }); make_benches!(from2to1024, planned, {710, 711, 712, 713, 714, 715, 716, 717, 718, 719 }); make_benches!(from2to1024, planned, {720, 721, 722, 723, 724, 725, 726, 727, 728, 729 }); make_benches!(from2to1024, planned, {730, 731, 732, 733, 734, 735, 736, 737, 738, 739 }); make_benches!(from2to1024, planned, {740, 741, 742, 743, 744, 745, 746, 747, 748, 749 }); make_benches!(from2to1024, planned, {750, 751, 752, 753, 754, 755, 756, 757, 758, 759 }); make_benches!(from2to1024, planned, {760, 761, 762, 763, 764, 765, 766, 767, 768, 769 }); make_benches!(from2to1024, planned, {770, 771, 772, 773, 774, 775, 776, 777, 778, 779 }); make_benches!(from2to1024, planned, {780, 781, 782, 783, 784, 785, 786, 787, 788, 789 
}); make_benches!(from2to1024, planned, {790, 791, 792, 793, 794, 795, 796, 797, 798, 799 }); make_benches!(from2to1024, planned, {800, 801, 802, 803, 804, 805, 806, 807, 808, 809 }); make_benches!(from2to1024, planned, {810, 811, 812, 813, 814, 815, 816, 817, 818, 819 }); make_benches!(from2to1024, planned, {820, 821, 822, 823, 824, 825, 826, 827, 828, 829 }); make_benches!(from2to1024, planned, {830, 831, 832, 833, 834, 835, 836, 837, 838, 839 }); make_benches!(from2to1024, planned, {840, 841, 842, 843, 844, 845, 846, 847, 848, 849 }); make_benches!(from2to1024, planned, {850, 851, 852, 853, 854, 855, 856, 857, 858, 859 }); make_benches!(from2to1024, planned, {860, 861, 862, 863, 864, 865, 866, 867, 868, 869 }); make_benches!(from2to1024, planned, {870, 871, 872, 873, 874, 875, 876, 877, 878, 879 }); make_benches!(from2to1024, planned, {880, 881, 882, 883, 884, 885, 886, 887, 888, 889 }); make_benches!(from2to1024, planned, {890, 891, 892, 893, 894, 895, 896, 897, 898, 899 }); make_benches!(from2to1024, planned, {900, 901, 902, 903, 904, 905, 906, 907, 908, 909 }); make_benches!(from2to1024, planned, {910, 911, 912, 913, 914, 915, 916, 917, 918, 919 }); make_benches!(from2to1024, planned, {920, 921, 922, 923, 924, 925, 926, 927, 928, 929 }); make_benches!(from2to1024, planned, {930, 931, 932, 933, 934, 935, 936, 937, 938, 939 }); make_benches!(from2to1024, planned, {940, 941, 942, 943, 944, 945, 946, 947, 948, 949 }); make_benches!(from2to1024, planned, {950, 951, 952, 953, 954, 955, 956, 957, 958, 959 }); make_benches!(from2to1024, planned, {960, 961, 962, 963, 964, 965, 966, 967, 968, 969 }); make_benches!(from2to1024, planned, {970, 971, 972, 973, 974, 975, 976, 977, 978, 979 }); make_benches!(from2to1024, planned, {980, 981, 982, 983, 984, 985, 986, 987, 988, 989 }); make_benches!(from2to1024, planned, {990, 991, 992, 993, 994, 995, 996, 997, 998, 999 }); make_benches!(from2to1024, planned, {1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009 }); make_benches!(from2to1024, planned, {1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019 }); make_benches!(from2to1024, planned, {1020, 1021, 1022, 1023, 1024 }); */ make_benches!(power3_planned_scalar, planned, {0000003, 0000009, 0000027, 0000081, 0000243, 0000729, 0002187, 0006561, 0019683, 0059049, 0177147, 0531441, 1594323, 4782969 }); fn bench_radix3_32(b: &mut Bencher, len: usize) { let fft: Arc> = Arc::new(rustfft::algorithm::Radix3::new(len, FftDirection::Forward)); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } fn bench_radix3_64(b: &mut Bencher, len: usize) { let fft: Arc> = Arc::new(rustfft::algorithm::Radix3::new(len, FftDirection::Forward)); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } make_benches!(power3_radix3_scalar, radix3, {0000003, 0000009, 0000027, 0000081, 0000243, 0000729, 0002187, 0006561, 0019683, 0059049, 0177147, 0531441, 1594323, 4782969 }); rustfft-6.2.0/benches/bench_check_sse_2to1024.rs000064400000000000000000000253460072674642500175260ustar 00000000000000#![feature(test)] extern crate rustfft; extern crate test; use paste::paste; use rustfft::num_complex::Complex; use rustfft::num_traits::Zero; use rustfft::Fft; use std::sync::Arc; use test::Bencher; // Make fft using planner fn 
bench_planned_32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerSse::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using planner fn bench_planned_64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerSse::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Create benches using functions taking one argument macro_rules! make_benches { ($name:ident, $fname:ident, { $($len:literal),* }) => { paste! { $( #[bench] fn [](b: &mut Bencher) { [](b, $len); } #[bench] fn [](b: &mut Bencher) { [](b, $len); } )* } } } make_benches!(from2to1024, planned, {2, 3, 4, 5, 6, 7, 8, 9 }); make_benches!(from2to1024, planned, {10, 11, 12, 13, 14, 15, 16, 17, 18, 19 }); make_benches!(from2to1024, planned, {20, 21, 22, 23, 24, 25, 26, 27, 28, 29 }); make_benches!(from2to1024, planned, {30, 31, 32, 33, 34, 35, 36, 37, 38, 39 }); make_benches!(from2to1024, planned, {40, 41, 42, 43, 44, 45, 46, 47, 48, 49 }); make_benches!(from2to1024, planned, {50, 51, 52, 53, 54, 55, 56, 57, 58, 59 }); make_benches!(from2to1024, planned, {60, 61, 62, 63, 64, 65, 66, 67, 68, 69 }); make_benches!(from2to1024, planned, {70, 71, 72, 73, 74, 75, 76, 77, 78, 79 }); make_benches!(from2to1024, planned, {80, 81, 82, 83, 84, 85, 86, 87, 88, 89 }); make_benches!(from2to1024, planned, {90, 91, 92, 93, 94, 95, 96, 97, 98, 99 }); make_benches!(from2to1024, planned, {100, 101, 102, 103, 104, 105, 106, 107, 108, 109 }); make_benches!(from2to1024, planned, {110, 111, 112, 113, 114, 115, 116, 117, 118, 119 }); make_benches!(from2to1024, planned, {120, 121, 122, 123, 124, 125, 126, 127, 128, 129 }); make_benches!(from2to1024, planned, {130, 131, 132, 133, 134, 135, 136, 137, 138, 139 }); make_benches!(from2to1024, planned, {140, 141, 142, 143, 144, 145, 146, 147, 148, 149 }); make_benches!(from2to1024, planned, {150, 151, 152, 153, 154, 155, 156, 157, 158, 159 }); make_benches!(from2to1024, planned, {160, 161, 162, 163, 164, 165, 166, 167, 168, 169 }); make_benches!(from2to1024, planned, {170, 171, 172, 173, 174, 175, 176, 177, 178, 179 }); make_benches!(from2to1024, planned, {180, 181, 182, 183, 184, 185, 186, 187, 188, 189 }); make_benches!(from2to1024, planned, {190, 191, 192, 193, 194, 195, 196, 197, 198, 199 }); /* make_benches!(from2to1024, planned, {200, 201, 202, 203, 204, 205, 206, 207, 208, 209 }); make_benches!(from2to1024, planned, {210, 211, 212, 213, 214, 215, 216, 217, 218, 219 }); make_benches!(from2to1024, planned, {220, 221, 222, 223, 224, 225, 226, 227, 228, 229 }); make_benches!(from2to1024, planned, {230, 231, 232, 233, 234, 235, 236, 237, 238, 239 }); make_benches!(from2to1024, planned, {240, 241, 242, 243, 244, 245, 246, 247, 248, 249 }); make_benches!(from2to1024, planned, {250, 251, 252, 253, 254, 255, 256, 257, 258, 259 }); make_benches!(from2to1024, planned, {260, 261, 262, 263, 264, 265, 266, 267, 268, 269 }); make_benches!(from2to1024, planned, {270, 271, 272, 273, 274, 275, 276, 277, 278, 279 }); make_benches!(from2to1024, planned, {280, 281, 282, 283, 284, 285, 286, 287, 288, 289 }); make_benches!(from2to1024, planned, {290, 291, 292, 293, 294, 
295, 296, 297, 298, 299 }); make_benches!(from2to1024, planned, {300, 301, 302, 303, 304, 305, 306, 307, 308, 309 }); make_benches!(from2to1024, planned, {310, 311, 312, 313, 314, 315, 316, 317, 318, 319 }); make_benches!(from2to1024, planned, {320, 321, 322, 323, 324, 325, 326, 327, 328, 329 }); make_benches!(from2to1024, planned, {330, 331, 332, 333, 334, 335, 336, 337, 338, 339 }); make_benches!(from2to1024, planned, {340, 341, 342, 343, 344, 345, 346, 347, 348, 349 }); make_benches!(from2to1024, planned, {350, 351, 352, 353, 354, 355, 356, 357, 358, 359 }); make_benches!(from2to1024, planned, {360, 361, 362, 363, 364, 365, 366, 367, 368, 369 }); make_benches!(from2to1024, planned, {370, 371, 372, 373, 374, 375, 376, 377, 378, 379 }); make_benches!(from2to1024, planned, {380, 381, 382, 383, 384, 385, 386, 387, 388, 389 }); make_benches!(from2to1024, planned, {390, 391, 392, 393, 394, 395, 396, 397, 398, 399 }); make_benches!(from2to1024, planned, {400, 401, 402, 403, 404, 405, 406, 407, 408, 409 }); make_benches!(from2to1024, planned, {410, 411, 412, 413, 414, 415, 416, 417, 418, 419 }); make_benches!(from2to1024, planned, {420, 421, 422, 423, 424, 425, 426, 427, 428, 429 }); make_benches!(from2to1024, planned, {430, 431, 432, 433, 434, 435, 436, 437, 438, 439 }); make_benches!(from2to1024, planned, {440, 441, 442, 443, 444, 445, 446, 447, 448, 449 }); make_benches!(from2to1024, planned, {450, 451, 452, 453, 454, 455, 456, 457, 458, 459 }); make_benches!(from2to1024, planned, {460, 461, 462, 463, 464, 465, 466, 467, 468, 469 }); make_benches!(from2to1024, planned, {470, 471, 472, 473, 474, 475, 476, 477, 478, 479 }); make_benches!(from2to1024, planned, {480, 481, 482, 483, 484, 485, 486, 487, 488, 489 }); make_benches!(from2to1024, planned, {490, 491, 492, 493, 494, 495, 496, 497, 498, 499 }); make_benches!(from2to1024, planned, {500, 501, 502, 503, 504, 505, 506, 507, 508, 509 }); make_benches!(from2to1024, planned, {510, 511, 512, 513, 514, 515, 516, 517, 518, 519 }); make_benches!(from2to1024, planned, {520, 521, 522, 523, 524, 525, 526, 527, 528, 529 }); make_benches!(from2to1024, planned, {530, 531, 532, 533, 534, 535, 536, 537, 538, 539 }); make_benches!(from2to1024, planned, {540, 541, 542, 543, 544, 545, 546, 547, 548, 549 }); make_benches!(from2to1024, planned, {550, 551, 552, 553, 554, 555, 556, 557, 558, 559 }); make_benches!(from2to1024, planned, {560, 561, 562, 563, 564, 565, 566, 567, 568, 569 }); make_benches!(from2to1024, planned, {570, 571, 572, 573, 574, 575, 576, 577, 578, 579 }); make_benches!(from2to1024, planned, {580, 581, 582, 583, 584, 585, 586, 587, 588, 589 }); make_benches!(from2to1024, planned, {590, 591, 592, 593, 594, 595, 596, 597, 598, 599 }); make_benches!(from2to1024, planned, {600, 601, 602, 603, 604, 605, 606, 607, 608, 609 }); make_benches!(from2to1024, planned, {610, 611, 612, 613, 614, 615, 616, 617, 618, 619 }); make_benches!(from2to1024, planned, {620, 621, 622, 623, 624, 625, 626, 627, 628, 629 }); make_benches!(from2to1024, planned, {630, 631, 632, 633, 634, 635, 636, 637, 638, 639 }); make_benches!(from2to1024, planned, {640, 641, 642, 643, 644, 645, 646, 647, 648, 649 }); make_benches!(from2to1024, planned, {650, 651, 652, 653, 654, 655, 656, 657, 658, 659 }); make_benches!(from2to1024, planned, {660, 661, 662, 663, 664, 665, 666, 667, 668, 669 }); make_benches!(from2to1024, planned, {670, 671, 672, 673, 674, 675, 676, 677, 678, 679 }); make_benches!(from2to1024, planned, {680, 681, 682, 683, 684, 685, 686, 687, 688, 689 }); 
make_benches!(from2to1024, planned, {690, 691, 692, 693, 694, 695, 696, 697, 698, 699 }); make_benches!(from2to1024, planned, {700, 701, 702, 703, 704, 705, 706, 707, 708, 709 }); make_benches!(from2to1024, planned, {710, 711, 712, 713, 714, 715, 716, 717, 718, 719 }); make_benches!(from2to1024, planned, {720, 721, 722, 723, 724, 725, 726, 727, 728, 729 }); make_benches!(from2to1024, planned, {730, 731, 732, 733, 734, 735, 736, 737, 738, 739 }); make_benches!(from2to1024, planned, {740, 741, 742, 743, 744, 745, 746, 747, 748, 749 }); make_benches!(from2to1024, planned, {750, 751, 752, 753, 754, 755, 756, 757, 758, 759 }); make_benches!(from2to1024, planned, {760, 761, 762, 763, 764, 765, 766, 767, 768, 769 }); make_benches!(from2to1024, planned, {770, 771, 772, 773, 774, 775, 776, 777, 778, 779 }); make_benches!(from2to1024, planned, {780, 781, 782, 783, 784, 785, 786, 787, 788, 789 }); make_benches!(from2to1024, planned, {790, 791, 792, 793, 794, 795, 796, 797, 798, 799 }); make_benches!(from2to1024, planned, {800, 801, 802, 803, 804, 805, 806, 807, 808, 809 }); make_benches!(from2to1024, planned, {810, 811, 812, 813, 814, 815, 816, 817, 818, 819 }); make_benches!(from2to1024, planned, {820, 821, 822, 823, 824, 825, 826, 827, 828, 829 }); make_benches!(from2to1024, planned, {830, 831, 832, 833, 834, 835, 836, 837, 838, 839 }); make_benches!(from2to1024, planned, {840, 841, 842, 843, 844, 845, 846, 847, 848, 849 }); make_benches!(from2to1024, planned, {850, 851, 852, 853, 854, 855, 856, 857, 858, 859 }); make_benches!(from2to1024, planned, {860, 861, 862, 863, 864, 865, 866, 867, 868, 869 }); make_benches!(from2to1024, planned, {870, 871, 872, 873, 874, 875, 876, 877, 878, 879 }); make_benches!(from2to1024, planned, {880, 881, 882, 883, 884, 885, 886, 887, 888, 889 }); make_benches!(from2to1024, planned, {890, 891, 892, 893, 894, 895, 896, 897, 898, 899 }); make_benches!(from2to1024, planned, {900, 901, 902, 903, 904, 905, 906, 907, 908, 909 }); make_benches!(from2to1024, planned, {910, 911, 912, 913, 914, 915, 916, 917, 918, 919 }); make_benches!(from2to1024, planned, {920, 921, 922, 923, 924, 925, 926, 927, 928, 929 }); make_benches!(from2to1024, planned, {930, 931, 932, 933, 934, 935, 936, 937, 938, 939 }); make_benches!(from2to1024, planned, {940, 941, 942, 943, 944, 945, 946, 947, 948, 949 }); make_benches!(from2to1024, planned, {950, 951, 952, 953, 954, 955, 956, 957, 958, 959 }); make_benches!(from2to1024, planned, {960, 961, 962, 963, 964, 965, 966, 967, 968, 969 }); make_benches!(from2to1024, planned, {970, 971, 972, 973, 974, 975, 976, 977, 978, 979 }); make_benches!(from2to1024, planned, {980, 981, 982, 983, 984, 985, 986, 987, 988, 989 }); make_benches!(from2to1024, planned, {990, 991, 992, 993, 994, 995, 996, 997, 998, 999 }); make_benches!(from2to1024, planned, {1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009 }); make_benches!(from2to1024, planned, {1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019 }); make_benches!(from2to1024, planned, {1020, 1021, 1022, 1023, 1024 }); */ rustfft-6.2.0/benches/bench_compare_scalar_neon.rs000064400000000000000000000061140072674642500204660ustar 00000000000000#![feature(test)] extern crate rustfft; extern crate test; use paste::paste; use rustfft::num_complex::Complex; use rustfft::num_traits::Zero; use rustfft::Fft; use std::sync::Arc; use test::Bencher; // Make fft using scalar planner fn bench_scalar_32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = 
planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using scalar planner fn bench_scalar_64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using sse planner fn bench_neon_32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerNeon::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using sse planner fn bench_neon_64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerNeon::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Create benches using functions taking one argument macro_rules! make_benches { ($name:ident, { $($len:literal),* }) => { paste! { $( #[bench] fn [<$name _ $len _f32_scalar>](b: &mut Bencher) { [](b, $len); } #[bench] fn [<$name _ $len _f64_scalar>](b: &mut Bencher) { [](b, $len); } #[bench] fn [<$name _ $len _f32_neon>](b: &mut Bencher) { [](b, $len); } #[bench] fn [<$name _ $len _f64_neon>](b: &mut Bencher) { [](b, $len); } )* } } } make_benches!(neoncomparison, {4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072}); make_benches!(neoncomparison, { 262144, 524288, 1048576, 2097152, 4194304 }); rustfft-6.2.0/benches/bench_compare_scalar_sse_avx.rs000064400000000000000000000104340072674642500211770ustar 00000000000000#![feature(test)] extern crate rustfft; extern crate test; use paste::paste; use rustfft::num_complex::Complex; use rustfft::num_traits::Zero; use rustfft::Fft; use std::sync::Arc; use test::Bencher; // Make fft using scalar planner fn bench_scalar_32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using scalar planner fn bench_scalar_64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using sse planner fn bench_sse_32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerSse::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using sse planner fn bench_sse_64(b: &mut Bencher, 
len: usize) { let mut planner = rustfft::FftPlannerSse::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using sse planner fn bench_avx_32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerAvx::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using sse planner fn bench_avx_64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerAvx::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Create benches using functions taking one argument macro_rules! make_benches { ($name:ident, { $($len:literal),* }) => { paste! { $( #[bench] fn [<$name _ $len _f32_scalar>](b: &mut Bencher) { [](b, $len); } #[bench] fn [<$name _ $len _f64_scalar>](b: &mut Bencher) { [](b, $len); } #[bench] fn [<$name _ $len _f32_sse>](b: &mut Bencher) { [](b, $len); } #[bench] fn [<$name _ $len _f64_sse>](b: &mut Bencher) { [](b, $len); } #[bench] fn [<$name _ $len _f32_avx>](b: &mut Bencher) { [](b, $len); } #[bench] fn [<$name _ $len _f64_avx>](b: &mut Bencher) { [](b, $len); } )* } } } make_benches!(comparison, {16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768 }); make_benches!(comparison, {65536, 131072, 262144, 524288, 1048576, 2097152, 4194304 }); rustfft-6.2.0/benches/bench_compare_scalar_wasm_simd.rs000064400000000000000000000106610072674642500215140ustar 00000000000000#![feature(test)] /// Unfortunately, `cargo bench` does not permit running these benchmarks out-of-the-box /// on a WebAssembly virtual machine. /// /// Follow these steps to run these benchmarks: /// 0. Prerequisites: Install the `wasm32-wasi` target and `wasmer` /// /// /// 1. Build these benchmarks /// ```bash /// cargo build --bench=bench_rustfft_wasm_simd --release --target wasm32-wasi --features "wasm_simd" /// ``` /// /// After cargo built the bench binary, cargo stores it inside the /// `/target/wasm32-wasi/release/deps` directory. /// The file name of this binary follows this format: `bench_rustfft_wasm_simd-.wasm`. /// For instance, it could be named /// `target/wasm32-wasi/release/deps/bench_rustfft_scalar-6d2b3d5a567416f5.wasm` /// /// 2. Copy the most recently built WASM binary to hex.wasm /// ```bash /// cp `ls -t target/wasm32-wasi/release/deps/*.wasm | head -n 1` hex.wasm /// ``` /// /// 3. Run these benchmark e. g. with [wasmer](https://github.com/wasmerio/wasmer) /// ```bash /// wasmer run --dir=. hex.wasm -- --bench /// ``` /// /// For more information, refer to [Criterion's user guide](https://github.com/bheisler/criterion.rs/blob/dc2b06cd31f7aa34cff6a83a00598e0523186dad/book/src/user_guide/wasi.md) /// which should be mostly applicable to our use case. 
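// The `make_benches!` macro further down uses `paste!` to stamp out one #[bench] function per
// (length, float type, planner) combination. As a hand-written sketch of roughly what a single
// expansion boils down to, the two functions below forward one fixed length (4, taken from the
// list at the bottom of this file) to the helper functions defined later in this file. The
// `_sketch` names are illustrative only and are not the identifiers `paste!` actually generates.
#[bench]
fn wasmsimdcomparison_4_f32_scalar_sketch(b: &mut Bencher) {
    // Same body shape as a generated bench: delegate to the f32 scalar helper with len = 4.
    bench_scalar_32(b, 4);
}
#[bench]
fn wasmsimdcomparison_4_f32_wasmsimd_sketch(b: &mut Bencher) {
    // And its WASM SIMD counterpart, so the two planners can be compared at the same length.
    bench_wasmsimd_32(b, 4);
}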
extern crate rustfft; extern crate test; use paste::paste; use rustfft::num_complex::Complex; use rustfft::num_traits::Zero; use rustfft::Fft; use std::sync::Arc; use test::Bencher; // Make fft using scalar planner fn bench_scalar_32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using scalar planner fn bench_scalar_64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using WASM SIMD planner fn bench_wasmsimd_32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerWasmSimd::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Make fft using WASM SIMD planner fn bench_wasmsimd_64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerWasmSimd::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer: Vec> = vec![Complex::zero(); len]; let mut scratch: Vec> = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Create benches using functions taking one argument macro_rules! make_benches { ($name:ident, { $($len:literal),* }) => { paste! 
{ $( #[bench] fn [<$name _ $len _f32_scalar>](b: &mut Bencher) { [](b, $len); } #[bench] fn [<$name _ $len _f64_scalar>](b: &mut Bencher) { [](b, $len); } #[bench] fn [<$name _ $len _f32_wasmsimd>](b: &mut Bencher) { [](b, $len); } #[bench] fn [<$name _ $len _f64_wasmsimd>](b: &mut Bencher) { [](b, $len); } )* } } } make_benches!(wasmsimdcomparison, {4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072}); make_benches!(wasmsimdcomparison, { 262144, 524288, 1048576, 2097152, 4194304 }); rustfft-6.2.0/benches/bench_rustfft.rs000064400000000000000000000765320072674642500162040ustar 00000000000000#![allow(bare_trait_objects)] #![allow(non_snake_case)] #![feature(test)] extern crate test; extern crate rustfft; use std::sync::Arc; use test::Bencher; use rustfft::{Direction, FftNum, Fft, FftDirection, Length}; use rustfft::num_complex::Complex; use rustfft::num_traits::Zero; use rustfft::algorithm::*; use rustfft::algorithm::butterflies::*; struct Noop { len: usize, direction: FftDirection, } impl Fft for Noop { fn process_with_scratch(&self, _buffer: &mut [Complex], _scratch: &mut [Complex]) {} fn process_outofplace_with_scratch(&self, _input: &mut [Complex], _output: &mut [Complex], _scratch: &mut [Complex]) {} fn get_inplace_scratch_len(&self) -> usize { self.len } fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for Noop { fn len(&self) -> usize { self.len } } impl Direction for Noop { fn fft_direction(&self) -> FftDirection { self.direction } } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length fn bench_planned_f32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlanner::new(); let fft: Arc> = planner.plan_fft_forward(len); assert_eq!(fft.len(), len); let mut buffer = vec![Complex::zero(); len]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Powers of 4 #[bench] fn planned32_p2_00000064(b: &mut Bencher) { bench_planned_f32(b, 64); } #[bench] fn planned32_p2_00000128(b: &mut Bencher) { bench_planned_f32(b, 128); } #[bench] fn planned32_p2_00000256(b: &mut Bencher) { bench_planned_f32(b, 256); } #[bench] fn planned32_p2_00000512(b: &mut Bencher) { bench_planned_f32(b, 512); } #[bench] fn planned32_p2_00001024(b: &mut Bencher) { bench_planned_f32(b, 1024); } #[bench] fn planned32_p2_00002048(b: &mut Bencher) { bench_planned_f32(b, 2048); } #[bench] fn planned32_p2_00004096(b: &mut Bencher) { bench_planned_f32(b, 4096); } #[bench] fn planned32_p2_00016384(b: &mut Bencher) { bench_planned_f32(b, 16384); } #[bench] fn planned32_p2_00065536(b: &mut Bencher) { bench_planned_f32(b, 65536); } #[bench] fn planned32_p2_01048576(b: &mut Bencher) { bench_planned_f32(b, 1048576); } #[bench] fn planned32_p2_16777216(b: &mut Bencher) { bench_planned_f32(b, 16777216); } // Powers of 5 #[bench] fn planned32_p5_00125(b: &mut Bencher) { bench_planned_f32(b, 125); } #[bench] fn planned32_p5_00625(b: &mut Bencher) { bench_planned_f32(b, 625); } #[bench] fn planned32_p5_03125(b: &mut Bencher) { bench_planned_f32(b, 3125); } #[bench] fn planned32_p5_15625(b: &mut Bencher) { bench_planned_f32(b, 15625); } // Powers of 7 #[bench] fn planned32_p7_00343(b: &mut Bencher) { bench_planned_f32(b, 343); } #[bench] fn planned32_p7_02401(b: &mut Bencher) { bench_planned_f32(b, 2401); } #[bench] fn planned32_p7_16807(b: &mut Bencher) { bench_planned_f32(b, 16807); } // Prime lengths // Prime lengths #[bench] fn 
planned32_prime_0005(b: &mut Bencher) { bench_planned_f32(b, 5); } #[bench] fn planned32_prime_0017(b: &mut Bencher) { bench_planned_f32(b, 17); } #[bench] fn planned32_prime_0149(b: &mut Bencher) { bench_planned_f32(b, 149); } #[bench] fn planned32_prime_0151(b: &mut Bencher) { bench_planned_f32(b, 151); } #[bench] fn planned32_prime_0251(b: &mut Bencher) { bench_planned_f32(b, 251); } #[bench] fn planned32_prime_0257(b: &mut Bencher) { bench_planned_f32(b, 257); } #[bench] fn planned32_prime_1009(b: &mut Bencher) { bench_planned_f32(b, 1009); } #[bench] fn planned32_prime_1201(b: &mut Bencher) { bench_planned_f32(b, 1201); } #[bench] fn planned32_prime_2017(b: &mut Bencher) { bench_planned_f32(b, 2017); } #[bench] fn planned32_prime_2879(b: &mut Bencher) { bench_planned_f32(b, 2879); } #[bench] fn planned32_prime_32767(b: &mut Bencher) { bench_planned_f32(b, 32767); } #[bench] fn planned32_prime_65521(b: &mut Bencher) { bench_planned_f32(b, 65521); } #[bench] fn planned32_prime_65537(b: &mut Bencher) { bench_planned_f32(b, 65537); } #[bench] fn planned32_prime_746483(b: &mut Bencher) { bench_planned_f32(b,746483); } #[bench] fn planned32_prime_746497(b: &mut Bencher) { bench_planned_f32(b,746497); } //primes raised to a power #[bench] fn planned32_primepower_044521(b: &mut Bencher) { bench_planned_f32(b, 44521); } // 211^2 #[bench] fn planned32_primepower_160801(b: &mut Bencher) { bench_planned_f32(b, 160801); } // 401^2 // numbers times powers of two #[bench] fn planned32_composite_024576(b: &mut Bencher) { bench_planned_f32(b, 24576); } #[bench] fn planned32_composite_020736(b: &mut Bencher) { bench_planned_f32(b, 20736); } // power of 2 times large prime #[bench] fn planned32_composite_032192(b: &mut Bencher) { bench_planned_f32(b, 32192); } #[bench] fn planned32_composite_024028(b: &mut Bencher) { bench_planned_f32(b, 24028); } // small mixed composites times a large prime #[bench] fn planned32_composite_005472(b: &mut Bencher) { bench_planned_f32(b, 5472); } #[bench] fn planned32_composite_030270(b: &mut Bencher) { bench_planned_f32(b, 30270); } // small mixed composites #[bench] fn planned32_composite_000018(b: &mut Bencher) { bench_planned_f32(b, 00018); } #[bench] fn planned32_composite_000360(b: &mut Bencher) { bench_planned_f32(b, 00360); } #[bench] fn planned32_composite_001200(b: &mut Bencher) { bench_planned_f32(b, 01200); } #[bench] fn planned32_composite_044100(b: &mut Bencher) { bench_planned_f32(b, 44100); } #[bench] fn planned32_composite_048000(b: &mut Bencher) { bench_planned_f32(b, 48000); } #[bench] fn planned32_composite_046656(b: &mut Bencher) { bench_planned_f32(b, 46656); } #[bench] fn planned32_composite_100000(b: &mut Bencher) { bench_planned_f32(b, 100000); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length fn bench_planned_f64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlanner::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn planned64_p2_00000064(b: &mut Bencher) { bench_planned_f64(b, 64); } #[bench] fn planned64_p2_00000128(b: &mut Bencher) { bench_planned_f64(b, 128); } #[bench] fn planned64_p2_00000256(b: &mut Bencher) { bench_planned_f64(b, 256); } #[bench] fn planned64_p2_00000512(b: &mut Bencher) { bench_planned_f64(b, 512); } #[bench] fn planned64_p2_00001024(b: &mut Bencher) { 
bench_planned_f64(b, 1024); } #[bench] fn planned64_p2_00002048(b: &mut Bencher) { bench_planned_f64(b, 2048); } #[bench] fn planned64_p2_00004096(b: &mut Bencher) { bench_planned_f64(b, 4096); } #[bench] fn planned64_p2_00016384(b: &mut Bencher) { bench_planned_f64(b, 16384); } #[bench] fn planned64_p2_00065536(b: &mut Bencher) { bench_planned_f64(b, 65536); } #[bench] fn planned64_p2_01048576(b: &mut Bencher) { bench_planned_f64(b, 1048576); } //#[bench] fn planned64_p2_16777216(b: &mut Bencher) { bench_planned_f64(b, 16777216); } // Powers of 5 #[bench] fn planned64_p5_00125(b: &mut Bencher) { bench_planned_f64(b, 125); } #[bench] fn planned64_p5_00625(b: &mut Bencher) { bench_planned_f64(b, 625); } #[bench] fn planned64_p5_03125(b: &mut Bencher) { bench_planned_f64(b, 3125); } #[bench] fn planned64_p5_15625(b: &mut Bencher) { bench_planned_f64(b, 15625); } #[bench] fn planned64_p7_00343(b: &mut Bencher) { bench_planned_f64(b, 343); } #[bench] fn planned64_p7_02401(b: &mut Bencher) { bench_planned_f64(b, 2401); } #[bench] fn planned64_p7_16807(b: &mut Bencher) { bench_planned_f64(b, 16807); } // Prime lengths #[bench] fn planned64_prime_0005(b: &mut Bencher) { bench_planned_f64(b, 5); } #[bench] fn planned64_prime_0017(b: &mut Bencher) { bench_planned_f64(b, 17); } #[bench] fn planned64_prime_0149(b: &mut Bencher) { bench_planned_f64(b, 149); } #[bench] fn planned64_prime_0151(b: &mut Bencher) { bench_planned_f64(b, 151); } #[bench] fn planned64_prime_0251(b: &mut Bencher) { bench_planned_f64(b, 251); } #[bench] fn planned64_prime_0257(b: &mut Bencher) { bench_planned_f64(b, 257); } #[bench] fn planned64_prime_1009(b: &mut Bencher) { bench_planned_f64(b, 1009); } #[bench] fn planned64_prime_2017(b: &mut Bencher) { bench_planned_f64(b, 2017); } #[bench] fn planned64_prime_2879(b: &mut Bencher) { bench_planned_f64(b, 2879); } #[bench] fn planned64_prime_32767(b: &mut Bencher) { bench_planned_f64(b, 32767); } #[bench] fn planned64_prime_65521(b: &mut Bencher) { bench_planned_f64(b, 65521); } #[bench] fn planned64_prime_65537(b: &mut Bencher) { bench_planned_f64(b, 65537); } #[bench] fn planned64_prime_746483(b: &mut Bencher) { bench_planned_f64(b,746483); } #[bench] fn planned64_prime_746497(b: &mut Bencher) { bench_planned_f64(b,746497); } //primes raised to a power #[bench] fn planned64_primepower_044521(b: &mut Bencher) { bench_planned_f64(b, 44521); } // 211^2 #[bench] fn planned64_primepower_160801(b: &mut Bencher) { bench_planned_f64(b, 160801); } // 401^2 // numbers times powers of two #[bench] fn planned64_composite_024576(b: &mut Bencher) { bench_planned_f64(b, 24576); } #[bench] fn planned64_composite_020736(b: &mut Bencher) { bench_planned_f64(b, 20736); } // power of 2 times large prime #[bench] fn planned64_composite_032192(b: &mut Bencher) { bench_planned_f64(b, 32192); } #[bench] fn planned64_composite_024028(b: &mut Bencher) { bench_planned_f64(b, 24028); } // small mixed composites times a large prime #[bench] fn planned64_composite_030270(b: &mut Bencher) { bench_planned_f64(b, 30270); } // small mixed composites #[bench] fn planned64_composite_000018(b: &mut Bencher) { bench_planned_f64(b, 00018); } #[bench] fn planned64_composite_000360(b: &mut Bencher) { bench_planned_f64(b, 00360); } #[bench] fn planned64_composite_044100(b: &mut Bencher) { bench_planned_f64(b, 44100); } #[bench] fn planned64_composite_048000(b: &mut Bencher) { bench_planned_f64(b, 48000); } #[bench] fn planned64_composite_046656(b: &mut Bencher) { bench_planned_f64(b, 46656); } #[bench] fn 
planned64_composite_100000(b: &mut Bencher) { bench_planned_f64(b, 100000); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to the Good-Thomas algorithm fn bench_good_thomas(b: &mut Bencher, width: usize, height: usize) { let mut planner = rustfft::FftPlanner::new(); let width_fft = planner.plan_fft_forward(width); let height_fft = planner.plan_fft_forward(height); let fft : Arc> = Arc::new(GoodThomasAlgorithm::new(width_fft, height_fft)); let mut buffer = vec![Complex::zero(); width * height]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| {fft.process_with_scratch(&mut buffer, &mut scratch);} ); } #[bench] fn good_thomas_0002_3(b: &mut Bencher) { bench_good_thomas(b, 2, 3); } #[bench] fn good_thomas_0003_4(b: &mut Bencher) { bench_good_thomas(b, 3, 4); } #[bench] fn good_thomas_0004_5(b: &mut Bencher) { bench_good_thomas(b, 4, 5); } #[bench] fn good_thomas_0007_32(b: &mut Bencher) { bench_good_thomas(b, 7, 32); } #[bench] fn good_thomas_0032_27(b: &mut Bencher) { bench_good_thomas(b, 32, 27); } #[bench] fn good_thomas_0256_243(b: &mut Bencher) { bench_good_thomas(b, 256, 243); } #[bench] fn good_thomas_2048_3(b: &mut Bencher) { bench_good_thomas(b, 2048, 3); } #[bench] fn good_thomas_2048_2187(b: &mut Bencher) { bench_good_thomas(b, 2048, 2187); } /// Times just the FFT setup (not execution) /// for a given length, specific to the Good-Thomas algorithm fn bench_good_thomas_setup(b: &mut Bencher, width: usize, height: usize) { let mut planner = rustfft::FftPlanner::new(); let width_fft = planner.plan_fft_forward(width); let height_fft = planner.plan_fft_forward(height); b.iter(|| { let fft : Arc> = Arc::new(GoodThomasAlgorithm::new(Arc::clone(&width_fft), Arc::clone(&height_fft))); test::black_box(fft); }); } #[bench] fn good_thomas_setup_0002_3(b: &mut Bencher) { bench_good_thomas_setup(b, 2, 3); } #[bench] fn good_thomas_setup_0003_4(b: &mut Bencher) { bench_good_thomas_setup(b, 3, 4); } #[bench] fn good_thomas_setup_0004_5(b: &mut Bencher) { bench_good_thomas_setup(b, 4, 5); } #[bench] fn good_thomas_setup_0007_32(b: &mut Bencher) { bench_good_thomas_setup(b, 7, 32); } #[bench] fn good_thomas_setup_0032_27(b: &mut Bencher) { bench_good_thomas_setup(b, 32, 27); } #[bench] fn good_thomas_setup_0256_243(b: &mut Bencher) { bench_good_thomas_setup(b, 256, 243); } #[bench] fn good_thomas_setup_2048_3(b: &mut Bencher) { bench_good_thomas_setup(b, 2048, 3); } #[bench] fn good_thomas_setup_2048_2187(b: &mut Bencher) { bench_good_thomas_setup(b, 2048, 2187); } /// Times just the FFT setup (not execution) /// for a given length, specific to MixedRadix fn bench_mixed_radix_setup(b: &mut Bencher, width: usize, height: usize) { let mut planner = rustfft::FftPlanner::new(); let width_fft = planner.plan_fft_forward(width); let height_fft = planner.plan_fft_forward(height); b.iter(|| { let fft : Arc> = Arc::new(MixedRadix::new(Arc::clone(&width_fft), Arc::clone(&height_fft))); test::black_box(fft); }); } #[bench] fn setup_mixed_radix_0002_3(b: &mut Bencher) { bench_mixed_radix_setup(b, 2, 3); } #[bench] fn setup_mixed_radix_0003_4(b: &mut Bencher) { bench_mixed_radix_setup(b, 3, 4); } #[bench] fn setup_mixed_radix_0004_5(b: &mut Bencher) { bench_mixed_radix_setup(b, 4, 5); } #[bench] fn setup_mixed_radix_0007_32(b: &mut Bencher) { bench_mixed_radix_setup(b, 7, 32); } #[bench] fn setup_mixed_radix_0032_27(b: &mut Bencher) { bench_mixed_radix_setup(b, 32, 27); } #[bench] fn setup_mixed_radix_0256_243(b: &mut 
Bencher) { bench_mixed_radix_setup(b, 256, 243); } #[bench] fn setup_mixed_radix_2048_3(b: &mut Bencher) { bench_mixed_radix_setup(b, 2048, 3); } #[bench] fn setup_mixed_radix_2048_2187(b: &mut Bencher) { bench_mixed_radix_setup(b, 2048, 2187); } /// Times just the FFT setup (not execution) /// for a given length, specific to MixedRadix fn bench_small_mixed_radix_setup(b: &mut Bencher, width: usize, height: usize) { let mut planner = rustfft::FftPlanner::new(); let width_fft = planner.plan_fft_forward(width); let height_fft = planner.plan_fft_forward(height); b.iter(|| { let fft : Arc> = Arc::new(MixedRadixSmall::new(Arc::clone(&width_fft), Arc::clone(&height_fft))); test::black_box(fft); }); } #[bench] fn setup_small_mixed_radix_0002_3(b: &mut Bencher) { bench_small_mixed_radix_setup(b, 2, 3); } #[bench] fn setup_small_mixed_radix_0003_4(b: &mut Bencher) { bench_small_mixed_radix_setup(b, 3, 4); } #[bench] fn setup_small_mixed_radix_0004_5(b: &mut Bencher) { bench_small_mixed_radix_setup(b, 4, 5); } #[bench] fn setup_small_mixed_radix_0007_32(b: &mut Bencher) { bench_small_mixed_radix_setup(b, 7, 32); } #[bench] fn setup_small_mixed_radix_0032_27(b: &mut Bencher) { bench_small_mixed_radix_setup(b, 32, 27); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to the Mixed-Radix algorithm fn bench_mixed_radix(b: &mut Bencher, width: usize, height: usize) { let mut planner = rustfft::FftPlanner::new(); let width_fft = planner.plan_fft_forward(width); let height_fft = planner.plan_fft_forward(height); let fft : Arc> = Arc::new(MixedRadix::new(width_fft, height_fft)); let mut buffer = vec![Complex{re: 0_f32, im: 0_f32}; fft.len()]; let mut scratch = vec![Complex{re: 0_f32, im: 0_f32}; fft.get_inplace_scratch_len()]; b.iter(|| {fft.process_with_scratch(&mut buffer, &mut scratch);} ); } #[bench] fn mixed_radix_0002_3(b: &mut Bencher) { bench_mixed_radix(b, 2, 3); } #[bench] fn mixed_radix_0003_4(b: &mut Bencher) { bench_mixed_radix(b, 3, 4); } #[bench] fn mixed_radix_0004_5(b: &mut Bencher) { bench_mixed_radix(b, 4, 5); } #[bench] fn mixed_radix_0007_32(b: &mut Bencher) { bench_mixed_radix(b, 7, 32); } #[bench] fn mixed_radix_0032_27(b: &mut Bencher) { bench_mixed_radix(b, 32, 27); } #[bench] fn mixed_radix_0256_243(b: &mut Bencher) { bench_mixed_radix(b, 256, 243); } #[bench] fn mixed_radix_2048_3(b: &mut Bencher) { bench_mixed_radix(b, 2048, 3); } #[bench] fn mixed_radix_2048_2187(b: &mut Bencher) { bench_mixed_radix(b, 2048, 2187); } fn plan_butterfly_fft(len: usize) -> Arc> { match len { 2 => Arc::new(Butterfly2::new(FftDirection::Forward)), 3 => Arc::new(Butterfly3::new(FftDirection::Forward)), 4 => Arc::new(Butterfly4::new(FftDirection::Forward)), 5 => Arc::new(Butterfly5::new(FftDirection::Forward)), 6 => Arc::new(Butterfly6::new(FftDirection::Forward)), 7 => Arc::new(Butterfly7::new(FftDirection::Forward)), 8 => Arc::new(Butterfly8::new(FftDirection::Forward)), 16 => Arc::new(Butterfly16::new(FftDirection::Forward)), 32 => Arc::new(Butterfly32::new(FftDirection::Forward)), _ => panic!("Invalid butterfly size: {}", len), } } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to the MixedRadixSmall algorithm fn bench_mixed_radix_small(b: &mut Bencher, width: usize, height: usize) { let width_fft = plan_butterfly_fft(width); let height_fft = plan_butterfly_fft(height); let fft : Arc> = Arc::new(MixedRadixSmall::new(width_fft, height_fft)); let mut signal = vec![Complex{re: 0_f32, im: 
0_f32}; width * height]; let mut spectrum = signal.clone(); b.iter(|| {fft.process_with_scratch(&mut signal, &mut spectrum);} ); } #[bench] fn mixed_radix_small_0002_3(b: &mut Bencher) { bench_mixed_radix_small(b, 2, 3); } #[bench] fn mixed_radix_small_0003_4(b: &mut Bencher) { bench_mixed_radix_small(b, 3, 4); } #[bench] fn mixed_radix_small_0004_5(b: &mut Bencher) { bench_mixed_radix_small(b, 4, 5); } #[bench] fn mixed_radix_small_0007_32(b: &mut Bencher) { bench_mixed_radix_small(b, 7, 32); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to the Mixed-Radix Double Butterfly algorithm fn bench_good_thomas_small(b: &mut Bencher, width: usize, height: usize) { let width_fft = plan_butterfly_fft(width); let height_fft = plan_butterfly_fft(height); let fft : Arc> = Arc::new(GoodThomasAlgorithmSmall::new(width_fft, height_fft)); let mut signal = vec![Complex{re: 0_f32, im: 0_f32}; width * height]; let mut spectrum = signal.clone(); b.iter(|| {fft.process_with_scratch(&mut signal, &mut spectrum);} ); } #[bench] fn good_thomas_small_0002_3(b: &mut Bencher) { bench_good_thomas_small(b, 2, 3); } #[bench] fn good_thomas_small_0003_4(b: &mut Bencher) { bench_good_thomas_small(b, 3, 4); } #[bench] fn good_thomas_small_0004_5(b: &mut Bencher) { bench_good_thomas_small(b, 4, 5); } #[bench] fn good_thomas_small_0007_32(b: &mut Bencher) { bench_good_thomas_small(b, 7, 32); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to Rader's algorithm fn bench_raders_scalar(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlanner::new(); let inner_fft = planner.plan_fft_forward(len - 1); let fft : Arc> = Arc::new(RadersAlgorithm::new(inner_fft)); let mut buffer = vec![Complex{re: 0_f32, im: 0_f32}; len]; let mut scratch = vec![Complex{re: 0_f32, im: 0_f32}; fft.get_inplace_scratch_len()]; b.iter(|| {fft.process_with_scratch(&mut buffer, &mut scratch);} ); } #[bench] fn raders_fft_scalar_prime_0005(b: &mut Bencher) { bench_raders_scalar(b, 5); } #[bench] fn raders_fft_scalar_prime_0017(b: &mut Bencher) { bench_raders_scalar(b, 17); } #[bench] fn raders_fft_scalar_prime_0149(b: &mut Bencher) { bench_raders_scalar(b, 149); } #[bench] fn raders_fft_scalar_prime_0151(b: &mut Bencher) { bench_raders_scalar(b, 151); } #[bench] fn raders_fft_scalar_prime_0251(b: &mut Bencher) { bench_raders_scalar(b, 251); } #[bench] fn raders_fft_scalar_prime_0257(b: &mut Bencher) { bench_raders_scalar(b, 257); } #[bench] fn raders_fft_scalar_prime_1009(b: &mut Bencher) { bench_raders_scalar(b, 1009); } #[bench] fn raders_fft_scalar_prime_2017(b: &mut Bencher) { bench_raders_scalar(b, 2017); } #[bench] fn raders_fft_scalar_prime_12289(b: &mut Bencher) { bench_raders_scalar(b, 12289); } #[bench] fn raders_fft_scalar_prime_18433(b: &mut Bencher) { bench_raders_scalar(b, 18433); } #[bench] fn raders_fft_scalar_prime_65521(b: &mut Bencher) { bench_raders_scalar(b, 65521); } #[bench] fn raders_fft_scalar_prime_65537(b: &mut Bencher) { bench_raders_scalar(b, 65537); } #[bench] fn raders_fft_scalar_prime_746483(b: &mut Bencher) { bench_raders_scalar(b,746483); } #[bench] fn raders_fft_scalar_prime_746497(b: &mut Bencher) { bench_raders_scalar(b,746497); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to Bluestein's Algorithm fn bench_bluesteins_scalar_prime(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlanner::new(); let inner_fft = 
planner.plan_fft_forward((len * 2 - 1).checked_next_power_of_two().unwrap()); let fft : Arc> = Arc::new(BluesteinsAlgorithm::new(len, inner_fft)); let mut buffer = vec![Zero::zero(); len]; let mut scratch = vec![Zero::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch);} ); } #[bench] fn bench_bluesteins_scalar_prime_0005(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 5); } #[bench] fn bench_bluesteins_scalar_prime_0017(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 17); } #[bench] fn bench_bluesteins_scalar_prime_0149(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 149); } #[bench] fn bench_bluesteins_scalar_prime_0151(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 151); } #[bench] fn bench_bluesteins_scalar_prime_0251(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 251); } #[bench] fn bench_bluesteins_scalar_prime_0257(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 257); } #[bench] fn bench_bluesteins_scalar_prime_1009(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 1009); } #[bench] fn bench_bluesteins_scalar_prime_2017(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 2017); } #[bench] fn bench_bluesteins_scalar_prime_32767(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 32767); } #[bench] fn bench_bluesteins_scalar_prime_65521(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 65521); } #[bench] fn bench_bluesteins_scalar_prime_65537(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 65537); } #[bench] fn bench_bluesteins_scalar_prime_746483(b: &mut Bencher) { bench_bluesteins_scalar_prime(b,746483); } #[bench] fn bench_bluesteins_scalar_prime_746497(b: &mut Bencher) { bench_bluesteins_scalar_prime(b,746497); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to Rader's algorithm fn bench_radix4(b: &mut Bencher, len: usize) { assert!(len % 4 == 0); let fft = Radix4::new(len, FftDirection::Forward); let mut signal = vec![Complex{re: 0_f32, im: 0_f32}; len]; let mut spectrum = signal.clone(); b.iter(|| {fft.process_outofplace_with_scratch(&mut signal, &mut spectrum, &mut []);} ); } #[bench] fn radix4_______64(b: &mut Bencher) { bench_radix4(b, 64); } #[bench] fn radix4______256(b: &mut Bencher) { bench_radix4(b, 256); } #[bench] fn radix4_____1024(b: &mut Bencher) { bench_radix4(b, 1024); } #[bench] fn radix4____65536(b: &mut Bencher) { bench_radix4(b, 65536); } #[bench] fn radix4__1048576(b: &mut Bencher) { bench_radix4(b, 1048576); } //#[bench] fn radix4_16777216(b: &mut Bencher) { bench_radix4(b, 16777216); } fn get_mixed_radix_power2(len: usize) -> Arc> { match len { 8 => Arc::new(Butterfly8::new( FftDirection::Forward)), 16 => Arc::new(Butterfly16::new(FftDirection::Forward)), 32 => Arc::new(Butterfly32::new(FftDirection::Forward)), _ => { let zeroes = len.trailing_zeros(); assert!(zeroes % 2 == 0); let half_zeroes = zeroes / 2; let inner = get_mixed_radix_power2(1 << half_zeroes); Arc::new(MixedRadix::new(Arc::clone(&inner), inner)) } } } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to Rader's algorithm fn bench_mixed_radix_power2(b: &mut Bencher, len: usize) { let fft = get_mixed_radix_power2(len); let mut buffer = vec![Zero::zero(); len]; let mut scratch = vec![Zero::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn mixed_radix_power2__00000256(b: &mut Bencher) { bench_mixed_radix_power2(b, 256); } #[bench] 
fn mixed_radix_power2__00001024(b: &mut Bencher) { bench_mixed_radix_power2(b, 1024); } #[bench] fn mixed_radix_power2__00004096(b: &mut Bencher) { bench_mixed_radix_power2(b, 4096); } #[bench] fn mixed_radix_power2__00065536(b: &mut Bencher) { bench_mixed_radix_power2(b, 65536); } #[bench] fn mixed_radix_power2__01048576(b: &mut Bencher) { bench_mixed_radix_power2(b, 1048576); } #[bench] fn mixed_radix_power2__16777216(b: &mut Bencher) { bench_mixed_radix_power2(b, 16777216); } fn get_mixed_radix_inline_power2(len: usize) -> Arc> { match len { 8 => Arc::new(Butterfly8::new( FftDirection::Forward)), 16 => Arc::new(Butterfly16::new(FftDirection::Forward)), 32 => Arc::new(Butterfly32::new(FftDirection::Forward)), _ => { let zeroes = len.trailing_zeros(); assert!(zeroes % 2 == 0); let half_zeroes = zeroes / 2; let inner = get_mixed_radix_inline_power2(1 << half_zeroes); Arc::new(MixedRadix::new(Arc::clone(&inner), inner)) } } } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to Rader's algorithm fn bench_mixed_radix_inline_power2(b: &mut Bencher, len: usize) { let fft = get_mixed_radix_inline_power2(len); let mut buffer = vec![Zero::zero(); len]; let mut scratch = vec![Zero::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn mixed_radix_power2_inline__00000256(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 256); } #[bench] fn mixed_radix_power2_inline__00001024(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 1024); } #[bench] fn mixed_radix_power2_inline__00004096(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 4096); } #[bench] fn mixed_radix_power2_inline__00065536(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 65536); } #[bench] fn mixed_radix_power2_inline__01048576(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 1048576); } #[bench] fn mixed_radix_power2_inline__16777216(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 16777216); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length fn bench_butterfly32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlanner::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len * 10]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn butterfly32_02(b: &mut Bencher) { bench_butterfly32(b, 2); } #[bench] fn butterfly32_03(b: &mut Bencher) { bench_butterfly32(b, 3); } #[bench] fn butterfly32_04(b: &mut Bencher) { bench_butterfly32(b, 4); } #[bench] fn butterfly32_05(b: &mut Bencher) { bench_butterfly32(b, 5); } #[bench] fn butterfly32_06(b: &mut Bencher) { bench_butterfly32(b, 6); } #[bench] fn butterfly32_07(b: &mut Bencher) { bench_butterfly32(b, 7); } #[bench] fn butterfly32_08(b: &mut Bencher) { bench_butterfly32(b, 8); } #[bench] fn butterfly32_09(b: &mut Bencher) { bench_butterfly32(b, 9); } #[bench] fn butterfly32_11(b: &mut Bencher) { bench_butterfly32(b, 11); } #[bench] fn butterfly32_12(b: &mut Bencher) { bench_butterfly32(b, 12); } #[bench] fn butterfly32_16(b: &mut Bencher) { bench_butterfly32(b, 16); } #[bench] fn butterfly32_24(b: &mut Bencher) { bench_butterfly32(b, 24); } #[bench] fn butterfly32_27(b: &mut Bencher) { bench_butterfly32(b, 27); } #[bench] fn butterfly32_32(b: &mut Bencher) { bench_butterfly32(b, 32); } #[bench] fn butterfly32_36(b: &mut Bencher) { 
bench_butterfly32(b, 36); } #[bench] fn butterfly32_48(b: &mut Bencher) { bench_butterfly32(b, 48); } #[bench] fn butterfly32_54(b: &mut Bencher) { bench_butterfly32(b, 54); } #[bench] fn butterfly32_64(b: &mut Bencher) { bench_butterfly32(b, 64); } #[bench] fn butterfly32_72(b: &mut Bencher) { bench_butterfly32(b, 72); } #[bench] fn butterfly32_128(b: &mut Bencher) { bench_butterfly32(b, 128); } #[bench] fn butterfly32_256(b: &mut Bencher) { bench_butterfly32(b, 256); } #[bench] fn butterfly32_512(b: &mut Bencher) { bench_butterfly32(b, 512); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length fn bench_butterfly64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlanner::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len * 10]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn butterfly64_02(b: &mut Bencher) { bench_butterfly64(b, 2); } #[bench] fn butterfly64_03(b: &mut Bencher) { bench_butterfly64(b, 3); } #[bench] fn butterfly64_04(b: &mut Bencher) { bench_butterfly64(b, 4); } #[bench] fn butterfly64_05(b: &mut Bencher) { bench_butterfly64(b, 5); } #[bench] fn butterfly64_06(b: &mut Bencher) { bench_butterfly64(b, 6); } #[bench] fn butterfly64_07(b: &mut Bencher) { bench_butterfly64(b, 7); } #[bench] fn butterfly64_08(b: &mut Bencher) { bench_butterfly64(b, 8); } #[bench] fn butterfly64_09(b: &mut Bencher) { bench_butterfly64(b, 9); } #[bench] fn butterfly64_11(b: &mut Bencher) { bench_butterfly64(b, 11); } #[bench] fn butterfly64_12(b: &mut Bencher) { bench_butterfly64(b, 12); } #[bench] fn butterfly64_16(b: &mut Bencher) { bench_butterfly64(b, 16); } #[bench] fn butterfly64_18(b: &mut Bencher) { bench_butterfly64(b, 18); } #[bench] fn butterfly64_24(b: &mut Bencher) { bench_butterfly64(b, 24); } #[bench] fn butterfly64_27(b: &mut Bencher) { bench_butterfly64(b, 27); } #[bench] fn butterfly64_32(b: &mut Bencher) { bench_butterfly64(b, 32); } #[bench] fn butterfly64_36(b: &mut Bencher) { bench_butterfly64(b, 36); } #[bench] fn butterfly64_64(b: &mut Bencher) { bench_butterfly64(b, 64); } #[bench] fn butterfly64_128(b: &mut Bencher) { bench_butterfly64(b, 128); } #[bench] fn butterfly64_256(b: &mut Bencher) { bench_butterfly64(b, 256); } #[bench] fn butterfly64_512(b: &mut Bencher) { bench_butterfly64(b, 512); } rustfft-6.2.0/benches/bench_rustfft_neon.rs000064400000000000000000000245640072674642500172210ustar 00000000000000#![feature(test)] extern crate rustfft; extern crate test; use rustfft::num_complex::Complex; use rustfft::num_traits::Zero; use rustfft::Fft; use std::sync::Arc; use test::Bencher; /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length fn bench_planned_f32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerNeon::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); assert_eq!(fft.len(), len); let mut buffer = vec![Complex::zero(); len]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } fn bench_planned_f64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerNeon::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); assert_eq!(fft.len(), len); let mut buffer = vec![Complex::zero(); len]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { 
fft.process_with_scratch(&mut buffer, &mut scratch); }); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length. /// Run the fft on a 10*len vector, similar to how the butterflies are often used. fn bench_planned_multi_f32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerNeon::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len * 10]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } fn bench_planned_multi_f64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerNeon::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len * 10]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // All butterflies #[bench] fn neon_butterfly32_02(b: &mut Bencher) { bench_planned_multi_f32(b, 2);} #[bench] fn neon_butterfly32_03(b: &mut Bencher) { bench_planned_multi_f32(b, 3);} #[bench] fn neon_butterfly32_04(b: &mut Bencher) { bench_planned_multi_f32(b, 4);} #[bench] fn neon_butterfly32_05(b: &mut Bencher) { bench_planned_multi_f32(b, 5);} #[bench] fn neon_butterfly32_06(b: &mut Bencher) { bench_planned_multi_f32(b, 6);} #[bench] fn neon_butterfly32_07(b: &mut Bencher) { bench_planned_multi_f32(b, 7);} #[bench] fn neon_butterfly32_08(b: &mut Bencher) { bench_planned_multi_f32(b, 8);} #[bench] fn neon_butterfly32_09(b: &mut Bencher) { bench_planned_multi_f32(b, 9);} #[bench] fn neon_butterfly32_10(b: &mut Bencher) { bench_planned_multi_f32(b, 10);} #[bench] fn neon_butterfly32_11(b: &mut Bencher) { bench_planned_multi_f32(b, 11);} #[bench] fn neon_butterfly32_12(b: &mut Bencher) { bench_planned_multi_f32(b, 12);} #[bench] fn neon_butterfly32_13(b: &mut Bencher) { bench_planned_multi_f32(b, 13);} #[bench] fn neon_butterfly32_15(b: &mut Bencher) { bench_planned_multi_f32(b, 15);} #[bench] fn neon_butterfly32_16(b: &mut Bencher) { bench_planned_multi_f32(b, 16);} #[bench] fn neon_butterfly32_17(b: &mut Bencher) { bench_planned_multi_f32(b, 17);} #[bench] fn neon_butterfly32_19(b: &mut Bencher) { bench_planned_multi_f32(b, 19);} #[bench] fn neon_butterfly32_23(b: &mut Bencher) { bench_planned_multi_f32(b, 23);} #[bench] fn neon_butterfly32_29(b: &mut Bencher) { bench_planned_multi_f32(b, 29);} #[bench] fn neon_butterfly32_31(b: &mut Bencher) { bench_planned_multi_f32(b, 31);} #[bench] fn neon_butterfly32_32(b: &mut Bencher) { bench_planned_multi_f32(b, 32);} #[bench] fn neon_butterfly64_02(b: &mut Bencher) { bench_planned_multi_f64(b, 2);} #[bench] fn neon_butterfly64_03(b: &mut Bencher) { bench_planned_multi_f64(b, 3);} #[bench] fn neon_butterfly64_04(b: &mut Bencher) { bench_planned_multi_f64(b, 4);} #[bench] fn neon_butterfly64_05(b: &mut Bencher) { bench_planned_multi_f64(b, 5);} #[bench] fn neon_butterfly64_06(b: &mut Bencher) { bench_planned_multi_f64(b, 6);} #[bench] fn neon_butterfly64_07(b: &mut Bencher) { bench_planned_multi_f64(b, 7);} #[bench] fn neon_butterfly64_08(b: &mut Bencher) { bench_planned_multi_f64(b, 8);} #[bench] fn neon_butterfly64_09(b: &mut Bencher) { bench_planned_multi_f64(b, 9);} #[bench] fn neon_butterfly64_10(b: &mut Bencher) { bench_planned_multi_f64(b, 10);} #[bench] fn neon_butterfly64_11(b: &mut Bencher) { bench_planned_multi_f64(b, 11);} #[bench] fn neon_butterfly64_12(b: &mut Bencher) { bench_planned_multi_f64(b, 
12);} #[bench] fn neon_butterfly64_13(b: &mut Bencher) { bench_planned_multi_f64(b, 13);} #[bench] fn neon_butterfly64_15(b: &mut Bencher) { bench_planned_multi_f64(b, 15);} #[bench] fn neon_butterfly64_16(b: &mut Bencher) { bench_planned_multi_f64(b, 16);} #[bench] fn neon_butterfly64_17(b: &mut Bencher) { bench_planned_multi_f64(b, 17);} #[bench] fn neon_butterfly64_19(b: &mut Bencher) { bench_planned_multi_f64(b, 19);} #[bench] fn neon_butterfly64_23(b: &mut Bencher) { bench_planned_multi_f64(b, 23);} #[bench] fn neon_butterfly64_29(b: &mut Bencher) { bench_planned_multi_f64(b, 29);} #[bench] fn neon_butterfly64_31(b: &mut Bencher) { bench_planned_multi_f64(b, 31);} #[bench] fn neon_butterfly64_32(b: &mut Bencher) { bench_planned_multi_f64(b, 32);} // Powers of 2 #[bench] fn neon_planned32_p2_00000064(b: &mut Bencher) { bench_planned_f32(b, 64); } #[bench] fn neon_planned32_p2_00000128(b: &mut Bencher) { bench_planned_f32(b, 128); } #[bench] fn neon_planned32_p2_00000256(b: &mut Bencher) { bench_planned_f32(b, 256); } #[bench] fn neon_planned32_p2_00000512(b: &mut Bencher) { bench_planned_f32(b, 512); } #[bench] fn neon_planned32_p2_00001024(b: &mut Bencher) { bench_planned_f32(b, 1024); } #[bench] fn neon_planned32_p2_00002048(b: &mut Bencher) { bench_planned_f32(b, 2048); } #[bench] fn neon_planned32_p2_00004096(b: &mut Bencher) { bench_planned_f32(b, 4096); } #[bench] fn neon_planned32_p2_00016384(b: &mut Bencher) { bench_planned_f32(b, 16384); } #[bench] fn neon_planned32_p2_00065536(b: &mut Bencher) { bench_planned_f32(b, 65536); } #[bench] fn neon_planned32_p2_01048576(b: &mut Bencher) { bench_planned_f32(b, 1048576); } #[bench] fn neon_planned64_p2_00000064(b: &mut Bencher) { bench_planned_f64(b, 64); } #[bench] fn neon_planned64_p2_00000128(b: &mut Bencher) { bench_planned_f64(b, 128); } #[bench] fn neon_planned64_p2_00000256(b: &mut Bencher) { bench_planned_f64(b, 256); } #[bench] fn neon_planned64_p2_00000512(b: &mut Bencher) { bench_planned_f64(b, 512); } #[bench] fn neon_planned64_p2_00001024(b: &mut Bencher) { bench_planned_f64(b, 1024); } #[bench] fn neon_planned64_p2_00002048(b: &mut Bencher) { bench_planned_f64(b, 2048); } #[bench] fn neon_planned64_p2_00004096(b: &mut Bencher) { bench_planned_f64(b, 4096); } #[bench] fn neon_planned64_p2_00016384(b: &mut Bencher) { bench_planned_f64(b, 16384); } #[bench] fn neon_planned64_p2_00065536(b: &mut Bencher) { bench_planned_f64(b, 65536); } #[bench] fn neon_planned64_p2_01048576(b: &mut Bencher) { bench_planned_f64(b, 1048576); } // Powers of 7 #[bench] fn neon_planned32_p7_00343(b: &mut Bencher) { bench_planned_f32(b, 343); } #[bench] fn neon_planned32_p7_02401(b: &mut Bencher) { bench_planned_f32(b, 2401); } #[bench] fn neon_planned32_p7_16807(b: &mut Bencher) { bench_planned_f32(b, 16807); } #[bench] fn neon_planned64_p7_00343(b: &mut Bencher) { bench_planned_f64(b, 343); } #[bench] fn neon_planned64_p7_02401(b: &mut Bencher) { bench_planned_f64(b, 2401); } #[bench] fn neon_planned64_p7_16807(b: &mut Bencher) { bench_planned_f64(b, 16807); } // Prime lengths #[bench] fn neon_planned32_prime_0149(b: &mut Bencher) { bench_planned_f32(b, 149); } #[bench] fn neon_planned32_prime_0151(b: &mut Bencher) { bench_planned_f32(b, 151); } #[bench] fn neon_planned32_prime_0251(b: &mut Bencher) { bench_planned_f32(b, 251); } #[bench] fn neon_planned32_prime_0257(b: &mut Bencher) { bench_planned_f32(b, 257); } #[bench] fn neon_planned32_prime_2017(b: &mut Bencher) { bench_planned_f32(b, 2017); } #[bench] fn neon_planned32_prime_2879(b: 
&mut Bencher) { bench_planned_f32(b, 2879); } #[bench] fn neon_planned32_prime_65521(b: &mut Bencher) { bench_planned_f32(b, 65521); } #[bench] fn neon_planned32_prime_746497(b: &mut Bencher) { bench_planned_f32(b,746497); } #[bench] fn neon_planned64_prime_0149(b: &mut Bencher) { bench_planned_f64(b, 149); } #[bench] fn neon_planned64_prime_0151(b: &mut Bencher) { bench_planned_f64(b, 151); } #[bench] fn neon_planned64_prime_0251(b: &mut Bencher) { bench_planned_f64(b, 251); } #[bench] fn neon_planned64_prime_0257(b: &mut Bencher) { bench_planned_f64(b, 257); } #[bench] fn neon_planned64_prime_2017(b: &mut Bencher) { bench_planned_f64(b, 2017); } #[bench] fn neon_planned64_prime_2879(b: &mut Bencher) { bench_planned_f64(b, 2879); } #[bench] fn neon_planned64_prime_65521(b: &mut Bencher) { bench_planned_f64(b, 65521); } #[bench] fn neon_planned64_prime_746497(b: &mut Bencher) { bench_planned_f64(b,746497); } // small mixed composites #[bench] fn neon_planned32_composite_000018(b: &mut Bencher) { bench_planned_f32(b, 00018); } #[bench] fn neon_planned32_composite_000360(b: &mut Bencher) { bench_planned_f32(b, 00360); } #[bench] fn neon_planned32_composite_001200(b: &mut Bencher) { bench_planned_f32(b, 01200); } #[bench] fn neon_planned32_composite_044100(b: &mut Bencher) { bench_planned_f32(b, 44100); } #[bench] fn neon_planned32_composite_048000(b: &mut Bencher) { bench_planned_f32(b, 48000); } #[bench] fn neon_planned32_composite_046656(b: &mut Bencher) { bench_planned_f32(b, 46656); } #[bench] fn neon_planned64_composite_000018(b: &mut Bencher) { bench_planned_f64(b, 00018); } #[bench] fn neon_planned64_composite_000360(b: &mut Bencher) { bench_planned_f64(b, 00360); } #[bench] fn neon_planned64_composite_001200(b: &mut Bencher) { bench_planned_f64(b, 01200); } #[bench] fn neon_planned64_composite_044100(b: &mut Bencher) { bench_planned_f64(b, 44100); } #[bench] fn neon_planned64_composite_048000(b: &mut Bencher) { bench_planned_f64(b, 48000); } #[bench] fn neon_planned64_composite_046656(b: &mut Bencher) { bench_planned_f64(b, 46656); } rustfft-6.2.0/benches/bench_rustfft_scalar.rs000064400000000000000000001006560072674642500175240ustar 00000000000000#![allow(bare_trait_objects)] #![allow(non_snake_case)] #![feature(test)] extern crate rustfft; extern crate test; use rustfft::algorithm::butterflies::*; use rustfft::algorithm::*; use rustfft::num_complex::Complex; use rustfft::num_traits::Zero; use rustfft::FftPlannerScalar; use rustfft::{Direction, Fft, FftDirection, FftNum, Length}; use std::sync::Arc; use test::Bencher; struct Noop { len: usize, direction: FftDirection, } impl Fft for Noop { fn process_with_scratch(&self, _buffer: &mut [Complex], _scratch: &mut [Complex]) {} fn process_outofplace_with_scratch( &self, _input: &mut [Complex], _output: &mut [Complex], _scratch: &mut [Complex], ) { } fn get_inplace_scratch_len(&self) -> usize { self.len } fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for Noop { fn len(&self) -> usize { self.len } } impl Direction for Noop { fn fft_direction(&self) -> FftDirection { self.direction } } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length fn bench_planned_f32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = planner.plan_fft_forward(len); assert_eq!(fft.len(), len); let mut buffer = vec![Complex::zero(); len]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut 
scratch); }); } // Powers of 4 #[bench] fn planned32_p2_00000064(b: &mut Bencher) { bench_planned_f32(b, 64); } #[bench] fn planned32_p2_00000128(b: &mut Bencher) { bench_planned_f32(b, 128); } #[bench] fn planned32_p2_00000256(b: &mut Bencher) { bench_planned_f32(b, 256); } #[bench] fn planned32_p2_00000512(b: &mut Bencher) { bench_planned_f32(b, 512); } #[bench] fn planned32_p2_00001024(b: &mut Bencher) { bench_planned_f32(b, 1024); } #[bench] fn planned32_p2_00002048(b: &mut Bencher) { bench_planned_f32(b, 2048); } #[bench] fn planned32_p2_00004096(b: &mut Bencher) { bench_planned_f32(b, 4096); } #[bench] fn planned32_p2_00016384(b: &mut Bencher) { bench_planned_f32(b, 16384); } #[bench] fn planned32_p2_00065536(b: &mut Bencher) { bench_planned_f32(b, 65536); } #[bench] fn planned32_p2_01048576(b: &mut Bencher) { bench_planned_f32(b, 1048576); } //#[bench] fn planned32_p2_16777216(b: &mut Bencher) { bench_planned_f32(b, 16777216); } // Powers of 5 //#[bench] fn planned32_p5_00125(b: &mut Bencher) { bench_planned_f32(b, 125); } //#[bench] fn planned32_p5_00625(b: &mut Bencher) { bench_planned_f32(b, 625); } //#[bench] fn planned32_p5_03125(b: &mut Bencher) { bench_planned_f32(b, 3125); } //#[bench] fn planned32_p5_15625(b: &mut Bencher) { bench_planned_f32(b, 15625); } // Powers of 7 //#[bench] fn planned32_p7_00343(b: &mut Bencher) { bench_planned_f32(b, 343); } //#[bench] fn planned32_p7_02401(b: &mut Bencher) { bench_planned_f32(b, 2401); } //#[bench] fn planned32_p7_16807(b: &mut Bencher) { bench_planned_f32(b, 16807); } // Prime lengths // Prime lengths //#[bench] fn planned32_prime_0005(b: &mut Bencher) { bench_planned_f32(b, 5); } //#[bench] fn planned32_prime_0017(b: &mut Bencher) { bench_planned_f32(b, 17); } //#[bench] fn planned32_prime_0149(b: &mut Bencher) { bench_planned_f32(b, 149); } //#[bench] fn planned32_prime_0151(b: &mut Bencher) { bench_planned_f32(b, 151); } //#[bench] fn planned32_prime_0251(b: &mut Bencher) { bench_planned_f32(b, 251); } //#[bench] fn planned32_prime_0257(b: &mut Bencher) { bench_planned_f32(b, 257); } //#[bench] fn planned32_prime_1009(b: &mut Bencher) { bench_planned_f32(b, 1009); } //#[bench] fn planned32_prime_1201(b: &mut Bencher) { bench_planned_f32(b, 1201); } //#[bench] fn planned32_prime_2017(b: &mut Bencher) { bench_planned_f32(b, 2017); } //#[bench] fn planned32_prime_2879(b: &mut Bencher) { bench_planned_f32(b, 2879); } //#[bench] fn planned32_prime_32767(b: &mut Bencher) { bench_planned_f32(b, 32767); } //#[bench] fn planned32_prime_65521(b: &mut Bencher) { bench_planned_f32(b, 65521); } //#[bench] fn planned32_prime_65537(b: &mut Bencher) { bench_planned_f32(b, 65537); } //#[bench] fn planned32_prime_746483(b: &mut Bencher) { bench_planned_f32(b,746483); } //#[bench] fn planned32_prime_746497(b: &mut Bencher) { bench_planned_f32(b,746497); } //primes raised to a power //#[bench] fn planned32_primepower_044521(b: &mut Bencher) { bench_planned_f32(b, 44521); } // 211^2 //#[bench] fn planned32_primepower_160801(b: &mut Bencher) { bench_planned_f32(b, 160801); } // 401^2 // numbers times powers of two //#[bench] fn planned32_composite_024576(b: &mut Bencher) { bench_planned_f32(b, 24576); } //#[bench] fn planned32_composite_020736(b: &mut Bencher) { bench_planned_f32(b, 20736); } // power of 2 times large prime //#[bench] fn planned32_composite_032192(b: &mut Bencher) { bench_planned_f32(b, 32192); } //#[bench] fn planned32_composite_024028(b: &mut Bencher) { bench_planned_f32(b, 24028); } // small mixed composites times a large prime 
//#[bench] fn planned32_composite_005472(b: &mut Bencher) { bench_planned_f32(b, 5472); } //#[bench] fn planned32_composite_030270(b: &mut Bencher) { bench_planned_f32(b, 30270); } // small mixed composites //#[bench] fn planned32_composite_000018(b: &mut Bencher) { bench_planned_f32(b, 00018); } //#[bench] fn planned32_composite_000360(b: &mut Bencher) { bench_planned_f32(b, 00360); } //#[bench] fn planned32_composite_001200(b: &mut Bencher) { bench_planned_f32(b, 01200); } //#[bench] fn planned32_composite_044100(b: &mut Bencher) { bench_planned_f32(b, 44100); } //#[bench] fn planned32_composite_048000(b: &mut Bencher) { bench_planned_f32(b, 48000); } //#[bench] fn planned32_composite_046656(b: &mut Bencher) { bench_planned_f32(b, 46656); } //#[bench] fn planned32_composite_100000(b: &mut Bencher) { bench_planned_f32(b, 100000); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length fn bench_planned_f64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn planned64_p2_00000064(b: &mut Bencher) { bench_planned_f64(b, 64); } #[bench] fn planned64_p2_00000128(b: &mut Bencher) { bench_planned_f64(b, 128); } #[bench] fn planned64_p2_00000256(b: &mut Bencher) { bench_planned_f64(b, 256); } #[bench] fn planned64_p2_00000512(b: &mut Bencher) { bench_planned_f64(b, 512); } #[bench] fn planned64_p2_00001024(b: &mut Bencher) { bench_planned_f64(b, 1024); } #[bench] fn planned64_p2_00002048(b: &mut Bencher) { bench_planned_f64(b, 2048); } #[bench] fn planned64_p2_00004096(b: &mut Bencher) { bench_planned_f64(b, 4096); } #[bench] fn planned64_p2_00016384(b: &mut Bencher) { bench_planned_f64(b, 16384); } #[bench] fn planned64_p2_00065536(b: &mut Bencher) { bench_planned_f64(b, 65536); } #[bench] fn planned64_p2_01048576(b: &mut Bencher) { bench_planned_f64(b, 1048576); } //#[bench] fn planned64_p2_16777216(b: &mut Bencher) { bench_planned_f64(b, 16777216); } // Powers of 5 //#[bench] fn planned64_p5_00125(b: &mut Bencher) { bench_planned_f64(b, 125); } //#[bench] fn planned64_p5_00625(b: &mut Bencher) { bench_planned_f64(b, 625); } //#[bench] fn planned64_p5_03125(b: &mut Bencher) { bench_planned_f64(b, 3125); } //#[bench] fn planned64_p5_15625(b: &mut Bencher) { bench_planned_f64(b, 15625); } //#[bench] fn planned64_p7_00343(b: &mut Bencher) { bench_planned_f64(b, 343); } //#[bench] fn planned64_p7_02401(b: &mut Bencher) { bench_planned_f64(b, 2401); } //#[bench] fn planned64_p7_16807(b: &mut Bencher) { bench_planned_f64(b, 16807); } // Prime lengths //#[bench] fn planned64_prime_0005(b: &mut Bencher) { bench_planned_f64(b, 5); } //#[bench] fn planned64_prime_0017(b: &mut Bencher) { bench_planned_f64(b, 17); } //#[bench] fn planned64_prime_0149(b: &mut Bencher) { bench_planned_f64(b, 149); } //#[bench] fn planned64_prime_0151(b: &mut Bencher) { bench_planned_f64(b, 151); } //#[bench] fn planned64_prime_0251(b: &mut Bencher) { bench_planned_f64(b, 251); } //#[bench] fn planned64_prime_0257(b: &mut Bencher) { bench_planned_f64(b, 257); } //#[bench] fn planned64_prime_1009(b: &mut Bencher) { bench_planned_f64(b, 1009); } //#[bench] fn planned64_prime_2017(b: &mut Bencher) { bench_planned_f64(b, 2017); } //#[bench] fn planned64_prime_2879(b: &mut Bencher) { bench_planned_f64(b, 2879); } 
//#[bench] fn planned64_prime_32767(b: &mut Bencher) { bench_planned_f64(b, 32767); } //#[bench] fn planned64_prime_65521(b: &mut Bencher) { bench_planned_f64(b, 65521); } //#[bench] fn planned64_prime_65537(b: &mut Bencher) { bench_planned_f64(b, 65537); } //#[bench] fn planned64_prime_746483(b: &mut Bencher) { bench_planned_f64(b,746483); } //#[bench] fn planned64_prime_746497(b: &mut Bencher) { bench_planned_f64(b,746497); } //primes raised to a power //#[bench] fn planned64_primepower_044521(b: &mut Bencher) { bench_planned_f64(b, 44521); } // 211^2 //#[bench] fn planned64_primepower_160801(b: &mut Bencher) { bench_planned_f64(b, 160801); } // 401^2 // numbers times powers of two //#[bench] fn planned64_composite_024576(b: &mut Bencher) { bench_planned_f64(b, 24576); } //#[bench] fn planned64_composite_020736(b: &mut Bencher) { bench_planned_f64(b, 20736); } // power of 2 times large prime //#[bench] fn planned64_composite_032192(b: &mut Bencher) { bench_planned_f64(b, 32192); } //#[bench] fn planned64_composite_024028(b: &mut Bencher) { bench_planned_f64(b, 24028); } // small mixed composites times a large prime //#[bench] fn planned64_composite_030270(b: &mut Bencher) { bench_planned_f64(b, 30270); } // small mixed composites //#[bench] fn planned64_composite_000018(b: &mut Bencher) { bench_planned_f64(b, 00018); } //#[bench] fn planned64_composite_000360(b: &mut Bencher) { bench_planned_f64(b, 00360); } //#[bench] fn planned64_composite_044100(b: &mut Bencher) { bench_planned_f64(b, 44100); } //#[bench] fn planned64_composite_048000(b: &mut Bencher) { bench_planned_f64(b, 48000); } //#[bench] fn planned64_composite_046656(b: &mut Bencher) { bench_planned_f64(b, 46656); } //#[bench] fn planned64_composite_100000(b: &mut Bencher) { bench_planned_f64(b, 100000); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to the Good-Thomas algorithm fn bench_good_thomas(b: &mut Bencher, width: usize, height: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let width_fft = planner.plan_fft_forward(width); let height_fft = planner.plan_fft_forward(height); let fft: Arc> = Arc::new(GoodThomasAlgorithm::new(width_fft, height_fft)); let mut buffer = vec![Complex::zero(); width * height]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn good_thomas_0002_3(b: &mut Bencher) { bench_good_thomas(b, 2, 3); } #[bench] fn good_thomas_0003_4(b: &mut Bencher) { bench_good_thomas(b, 3, 4); } #[bench] fn good_thomas_0004_5(b: &mut Bencher) { bench_good_thomas(b, 4, 5); } #[bench] fn good_thomas_0007_32(b: &mut Bencher) { bench_good_thomas(b, 7, 32); } #[bench] fn good_thomas_0032_27(b: &mut Bencher) { bench_good_thomas(b, 32, 27); } //#[bench] fn good_thomas_0256_243(b: &mut Bencher) { bench_good_thomas(b, 256, 243); } //#[bench] fn good_thomas_2048_3(b: &mut Bencher) { bench_good_thomas(b, 2048, 3); } //#[bench] fn good_thomas_2048_2187(b: &mut Bencher) { bench_good_thomas(b, 2048, 2187); } /// Times just the FFT setup (not execution) /// for a given length, specific to the Good-Thomas algorithm fn bench_good_thomas_setup(b: &mut Bencher, width: usize, height: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let width_fft = planner.plan_fft_forward(width); let height_fft = planner.plan_fft_forward(height); b.iter(|| { let fft: Arc> = Arc::new(GoodThomasAlgorithm::new( Arc::clone(&width_fft), Arc::clone(&height_fft), )); 
test::black_box(fft); }); } #[bench] fn good_thomas_setup_0002_3(b: &mut Bencher) { bench_good_thomas_setup(b, 2, 3); } #[bench] fn good_thomas_setup_0003_4(b: &mut Bencher) { bench_good_thomas_setup(b, 3, 4); } #[bench] fn good_thomas_setup_0004_5(b: &mut Bencher) { bench_good_thomas_setup(b, 4, 5); } #[bench] fn good_thomas_setup_0007_32(b: &mut Bencher) { bench_good_thomas_setup(b, 7, 32); } #[bench] fn good_thomas_setup_0032_27(b: &mut Bencher) { bench_good_thomas_setup(b, 32, 27); } #[bench] fn good_thomas_setup_0256_243(b: &mut Bencher) { bench_good_thomas_setup(b, 256, 243); } #[bench] fn good_thomas_setup_2048_3(b: &mut Bencher) { bench_good_thomas_setup(b, 2048, 3); } #[bench] fn good_thomas_setup_2048_2187(b: &mut Bencher) { bench_good_thomas_setup(b, 2048, 2187); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to the Mixed-Radix algorithm fn bench_mixed_radix(b: &mut Bencher, width: usize, height: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let width_fft = planner.plan_fft_forward(width); let height_fft = planner.plan_fft_forward(height); let fft: Arc> = Arc::new(MixedRadix::new(width_fft, height_fft)); let mut buffer = vec![ Complex { re: 0_f32, im: 0_f32 }; fft.len() ]; let mut scratch = vec![ Complex { re: 0_f32, im: 0_f32 }; fft.get_inplace_scratch_len() ]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn mixed_radix_0002_3(b: &mut Bencher) { bench_mixed_radix(b, 2, 3); } #[bench] fn mixed_radix_0003_4(b: &mut Bencher) { bench_mixed_radix(b, 3, 4); } #[bench] fn mixed_radix_0004_5(b: &mut Bencher) { bench_mixed_radix(b, 4, 5); } #[bench] fn mixed_radix_0007_32(b: &mut Bencher) { bench_mixed_radix(b, 7, 32); } #[bench] fn mixed_radix_0032_27(b: &mut Bencher) { bench_mixed_radix(b, 32, 27); } //#[bench] fn mixed_radix_0256_243(b: &mut Bencher) { bench_mixed_radix(b, 256, 243); } //#[bench] fn mixed_radix_2048_3(b: &mut Bencher) { bench_mixed_radix(b, 2048, 3); } //#[bench] fn mixed_radix_2048_2187(b: &mut Bencher) { bench_mixed_radix(b, 2048, 2187); } fn plan_butterfly_fft(len: usize) -> Arc> { match len { 2 => Arc::new(Butterfly2::new(FftDirection::Forward)), 3 => Arc::new(Butterfly3::new(FftDirection::Forward)), 4 => Arc::new(Butterfly4::new(FftDirection::Forward)), 5 => Arc::new(Butterfly5::new(FftDirection::Forward)), 6 => Arc::new(Butterfly6::new(FftDirection::Forward)), 7 => Arc::new(Butterfly7::new(FftDirection::Forward)), 8 => Arc::new(Butterfly8::new(FftDirection::Forward)), 16 => Arc::new(Butterfly16::new(FftDirection::Forward)), 32 => Arc::new(Butterfly32::new(FftDirection::Forward)), _ => panic!("Invalid butterfly size: {}", len), } } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to the MixedRadixSmall algorithm fn bench_mixed_radix_small(b: &mut Bencher, width: usize, height: usize) { let width_fft = plan_butterfly_fft(width); let height_fft = plan_butterfly_fft(height); let fft: Arc> = Arc::new(MixedRadixSmall::new(width_fft, height_fft)); let mut signal = vec![ Complex { re: 0_f32, im: 0_f32 }; width * height ]; let mut spectrum = signal.clone(); b.iter(|| { fft.process_with_scratch(&mut signal, &mut spectrum); }); } #[bench] fn mixed_radix_small_0002_3(b: &mut Bencher) { bench_mixed_radix_small(b, 2, 3); } #[bench] fn mixed_radix_small_0003_4(b: &mut Bencher) { bench_mixed_radix_small(b, 3, 4); } #[bench] fn mixed_radix_small_0004_5(b: &mut Bencher) { bench_mixed_radix_small(b, 4, 5); } #[bench] fn 
mixed_radix_small_0007_32(b: &mut Bencher) { bench_mixed_radix_small(b, 7, 32); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to the Mixed-Radix Double Butterfly algorithm fn bench_good_thomas_small(b: &mut Bencher, width: usize, height: usize) { let width_fft = plan_butterfly_fft(width); let height_fft = plan_butterfly_fft(height); let fft: Arc> = Arc::new(GoodThomasAlgorithmSmall::new(width_fft, height_fft)); let mut signal = vec![ Complex { re: 0_f32, im: 0_f32 }; width * height ]; let mut spectrum = signal.clone(); b.iter(|| { fft.process_with_scratch(&mut signal, &mut spectrum); }); } #[bench] fn good_thomas_small_0002_3(b: &mut Bencher) { bench_good_thomas_small(b, 2, 3); } #[bench] fn good_thomas_small_0003_4(b: &mut Bencher) { bench_good_thomas_small(b, 3, 4); } #[bench] fn good_thomas_small_0004_5(b: &mut Bencher) { bench_good_thomas_small(b, 4, 5); } #[bench] fn good_thomas_small_0007_32(b: &mut Bencher) { bench_good_thomas_small(b, 7, 32); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to Rader's algorithm #[allow(dead_code)] fn bench_raders_scalar(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let inner_fft = planner.plan_fft_forward(len - 1); let fft: Arc> = Arc::new(RadersAlgorithm::new(inner_fft)); let mut buffer = vec![ Complex { re: 0_f32, im: 0_f32 }; len ]; let mut scratch = vec![ Complex { re: 0_f32, im: 0_f32 }; fft.get_inplace_scratch_len() ]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } //#[bench] fn raders_fft_scalar_prime_0005(b: &mut Bencher) { bench_raders_scalar(b, 5); } //#[bench] fn raders_fft_scalar_prime_0017(b: &mut Bencher) { bench_raders_scalar(b, 17); } //#[bench] fn raders_fft_scalar_prime_0149(b: &mut Bencher) { bench_raders_scalar(b, 149); } //#[bench] fn raders_fft_scalar_prime_0151(b: &mut Bencher) { bench_raders_scalar(b, 151); } //#[bench] fn raders_fft_scalar_prime_0251(b: &mut Bencher) { bench_raders_scalar(b, 251); } //#[bench] fn raders_fft_scalar_prime_0257(b: &mut Bencher) { bench_raders_scalar(b, 257); } //#[bench] fn raders_fft_scalar_prime_1009(b: &mut Bencher) { bench_raders_scalar(b, 1009); } //#[bench] fn raders_fft_scalar_prime_2017(b: &mut Bencher) { bench_raders_scalar(b, 2017); } //#[bench] fn raders_fft_scalar_prime_12289(b: &mut Bencher) { bench_raders_scalar(b, 12289); } //#[bench] fn raders_fft_scalar_prime_18433(b: &mut Bencher) { bench_raders_scalar(b, 18433); } //#[bench] fn raders_fft_scalar_prime_65521(b: &mut Bencher) { bench_raders_scalar(b, 65521); } //#[bench] fn raders_fft_scalar_prime_65537(b: &mut Bencher) { bench_raders_scalar(b, 65537); } //#[bench] fn raders_fft_scalar_prime_746483(b: &mut Bencher) { bench_raders_scalar(b,746483); } //#[bench] fn raders_fft_scalar_prime_746497(b: &mut Bencher) { bench_raders_scalar(b,746497); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to Bluestein's Algorithm #[allow(dead_code)] fn bench_bluesteins_scalar_prime(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let inner_fft = planner.plan_fft_forward((len * 2 - 1).checked_next_power_of_two().unwrap()); let fft: Arc> = Arc::new(BluesteinsAlgorithm::new(len, inner_fft)); let mut buffer = vec![Zero::zero(); len]; let mut scratch = vec![Zero::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } //#[bench] fn 
bench_bluesteins_scalar_prime_0005(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 5); } //#[bench] fn bench_bluesteins_scalar_prime_0017(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 17); } //#[bench] fn bench_bluesteins_scalar_prime_0149(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 149); } //#[bench] fn bench_bluesteins_scalar_prime_0151(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 151); } //#[bench] fn bench_bluesteins_scalar_prime_0251(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 251); } //#[bench] fn bench_bluesteins_scalar_prime_0257(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 257); } //#[bench] fn bench_bluesteins_scalar_prime_1009(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 1009); } //#[bench] fn bench_bluesteins_scalar_prime_2017(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 2017); } //#[bench] fn bench_bluesteins_scalar_prime_32767(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 32767); } //#[bench] fn bench_bluesteins_scalar_prime_65521(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 65521); } //#[bench] fn bench_bluesteins_scalar_prime_65537(b: &mut Bencher) { bench_bluesteins_scalar_prime(b, 65537); } //#[bench] fn bench_bluesteins_scalar_prime_746483(b: &mut Bencher) { bench_bluesteins_scalar_prime(b,746483); } //#[bench] fn bench_bluesteins_scalar_prime_746497(b: &mut Bencher) { bench_bluesteins_scalar_prime(b,746497); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to Rader's algorithm fn bench_radix4(b: &mut Bencher, len: usize) { assert!(len % 4 == 0); let fft = Radix4::new(len, FftDirection::Forward); let mut signal = vec![ Complex { re: 0_f32, im: 0_f32 }; len ]; let mut spectrum = signal.clone(); b.iter(|| { fft.process_outofplace_with_scratch(&mut signal, &mut spectrum, &mut []); }); } #[bench] fn radix4_______64(b: &mut Bencher) { bench_radix4(b, 64); } #[bench] fn radix4______256(b: &mut Bencher) { bench_radix4(b, 256); } #[bench] fn radix4_____1024(b: &mut Bencher) { bench_radix4(b, 1024); } #[bench] fn radix4____65536(b: &mut Bencher) { bench_radix4(b, 65536); } //#[bench] fn radix4__1048576(b: &mut Bencher) { bench_radix4(b, 1048576); } //#[bench] fn radix4_16777216(b: &mut Bencher) { bench_radix4(b, 16777216); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to Rader's algorithm fn bench_64_radix4(b: &mut Bencher, len: usize) { assert!(len % 4 == 0); let fft = Radix4::new(len, FftDirection::Forward); let mut signal = vec![ Complex { re: 0_f64, im: 0_f64 }; len ]; let mut spectrum = signal.clone(); b.iter(|| { fft.process_outofplace_with_scratch(&mut signal, &mut spectrum, &mut []); }); } #[bench] fn radix4_64____64(b: &mut Bencher) { bench_64_radix4(b, 64); } #[bench] fn radix4_64___256(b: &mut Bencher) { bench_64_radix4(b, 256); } #[bench] fn radix4_64__1024(b: &mut Bencher) { bench_64_radix4(b, 1024); } #[bench] fn radix4_64_65536(b: &mut Bencher) { bench_64_radix4(b, 65536); } //#[bench] fn radix4__1048576(b: &mut Bencher) { bench_radix4(b, 1048576); } //#[bench] fn radix4_16777216(b: &mut Bencher) { bench_radix4(b, 16777216); } fn get_mixed_radix_power2(len: usize) -> Arc> { match len { 8 => Arc::new(Butterfly8::new(FftDirection::Forward)), 16 => Arc::new(Butterfly16::new(FftDirection::Forward)), 32 => Arc::new(Butterfly32::new(FftDirection::Forward)), _ => { let zeroes = len.trailing_zeros(); assert!(zeroes % 2 == 0); let half_zeroes = zeroes / 2; let inner = get_mixed_radix_power2(1 << 
half_zeroes); Arc::new(MixedRadix::new(Arc::clone(&inner), inner)) } } } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to Rader's algorithm fn bench_mixed_radix_power2(b: &mut Bencher, len: usize) { let fft = get_mixed_radix_power2(len); let mut buffer = vec![Zero::zero(); len]; let mut scratch = vec![Zero::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn mixed_radix_power2__00000256(b: &mut Bencher) { bench_mixed_radix_power2(b, 256); } #[bench] fn mixed_radix_power2__00001024(b: &mut Bencher) { bench_mixed_radix_power2(b, 1024); } #[bench] fn mixed_radix_power2__00004096(b: &mut Bencher) { bench_mixed_radix_power2(b, 4096); } #[bench] fn mixed_radix_power2__00065536(b: &mut Bencher) { bench_mixed_radix_power2(b, 65536); } //#[bench] fn mixed_radix_power2__01048576(b: &mut Bencher) { bench_mixed_radix_power2(b, 1048576); } //#[bench] fn mixed_radix_power2__16777216(b: &mut Bencher) { bench_mixed_radix_power2(b, 16777216); } fn get_mixed_radix_inline_power2(len: usize) -> Arc> { match len { 8 => Arc::new(Butterfly8::new(FftDirection::Forward)), 16 => Arc::new(Butterfly16::new(FftDirection::Forward)), 32 => Arc::new(Butterfly32::new(FftDirection::Forward)), _ => { let zeroes = len.trailing_zeros(); assert!(zeroes % 2 == 0); let half_zeroes = zeroes / 2; let inner = get_mixed_radix_inline_power2(1 << half_zeroes); Arc::new(MixedRadix::new(Arc::clone(&inner), inner)) } } } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length, specific to Rader's algorithm fn bench_mixed_radix_inline_power2(b: &mut Bencher, len: usize) { let fft = get_mixed_radix_inline_power2(len); let mut buffer = vec![Zero::zero(); len]; let mut scratch = vec![Zero::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn mixed_radix_power2_inline__00000256(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 256); } #[bench] fn mixed_radix_power2_inline__00001024(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 1024); } #[bench] fn mixed_radix_power2_inline__00004096(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 4096); } #[bench] fn mixed_radix_power2_inline__00065536(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 65536); } //#[bench] fn mixed_radix_power2_inline__01048576(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 1048576); } //#[bench] fn mixed_radix_power2_inline__16777216(b: &mut Bencher) { bench_mixed_radix_inline_power2(b, 16777216); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length fn bench_butterfly32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len * 10]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn butterfly32_02(b: &mut Bencher) { bench_butterfly32(b, 2); } #[bench] fn butterfly32_03(b: &mut Bencher) { bench_butterfly32(b, 3); } #[bench] fn butterfly32_04(b: &mut Bencher) { bench_butterfly32(b, 4); } #[bench] fn butterfly32_05(b: &mut Bencher) { bench_butterfly32(b, 5); } #[bench] fn butterfly32_06(b: &mut Bencher) { bench_butterfly32(b, 6); } #[bench] fn butterfly32_07(b: &mut Bencher) { bench_butterfly32(b, 7); } #[bench] fn butterfly32_08(b: &mut Bencher) { 
bench_butterfly32(b, 8); } #[bench] fn butterfly32_09(b: &mut Bencher) { bench_butterfly32(b, 9); } #[bench] fn butterfly32_11(b: &mut Bencher) { bench_butterfly32(b, 11); } #[bench] fn butterfly32_12(b: &mut Bencher) { bench_butterfly32(b, 12); } #[bench] fn butterfly32_16(b: &mut Bencher) { bench_butterfly32(b, 16); } //#[bench] fn butterfly32_24(b: &mut Bencher) { bench_butterfly32(b, 24); } //#[bench] fn butterfly32_27(b: &mut Bencher) { bench_butterfly32(b, 27); } //#[bench] fn butterfly32_32(b: &mut Bencher) { bench_butterfly32(b, 32); } //#[bench] fn butterfly32_36(b: &mut Bencher) { bench_butterfly32(b, 36); } //#[bench] fn butterfly32_48(b: &mut Bencher) { bench_butterfly32(b, 48); } //#[bench] fn butterfly32_54(b: &mut Bencher) { bench_butterfly32(b, 54); } //#[bench] fn butterfly32_64(b: &mut Bencher) { bench_butterfly32(b, 64); } //#[bench] fn butterfly32_72(b: &mut Bencher) { bench_butterfly32(b, 72); } //#[bench] fn butterfly32_128(b: &mut Bencher) { bench_butterfly32(b, 128); } //#[bench] fn butterfly32_256(b: &mut Bencher) { bench_butterfly32(b, 256); } //#[bench] fn butterfly32_512(b: &mut Bencher) { bench_butterfly32(b, 512); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length fn bench_butterfly64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerScalar::new(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len * 10]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn butterfly64_02(b: &mut Bencher) { bench_butterfly64(b, 2); } #[bench] fn butterfly64_03(b: &mut Bencher) { bench_butterfly64(b, 3); } #[bench] fn butterfly64_04(b: &mut Bencher) { bench_butterfly64(b, 4); } #[bench] fn butterfly64_05(b: &mut Bencher) { bench_butterfly64(b, 5); } #[bench] fn butterfly64_06(b: &mut Bencher) { bench_butterfly64(b, 6); } #[bench] fn butterfly64_07(b: &mut Bencher) { bench_butterfly64(b, 7); } #[bench] fn butterfly64_08(b: &mut Bencher) { bench_butterfly64(b, 8); } #[bench] fn butterfly64_09(b: &mut Bencher) { bench_butterfly64(b, 9); } #[bench] fn butterfly64_11(b: &mut Bencher) { bench_butterfly64(b, 11); } #[bench] fn butterfly64_12(b: &mut Bencher) { bench_butterfly64(b, 12); } #[bench] fn butterfly64_16(b: &mut Bencher) { bench_butterfly64(b, 16); } //#[bench] fn butterfly64_18(b: &mut Bencher) { bench_butterfly64(b, 18); } //#[bench] fn butterfly64_24(b: &mut Bencher) { bench_butterfly64(b, 24); } //#[bench] fn butterfly64_27(b: &mut Bencher) { bench_butterfly64(b, 27); } //#[bench] fn butterfly64_32(b: &mut Bencher) { bench_butterfly64(b, 32); } //#[bench] fn butterfly64_36(b: &mut Bencher) { bench_butterfly64(b, 36); } //#[bench] fn butterfly64_64(b: &mut Bencher) { bench_butterfly64(b, 64); } //#[bench] fn butterfly64_128(b: &mut Bencher) { bench_butterfly64(b, 128); } //#[bench] fn butterfly64_256(b: &mut Bencher) { bench_butterfly64(b, 256); } //#[bench] fn butterfly64_512(b: &mut Bencher) { bench_butterfly64(b, 512); } fn bench_bluesteins_setup(b: &mut Bencher, len: usize) { let inner_len = (len * 2 - 1).next_power_of_two(); let inner_fft = FftPlannerScalar::::new().plan_fft_forward(inner_len); b.iter(|| { test::black_box(BluesteinsAlgorithm::new(len, Arc::clone(&inner_fft))); }); } #[bench] fn setup_bluesteins_0017(b: &mut Bencher) { bench_bluesteins_setup(b, 17); } #[bench] fn setup_bluesteins_0055(b: &mut Bencher) { bench_bluesteins_setup(b, 55); } #[bench] fn 
setup_bluesteins_0117(b: &mut Bencher) { bench_bluesteins_setup(b, 117); } #[bench] fn setup_bluesteins_0555(b: &mut Bencher) { bench_bluesteins_setup(b, 555); } #[bench] fn setup_bluesteins_1117(b: &mut Bencher) { bench_bluesteins_setup(b, 1117); } #[bench] fn setup_bluesteins_5555(b: &mut Bencher) { bench_bluesteins_setup(b, 5555); } rustfft-6.2.0/benches/bench_rustfft_sse.rs000064400000000000000000000244220072674642500170450ustar 00000000000000#![feature(test)] extern crate rustfft; extern crate test; use rustfft::num_complex::Complex; use rustfft::num_traits::Zero; use rustfft::Fft; use std::sync::Arc; use test::Bencher; /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length fn bench_planned_f32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerSse::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); assert_eq!(fft.len(), len); let mut buffer = vec![Complex::zero(); len]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } fn bench_planned_f64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerSse::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); assert_eq!(fft.len(), len); let mut buffer = vec![Complex::zero(); len]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } /// Times just the FFT execution (not allocation and pre-calculation) /// for a given length. /// Run the fft on a 10*len vector, similar to how the butterflies are often used. fn bench_planned_multi_f32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerSse::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len * 10]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } fn bench_planned_multi_f64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerSse::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len * 10]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // All butterflies #[bench] fn sse_butterfly32_02(b: &mut Bencher) { bench_planned_multi_f32(b, 2);} #[bench] fn sse_butterfly32_03(b: &mut Bencher) { bench_planned_multi_f32(b, 3);} #[bench] fn sse_butterfly32_04(b: &mut Bencher) { bench_planned_multi_f32(b, 4);} #[bench] fn sse_butterfly32_05(b: &mut Bencher) { bench_planned_multi_f32(b, 5);} #[bench] fn sse_butterfly32_06(b: &mut Bencher) { bench_planned_multi_f32(b, 6);} #[bench] fn sse_butterfly32_07(b: &mut Bencher) { bench_planned_multi_f32(b, 7);} #[bench] fn sse_butterfly32_08(b: &mut Bencher) { bench_planned_multi_f32(b, 8);} #[bench] fn sse_butterfly32_09(b: &mut Bencher) { bench_planned_multi_f32(b, 9);} #[bench] fn sse_butterfly32_10(b: &mut Bencher) { bench_planned_multi_f32(b, 10);} #[bench] fn sse_butterfly32_11(b: &mut Bencher) { bench_planned_multi_f32(b, 11);} #[bench] fn sse_butterfly32_12(b: &mut Bencher) { bench_planned_multi_f32(b, 12);} #[bench] fn sse_butterfly32_13(b: &mut Bencher) { bench_planned_multi_f32(b, 13);} #[bench] fn sse_butterfly32_15(b: &mut Bencher) { bench_planned_multi_f32(b, 15);} #[bench] fn sse_butterfly32_16(b: &mut Bencher) { bench_planned_multi_f32(b, 16);} #[bench] fn 
sse_butterfly32_17(b: &mut Bencher) { bench_planned_multi_f32(b, 17);} #[bench] fn sse_butterfly32_19(b: &mut Bencher) { bench_planned_multi_f32(b, 19);} #[bench] fn sse_butterfly32_23(b: &mut Bencher) { bench_planned_multi_f32(b, 23);} #[bench] fn sse_butterfly32_29(b: &mut Bencher) { bench_planned_multi_f32(b, 29);} #[bench] fn sse_butterfly32_31(b: &mut Bencher) { bench_planned_multi_f32(b, 31);} #[bench] fn sse_butterfly32_32(b: &mut Bencher) { bench_planned_multi_f32(b, 32);} #[bench] fn sse_butterfly64_02(b: &mut Bencher) { bench_planned_multi_f64(b, 2);} #[bench] fn sse_butterfly64_03(b: &mut Bencher) { bench_planned_multi_f64(b, 3);} #[bench] fn sse_butterfly64_04(b: &mut Bencher) { bench_planned_multi_f64(b, 4);} #[bench] fn sse_butterfly64_05(b: &mut Bencher) { bench_planned_multi_f64(b, 5);} #[bench] fn sse_butterfly64_06(b: &mut Bencher) { bench_planned_multi_f64(b, 6);} #[bench] fn sse_butterfly64_07(b: &mut Bencher) { bench_planned_multi_f64(b, 7);} #[bench] fn sse_butterfly64_08(b: &mut Bencher) { bench_planned_multi_f64(b, 8);} #[bench] fn sse_butterfly64_09(b: &mut Bencher) { bench_planned_multi_f64(b, 9);} #[bench] fn sse_butterfly64_10(b: &mut Bencher) { bench_planned_multi_f64(b, 10);} #[bench] fn sse_butterfly64_11(b: &mut Bencher) { bench_planned_multi_f64(b, 11);} #[bench] fn sse_butterfly64_12(b: &mut Bencher) { bench_planned_multi_f64(b, 12);} #[bench] fn sse_butterfly64_13(b: &mut Bencher) { bench_planned_multi_f64(b, 13);} #[bench] fn sse_butterfly64_15(b: &mut Bencher) { bench_planned_multi_f64(b, 15);} #[bench] fn sse_butterfly64_16(b: &mut Bencher) { bench_planned_multi_f64(b, 16);} #[bench] fn sse_butterfly64_17(b: &mut Bencher) { bench_planned_multi_f64(b, 17);} #[bench] fn sse_butterfly64_19(b: &mut Bencher) { bench_planned_multi_f64(b, 19);} #[bench] fn sse_butterfly64_23(b: &mut Bencher) { bench_planned_multi_f64(b, 23);} #[bench] fn sse_butterfly64_29(b: &mut Bencher) { bench_planned_multi_f64(b, 29);} #[bench] fn sse_butterfly64_31(b: &mut Bencher) { bench_planned_multi_f64(b, 31);} #[bench] fn sse_butterfly64_32(b: &mut Bencher) { bench_planned_multi_f64(b, 32);} // Powers of 2 #[bench] fn sse_planned32_p2_00000064(b: &mut Bencher) { bench_planned_f32(b, 64); } #[bench] fn sse_planned32_p2_00000128(b: &mut Bencher) { bench_planned_f32(b, 128); } #[bench] fn sse_planned32_p2_00000256(b: &mut Bencher) { bench_planned_f32(b, 256); } #[bench] fn sse_planned32_p2_00000512(b: &mut Bencher) { bench_planned_f32(b, 512); } #[bench] fn sse_planned32_p2_00001024(b: &mut Bencher) { bench_planned_f32(b, 1024); } #[bench] fn sse_planned32_p2_00002048(b: &mut Bencher) { bench_planned_f32(b, 2048); } #[bench] fn sse_planned32_p2_00004096(b: &mut Bencher) { bench_planned_f32(b, 4096); } #[bench] fn sse_planned32_p2_00016384(b: &mut Bencher) { bench_planned_f32(b, 16384); } #[bench] fn sse_planned32_p2_00065536(b: &mut Bencher) { bench_planned_f32(b, 65536); } #[bench] fn sse_planned32_p2_01048576(b: &mut Bencher) { bench_planned_f32(b, 1048576); } #[bench] fn sse_planned64_p2_00000064(b: &mut Bencher) { bench_planned_f64(b, 64); } #[bench] fn sse_planned64_p2_00000128(b: &mut Bencher) { bench_planned_f64(b, 128); } #[bench] fn sse_planned64_p2_00000256(b: &mut Bencher) { bench_planned_f64(b, 256); } #[bench] fn sse_planned64_p2_00000512(b: &mut Bencher) { bench_planned_f64(b, 512); } #[bench] fn sse_planned64_p2_00001024(b: &mut Bencher) { bench_planned_f64(b, 1024); } #[bench] fn sse_planned64_p2_00002048(b: &mut Bencher) { bench_planned_f64(b, 2048); } #[bench] fn 
sse_planned64_p2_00004096(b: &mut Bencher) { bench_planned_f64(b, 4096); } #[bench] fn sse_planned64_p2_00016384(b: &mut Bencher) { bench_planned_f64(b, 16384); } #[bench] fn sse_planned64_p2_00065536(b: &mut Bencher) { bench_planned_f64(b, 65536); } #[bench] fn sse_planned64_p2_01048576(b: &mut Bencher) { bench_planned_f64(b, 1048576); } // Powers of 7 #[bench] fn sse_planned32_p7_00343(b: &mut Bencher) { bench_planned_f32(b, 343); } #[bench] fn sse_planned32_p7_02401(b: &mut Bencher) { bench_planned_f32(b, 2401); } #[bench] fn sse_planned32_p7_16807(b: &mut Bencher) { bench_planned_f32(b, 16807); } #[bench] fn sse_planned64_p7_00343(b: &mut Bencher) { bench_planned_f64(b, 343); } #[bench] fn sse_planned64_p7_02401(b: &mut Bencher) { bench_planned_f64(b, 2401); } #[bench] fn sse_planned64_p7_16807(b: &mut Bencher) { bench_planned_f64(b, 16807); } // Prime lengths #[bench] fn sse_planned32_prime_0149(b: &mut Bencher) { bench_planned_f32(b, 149); } #[bench] fn sse_planned32_prime_0151(b: &mut Bencher) { bench_planned_f32(b, 151); } #[bench] fn sse_planned32_prime_0251(b: &mut Bencher) { bench_planned_f32(b, 251); } #[bench] fn sse_planned32_prime_0257(b: &mut Bencher) { bench_planned_f32(b, 257); } #[bench] fn sse_planned32_prime_2017(b: &mut Bencher) { bench_planned_f32(b, 2017); } #[bench] fn sse_planned32_prime_2879(b: &mut Bencher) { bench_planned_f32(b, 2879); } #[bench] fn sse_planned32_prime_65521(b: &mut Bencher) { bench_planned_f32(b, 65521); } #[bench] fn sse_planned32_prime_746497(b: &mut Bencher) { bench_planned_f32(b,746497); } #[bench] fn sse_planned64_prime_0149(b: &mut Bencher) { bench_planned_f64(b, 149); } #[bench] fn sse_planned64_prime_0151(b: &mut Bencher) { bench_planned_f64(b, 151); } #[bench] fn sse_planned64_prime_0251(b: &mut Bencher) { bench_planned_f64(b, 251); } #[bench] fn sse_planned64_prime_0257(b: &mut Bencher) { bench_planned_f64(b, 257); } #[bench] fn sse_planned64_prime_2017(b: &mut Bencher) { bench_planned_f64(b, 2017); } #[bench] fn sse_planned64_prime_2879(b: &mut Bencher) { bench_planned_f64(b, 2879); } #[bench] fn sse_planned64_prime_65521(b: &mut Bencher) { bench_planned_f64(b, 65521); } #[bench] fn sse_planned64_prime_746497(b: &mut Bencher) { bench_planned_f64(b,746497); } // small mixed composites #[bench] fn sse_planned32_composite_000018(b: &mut Bencher) { bench_planned_f32(b, 00018); } #[bench] fn sse_planned32_composite_000360(b: &mut Bencher) { bench_planned_f32(b, 00360); } #[bench] fn sse_planned32_composite_001200(b: &mut Bencher) { bench_planned_f32(b, 01200); } #[bench] fn sse_planned32_composite_044100(b: &mut Bencher) { bench_planned_f32(b, 44100); } #[bench] fn sse_planned32_composite_048000(b: &mut Bencher) { bench_planned_f32(b, 48000); } #[bench] fn sse_planned32_composite_046656(b: &mut Bencher) { bench_planned_f32(b, 46656); } #[bench] fn sse_planned64_composite_000018(b: &mut Bencher) { bench_planned_f64(b, 00018); } #[bench] fn sse_planned64_composite_000360(b: &mut Bencher) { bench_planned_f64(b, 00360); } #[bench] fn sse_planned64_composite_001200(b: &mut Bencher) { bench_planned_f64(b, 01200); } #[bench] fn sse_planned64_composite_044100(b: &mut Bencher) { bench_planned_f64(b, 44100); } #[bench] fn sse_planned64_composite_048000(b: &mut Bencher) { bench_planned_f64(b, 48000); } #[bench] fn sse_planned64_composite_046656(b: &mut Bencher) { bench_planned_f64(b, 46656); } rustfft-6.2.0/benches/bench_rustfft_wasm_simd.rs000064400000000000000000000313470072674642500202420ustar 00000000000000#![feature(test)] /// Unfortunately, 
/// `cargo bench` does not permit running these benchmarks out-of-the-box
/// on a WebAssembly virtual machine.
///
/// Follow these steps to run these benchmarks:
/// 0. Prerequisites: Install the `wasm32-wasi` target and `wasmer`
///
/// 1. Build these benchmarks
/// ```bash
/// cargo build --bench=bench_rustfft_wasm_simd --release --target wasm32-wasi --features "wasm_simd"
/// ```
///
/// After cargo has built the bench binary, it stores it inside the
/// `/target/wasm32-wasi/release/deps` directory.
/// The file name of this binary follows this format: `bench_rustfft_wasm_simd-<hash>.wasm`.
/// For instance, it could be named
/// `target/wasm32-wasi/release/deps/bench_rustfft_wasm_simd-6d2b3d5a567416f5.wasm`
///
/// 2. Copy the most recently built WASM binary to hex.wasm
/// ```bash
/// cp `ls -t target/wasm32-wasi/release/deps/*.wasm | head -n 1` hex.wasm
/// ```
///
/// 3. Run these benchmarks, e.g. with [wasmer](https://github.com/wasmerio/wasmer)
/// ```bash
/// wasmer run --dir=. hex.wasm -- --bench
/// ```
///
/// For more information, refer to [Criterion's user guide](https://github.com/bheisler/criterion.rs/blob/dc2b06cd31f7aa34cff6a83a00598e0523186dad/book/src/user_guide/wasi.md)
/// which should be mostly applicable to our use case.
extern crate rustfft;
extern crate test;

use rustfft::num_complex::Complex;
use rustfft::num_traits::Zero;
use rustfft::Fft;
use std::sync::Arc;
use test::Bencher;

/// Times just the FFT execution (not allocation and pre-calculation)
/// for a given length
fn bench_planned_f32(b: &mut Bencher, len: usize) {
    let mut planner = rustfft::FftPlannerWasmSimd::new().unwrap();
    let fft: Arc<dyn Fft<f32>> = planner.plan_fft_forward(len);
    assert_eq!(fft.len(), len);

    let mut buffer = vec![Complex::zero(); len];
    let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()];
    b.iter(|| {
        fft.process_with_scratch(&mut buffer, &mut scratch);
    });
}

fn bench_planned_f64(b: &mut Bencher, len: usize) {
    let mut planner = rustfft::FftPlannerWasmSimd::new().unwrap();
    let fft: Arc<dyn Fft<f64>> = planner.plan_fft_forward(len);
    assert_eq!(fft.len(), len);

    let mut buffer = vec![Complex::zero(); len];
    let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()];
    b.iter(|| {
        fft.process_with_scratch(&mut buffer, &mut scratch);
    });
}

/// Times just the FFT execution (not allocation and pre-calculation)
/// for a given length.
/// Run the fft on a 10*len vector, similar to how the butterflies are often used.
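// Illustrative sketch, not part of the benchmark itself: `Fft::process_with_scratch`
// (and `Fft::process`) treat a buffer whose length is a multiple of `fft.len()` as that
// many contiguous transforms, so the 10*len buffer below times ten back-to-back FFTs
// per iteration. For example:
//
//     let fft = rustfft::FftPlannerWasmSimd::new().unwrap().plan_fft_forward(16);
//     let mut buffer = vec![Complex::<f32>::zero(); 16 * 10]; // ten size-16 FFTs per call
//     fft.process(&mut buffer);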
fn bench_planned_multi_f32(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerWasmSimd::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len * 10]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } fn bench_planned_multi_f64(b: &mut Bencher, len: usize) { let mut planner = rustfft::FftPlannerWasmSimd::new().unwrap(); let fft: Arc> = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); len * 10]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // All butterflies #[bench] fn wasm_simd_butterfly32_02(b: &mut Bencher) { bench_planned_multi_f32(b, 2); } #[bench] fn wasm_simd_butterfly32_03(b: &mut Bencher) { bench_planned_multi_f32(b, 3); } #[bench] fn wasm_simd_butterfly32_04(b: &mut Bencher) { bench_planned_multi_f32(b, 4); } #[bench] fn wasm_simd_butterfly32_05(b: &mut Bencher) { bench_planned_multi_f32(b, 5); } #[bench] fn wasm_simd_butterfly32_06(b: &mut Bencher) { bench_planned_multi_f32(b, 6); } #[bench] fn wasm_simd_butterfly32_07(b: &mut Bencher) { bench_planned_multi_f32(b, 7); } #[bench] fn wasm_simd_butterfly32_08(b: &mut Bencher) { bench_planned_multi_f32(b, 8); } #[bench] fn wasm_simd_butterfly32_09(b: &mut Bencher) { bench_planned_multi_f32(b, 9); } #[bench] fn wasm_simd_butterfly32_10(b: &mut Bencher) { bench_planned_multi_f32(b, 10); } #[bench] fn wasm_simd_butterfly32_11(b: &mut Bencher) { bench_planned_multi_f32(b, 11); } #[bench] fn wasm_simd_butterfly32_12(b: &mut Bencher) { bench_planned_multi_f32(b, 12); } #[bench] fn wasm_simd_butterfly32_13(b: &mut Bencher) { bench_planned_multi_f32(b, 13); } #[bench] fn wasm_simd_butterfly32_15(b: &mut Bencher) { bench_planned_multi_f32(b, 15); } #[bench] fn wasm_simd_butterfly32_16(b: &mut Bencher) { bench_planned_multi_f32(b, 16); } #[bench] fn wasm_simd_butterfly32_17(b: &mut Bencher) { bench_planned_multi_f32(b, 17); } #[bench] fn wasm_simd_butterfly32_19(b: &mut Bencher) { bench_planned_multi_f32(b, 19); } #[bench] fn wasm_simd_butterfly32_23(b: &mut Bencher) { bench_planned_multi_f32(b, 23); } #[bench] fn wasm_simd_butterfly32_29(b: &mut Bencher) { bench_planned_multi_f32(b, 29); } #[bench] fn wasm_simd_butterfly32_31(b: &mut Bencher) { bench_planned_multi_f32(b, 31); } #[bench] fn wasm_simd_butterfly32_32(b: &mut Bencher) { bench_planned_multi_f32(b, 32); } #[bench] fn wasm_simd_butterfly64_02(b: &mut Bencher) { bench_planned_multi_f64(b, 2); } #[bench] fn wasm_simd_butterfly64_03(b: &mut Bencher) { bench_planned_multi_f64(b, 3); } #[bench] fn wasm_simd_butterfly64_04(b: &mut Bencher) { bench_planned_multi_f64(b, 4); } #[bench] fn wasm_simd_butterfly64_05(b: &mut Bencher) { bench_planned_multi_f64(b, 5); } #[bench] fn wasm_simd_butterfly64_06(b: &mut Bencher) { bench_planned_multi_f64(b, 6); } #[bench] fn wasm_simd_butterfly64_07(b: &mut Bencher) { bench_planned_multi_f64(b, 7); } #[bench] fn wasm_simd_butterfly64_08(b: &mut Bencher) { bench_planned_multi_f64(b, 8); } #[bench] fn wasm_simd_butterfly64_09(b: &mut Bencher) { bench_planned_multi_f64(b, 9); } #[bench] fn wasm_simd_butterfly64_10(b: &mut Bencher) { bench_planned_multi_f64(b, 10); } #[bench] fn wasm_simd_butterfly64_11(b: &mut Bencher) { bench_planned_multi_f64(b, 11); } #[bench] fn wasm_simd_butterfly64_12(b: &mut Bencher) { bench_planned_multi_f64(b, 12); } #[bench] fn wasm_simd_butterfly64_13(b: 
&mut Bencher) { bench_planned_multi_f64(b, 13); } #[bench] fn wasm_simd_butterfly64_15(b: &mut Bencher) { bench_planned_multi_f64(b, 15); } #[bench] fn wasm_simd_butterfly64_16(b: &mut Bencher) { bench_planned_multi_f64(b, 16); } #[bench] fn wasm_simd_butterfly64_17(b: &mut Bencher) { bench_planned_multi_f64(b, 17); } #[bench] fn wasm_simd_butterfly64_19(b: &mut Bencher) { bench_planned_multi_f64(b, 19); } #[bench] fn wasm_simd_butterfly64_23(b: &mut Bencher) { bench_planned_multi_f64(b, 23); } #[bench] fn wasm_simd_butterfly64_29(b: &mut Bencher) { bench_planned_multi_f64(b, 29); } #[bench] fn wasm_simd_butterfly64_31(b: &mut Bencher) { bench_planned_multi_f64(b, 31); } #[bench] fn wasm_simd_butterfly64_32(b: &mut Bencher) { bench_planned_multi_f64(b, 32); } // Powers of 2 #[bench] fn wasm_simd_planned32_p2_00000064(b: &mut Bencher) { bench_planned_f32(b, 64); } #[bench] fn wasm_simd_planned32_p2_00000128(b: &mut Bencher) { bench_planned_f32(b, 128); } #[bench] fn wasm_simd_planned32_p2_00000256(b: &mut Bencher) { bench_planned_f32(b, 256); } #[bench] fn wasm_simd_planned32_p2_00000512(b: &mut Bencher) { bench_planned_f32(b, 512); } #[bench] fn wasm_simd_planned32_p2_00001024(b: &mut Bencher) { bench_planned_f32(b, 1024); } #[bench] fn wasm_simd_planned32_p2_00002048(b: &mut Bencher) { bench_planned_f32(b, 2048); } #[bench] fn wasm_simd_planned32_p2_00004096(b: &mut Bencher) { bench_planned_f32(b, 4096); } #[bench] fn wasm_simd_planned32_p2_00016384(b: &mut Bencher) { bench_planned_f32(b, 16384); } #[bench] fn wasm_simd_planned32_p2_00065536(b: &mut Bencher) { bench_planned_f32(b, 65536); } #[bench] fn wasm_simd_planned32_p2_01048576(b: &mut Bencher) { bench_planned_f32(b, 1048576); } #[bench] fn wasm_simd_planned64_p2_00000064(b: &mut Bencher) { bench_planned_f64(b, 64); } #[bench] fn wasm_simd_planned64_p2_00000128(b: &mut Bencher) { bench_planned_f64(b, 128); } #[bench] fn wasm_simd_planned64_p2_00000256(b: &mut Bencher) { bench_planned_f64(b, 256); } #[bench] fn wasm_simd_planned64_p2_00000512(b: &mut Bencher) { bench_planned_f64(b, 512); } #[bench] fn wasm_simd_planned64_p2_00001024(b: &mut Bencher) { bench_planned_f64(b, 1024); } #[bench] fn wasm_simd_planned64_p2_00002048(b: &mut Bencher) { bench_planned_f64(b, 2048); } #[bench] fn wasm_simd_planned64_p2_00004096(b: &mut Bencher) { bench_planned_f64(b, 4096); } #[bench] fn wasm_simd_planned64_p2_00016384(b: &mut Bencher) { bench_planned_f64(b, 16384); } #[bench] fn wasm_simd_planned64_p2_00065536(b: &mut Bencher) { bench_planned_f64(b, 65536); } #[bench] fn wasm_simd_planned64_p2_01048576(b: &mut Bencher) { bench_planned_f64(b, 1048576); } // Powers of 7 #[bench] fn wasm_simd_planned32_p7_00343(b: &mut Bencher) { bench_planned_f32(b, 343); } #[bench] fn wasm_simd_planned32_p7_02401(b: &mut Bencher) { bench_planned_f32(b, 2401); } #[bench] fn wasm_simd_planned32_p7_16807(b: &mut Bencher) { bench_planned_f32(b, 16807); } #[bench] fn wasm_simd_planned64_p7_00343(b: &mut Bencher) { bench_planned_f64(b, 343); } #[bench] fn wasm_simd_planned64_p7_02401(b: &mut Bencher) { bench_planned_f64(b, 2401); } #[bench] fn wasm_simd_planned64_p7_16807(b: &mut Bencher) { bench_planned_f64(b, 16807); } // Prime lengths #[bench] fn wasm_simd_planned32_prime_0149(b: &mut Bencher) { bench_planned_f32(b, 149); } #[bench] fn wasm_simd_planned32_prime_0151(b: &mut Bencher) { bench_planned_f32(b, 151); } #[bench] fn wasm_simd_planned32_prime_0251(b: &mut Bencher) { bench_planned_f32(b, 251); } #[bench] fn wasm_simd_planned32_prime_0257(b: &mut Bencher) { 
bench_planned_f32(b, 257); } #[bench] fn wasm_simd_planned32_prime_2017(b: &mut Bencher) { bench_planned_f32(b, 2017); } #[bench] fn wasm_simd_planned32_prime_2879(b: &mut Bencher) { bench_planned_f32(b, 2879); } #[bench] fn wasm_simd_planned32_prime_65521(b: &mut Bencher) { bench_planned_f32(b, 65521); } #[bench] fn wasm_simd_planned32_prime_746497(b: &mut Bencher) { bench_planned_f32(b, 746497); } #[bench] fn wasm_simd_planned64_prime_0149(b: &mut Bencher) { bench_planned_f64(b, 149); } #[bench] fn wasm_simd_planned64_prime_0151(b: &mut Bencher) { bench_planned_f64(b, 151); } #[bench] fn wasm_simd_planned64_prime_0251(b: &mut Bencher) { bench_planned_f64(b, 251); } #[bench] fn wasm_simd_planned64_prime_0257(b: &mut Bencher) { bench_planned_f64(b, 257); } #[bench] fn wasm_simd_planned64_prime_2017(b: &mut Bencher) { bench_planned_f64(b, 2017); } #[bench] fn wasm_simd_planned64_prime_2879(b: &mut Bencher) { bench_planned_f64(b, 2879); } #[bench] fn wasm_simd_planned64_prime_65521(b: &mut Bencher) { bench_planned_f64(b, 65521); } #[bench] fn wasm_simd_planned64_prime_746497(b: &mut Bencher) { bench_planned_f64(b, 746497); } // small mixed composites #[bench] fn wasm_simd_planned32_composite_000018(b: &mut Bencher) { bench_planned_f32(b, 00018); } #[bench] fn wasm_simd_planned32_composite_000360(b: &mut Bencher) { bench_planned_f32(b, 00360); } #[bench] fn wasm_simd_planned32_composite_001200(b: &mut Bencher) { bench_planned_f32(b, 01200); } #[bench] fn wasm_simd_planned32_composite_044100(b: &mut Bencher) { bench_planned_f32(b, 44100); } #[bench] fn wasm_simd_planned32_composite_048000(b: &mut Bencher) { bench_planned_f32(b, 48000); } #[bench] fn wasm_simd_planned32_composite_046656(b: &mut Bencher) { bench_planned_f32(b, 46656); } #[bench] fn wasm_simd_planned64_composite_000018(b: &mut Bencher) { bench_planned_f64(b, 00018); } #[bench] fn wasm_simd_planned64_composite_000360(b: &mut Bencher) { bench_planned_f64(b, 00360); } #[bench] fn wasm_simd_planned64_composite_001200(b: &mut Bencher) { bench_planned_f64(b, 01200); } #[bench] fn wasm_simd_planned64_composite_044100(b: &mut Bencher) { bench_planned_f64(b, 44100); } #[bench] fn wasm_simd_planned64_composite_048000(b: &mut Bencher) { bench_planned_f64(b, 48000); } #[bench] fn wasm_simd_planned64_composite_046656(b: &mut Bencher) { bench_planned_f64(b, 46656); } rustfft-6.2.0/benches/compare_3n2m_strategies.rs000064400000000000000000001261530072674642500200620ustar 00000000000000#![feature(test)] #![allow(non_snake_case)] #![allow(unused)] extern crate rustfft; extern crate test; use test::Bencher; use rustfft::algorithm::butterflies::*; use rustfft::algorithm::Dft; use rustfft::num_complex::Complex; use rustfft::num_traits::Zero; use rustfft::{Fft, FftNum}; use rustfft::{FftPlanner, FftPlannerAvx}; use primal_check::miller_rabin; use std::sync::Arc; /// This benchmark's purpose is to build some programmer intuition for planner heuristics /// We have mixed radix 2xn, 3xn, 4xn, 6xn, 8xn, 9x, 12xn, and 16xn implementations -- for a given FFT of the form 2^xn * 3^m, which combination is the fastest? Is 12xn -> 4xn faster than 6xn -> 8xn? /// Is it faster to put 9xn as an outer FFT of 8xn or as an inner FFT? 
this file autogenerates benchmarks that answer these questions /// /// The "generate_3n2m_comparison_benchmarks" benchmark will print benchmark code to the console which should be pasted back into this file, basically a low-budget procedural macro #[derive(Clone, Debug)] struct FftSize { len: usize, power2: u32, power3: u32, } impl FftSize { fn new(len: usize) -> Self { let power2 = len.trailing_zeros(); let mut remaining_factors = len >> power2; let mut power3 = 0; while remaining_factors % 3 == 0 { power3 += 1; remaining_factors /= 3; } assert!(remaining_factors == 1); Self { power2, power3, len, } } fn divide(&self, other: &Self) -> Option { if self.power2 <= other.power2 && self.power3 <= other.power3 { Some(Self { power2: other.power2 - self.power2, power3: other.power3 - self.power3, len: other.len / self.len, }) } else { None } } } // We don't need to generate a combinatoric explosion of tests that we know will be slow. filter_radix applies some dumb heuristics to filter out the most common slow cases fn filter_radix(current_strategy: &[usize], potential_radix: &FftSize, is_butterfly: bool) -> bool { // if we've seen any radix larger than this before, reject. otherwise we'll get a million reorderings of the same radixex, with benchmarking showing that smaller being higher is typically faster if !is_butterfly && current_strategy .iter() .find(|i| **i > potential_radix.len && **i != 16) .is_some() { return false; } // apply filters to size 2 if potential_radix.len == 2 { // if our strategy already contains any 2's, 3's, or 4's, reject -- because 4, 6, or 8 will be faster, respectively return !current_strategy.contains(&2) && !current_strategy.contains(&3) && !current_strategy.contains(&4); } // apply filters to size 3 if potential_radix.len == 3 { // if our strategy already contains any 2's, 3's or 4s, reject -- because 6 and 9 and 12 will be faster, respectively return !current_strategy.contains(&2) && !current_strategy.contains(&3) && !current_strategy.contains(&4); } // apply filters to size 4 if potential_radix.len == 4 { // if our strategy already contains any 2's, reject -- because 8 will be faster // if our strategy already contains 2 4's, don't add a third, because 2 8's would have been faster // if our strategy already contains a 16, reject -- because 2 8's will be faster (8s are seriously fast guys) return !current_strategy.contains(&2) && !current_strategy.contains(&3) && !current_strategy.contains(&4) && !current_strategy.contains(&16); } if potential_radix.len == 16 { // if our strategy already contains a 4, reject -- because 2 8's will be faster (8s are seriously fast guys) // if our strategy already contains a 16, reject -- benchmarking shows that 16s are very situational, and repeating them never helps) return !current_strategy.contains(&4) && !current_strategy.contains(&16); } return true; } fn recursive_strategy_builder( strategy_list: &mut Vec>, last_ditch_strategy_list: &mut Vec>, mut current_strategy: Vec, len: FftSize, butterfly_sizes: &[usize], last_ditch_butterflies: &[usize], available_radixes: &[FftSize], ) { if butterfly_sizes.contains(&len.len) { if filter_radix(¤t_strategy, &len, true) { current_strategy.push(len.len); //If this strategy contains a 2 or 3, it's very unlikely to be the fastest. 
we don't want to rule it out, because it's required sometimes, but don't use it unless there aren't any other if current_strategy.contains(&2) || current_strategy.contains(&3) { strategy_list.push(current_strategy.clone()); } else { strategy_list.push(current_strategy.clone()); } } } else if last_ditch_butterflies.contains(&len.len) { if filter_radix(¤t_strategy, &len, true) { current_strategy.push(len.len); last_ditch_strategy_list.push(current_strategy.clone()); } } else if len.len > 1 { for radix in available_radixes { if filter_radix(¤t_strategy, radix, false) { if let Some(inner) = radix.divide(&len) { let mut cloned_strategy = current_strategy.clone(); cloned_strategy.push(radix.len); recursive_strategy_builder( strategy_list, last_ditch_strategy_list, cloned_strategy, inner, butterfly_sizes, last_ditch_butterflies, available_radixes, ); } } } } } // it's faster to filter strategies at the radix level since we can prune entire permutations, but some can only be done once the full plan is built fn filter_strategy(strategy: &Vec) -> bool { if strategy.contains(&16) { let index = strategy.iter().position(|s| *s == 16).unwrap(); index == 0 || index == strategy.len() - 1 || index == strategy.len() - 2 || (strategy[index - 1] < 12 && strategy[index + 1] >= 12) } else { true } } // cargo bench generate_3n2m_comparison_benchmarks_32 -- --nocapture --ignored #[ignore] #[bench] fn generate_3n2m_comparison_benchmarks_32(_: &mut test::Bencher) { let butterfly_sizes = [128, 256, 512, 72, 36, 48, 54, 64]; let last_ditch_butterflies = [27, 9, 32, 24]; let available_radixes = [ FftSize::new(3), FftSize::new(4), FftSize::new(6), FftSize::new(8), FftSize::new(9), FftSize::new(12), FftSize::new(16), ]; let max_len: usize = 1 << 21; let min_len = 64; let max_power2 = max_len.trailing_zeros(); let max_power3 = (max_len as f32).log(3.0).ceil() as u32; for power3 in 1..2 { for power2 in 4..max_power2 { let len = 3usize.pow(power3) << power2; if len > max_len { continue; } //let planned_fft : Arc> = rustfft::FftPlanner::new(false).plan_fft(len); // we want to catalog all the different possible ways there are to compute a FFT of size `len` // we can do that by recursively looping over each radix, dividing our length by that radix, then recursively trying rach radix again let mut strategies = vec![]; let mut last_ditch_strategies = vec![]; recursive_strategy_builder( &mut strategies, &mut last_ditch_strategies, Vec::new(), FftSize::new(len), &butterfly_sizes, &last_ditch_butterflies, &available_radixes, ); if strategies.len() == 0 { strategies = last_ditch_strategies; } for mut s in strategies.into_iter().filter(filter_strategy) { s.reverse(); let strategy_strings: Vec<_> = s.into_iter().map(|i| i.to_string()).collect(); let test_id = strategy_strings.join("_"); let strategy_array = strategy_strings.join(","); println!("#[bench] fn comparef32__2power{:02}__3power{:02}__len{:08}__{}(b: &mut Bencher) {{ compare_fft_f32(b, &[{}]); }}", power2, power3, len, test_id, strategy_array); } } } } // cargo bench generate_3n2m_comparison_benchmarks_64 -- --nocapture --ignored #[ignore] #[bench] fn generate_3n2m_comparison_benchmarks_64(_: &mut test::Bencher) { let butterfly_sizes = [512, 256, 128, 64, 36, 27, 24, 18, 12]; let last_ditch_butterflies = [32, 16, 8, 9]; let available_radixes = [ FftSize::new(3), FftSize::new(4), FftSize::new(6), FftSize::new(8), FftSize::new(9), FftSize::new(12), ]; let max_len: usize = 1 << 21; let min_len = 64; let max_power2 = max_len.trailing_zeros(); let max_power3 = (max_len as 
f32).log(3.0).ceil() as u32; for power3 in 0..1 { for power2 in 3..max_power2 { let len = 3usize.pow(power3) << power2; if len > max_len { continue; } //let planned_fft : Arc> = rustfft::FftPlanner::new(false).plan_fft(len); // we want to catalog all the different possible ways there are to compute a FFT of size `len` // we can do that by recursively looping over each radix, dividing our length by that radix, then recursively trying rach radix again // we can do that by recursively looping over each radix, dividing our length by that radix, then recursively trying rach radix again let mut strategies = vec![]; let mut last_ditch_strategies = vec![]; recursive_strategy_builder( &mut strategies, &mut last_ditch_strategies, Vec::new(), FftSize::new(len), &butterfly_sizes, &last_ditch_butterflies, &available_radixes, ); if strategies.len() == 0 { strategies = last_ditch_strategies; } for mut s in strategies.into_iter().filter(filter_strategy) { s.reverse(); let strategy_strings: Vec<_> = s.into_iter().map(|i| i.to_string()).collect(); let test_id = strategy_strings.join("_"); let strategy_array = strategy_strings.join(","); println!("#[bench] fn comparef64__2power{:02}__3power{:02}__len{:08}__{}(b: &mut Bencher) {{ compare_fft_f64(b, &[{}]); }}", power2, power3, len, test_id, strategy_array); } } } } // cargo bench generate_3n2m_planned_benchmarks_32 -- --nocapture --ignored #[ignore] #[bench] fn generate_3n2m_planned_benchmarks(_: &mut test::Bencher) { let mut fft_sizes = vec![]; let max_len: usize = 1 << 23; let max_power2 = max_len.trailing_zeros(); let max_power3 = (max_len as f32).log(3.0).ceil() as u32; for power2 in 0..max_power2 { for power3 in 0..max_power3 { let len = 3usize.pow(power3) << power2; if len > max_len { continue; } if power3 < 2 && power2 > 16 { continue; } if power3 < 3 && power2 > 17 { continue; } if power2 < 1 { continue; } fft_sizes.push(len); } } for len in fft_sizes { let power2 = len.trailing_zeros(); let mut remaining_factors = len >> power2; let mut power3 = 0; while remaining_factors % 3 == 0 { power3 += 1; remaining_factors /= 3; } println!("#[bench] fn comparef32_len{:07}_2power{:02}_3power{:02}(b: &mut Bencher) {{ bench_planned_fft_f32(b, {}); }}",len, power2, power3, len); } } // cargo bench generate_3n2m_planned_benchmarks_64 -- --nocapture --ignored #[ignore] #[bench] fn generate_3n2m_planned_benchmarks_64(_: &mut test::Bencher) { let mut fft_sizes = vec![]; let max_len: usize = 1 << 23; let max_power2 = max_len.trailing_zeros(); let max_power3 = (max_len as f32).log(3.0).ceil() as u32; for power2 in 0..max_power2 { for power3 in 0..max_power3 { let len = 3usize.pow(power3) << power2; if len > max_len { continue; } if power3 < 1 && power2 > 13 { continue; } if power3 < 4 && power2 > 14 { continue; } if power2 < 2 { continue; } fft_sizes.push(len); } } for len in fft_sizes { let power2 = len.trailing_zeros(); let mut remaining_factors = len >> power2; let mut power3 = 0; while remaining_factors % 3 == 0 { power3 += 1; remaining_factors /= 3; } println!("#[bench] fn comparef64_len{:07}_2power{:02}_3power{:02}(b: &mut Bencher) {{ bench_planned_fft_f64(b, {}); }}",len, power2, power3, len); } } #[derive(Copy, Clone, Debug)] pub struct PartialFactors { power2: u32, power3: u32, power5: u32, power7: u32, power11: u32, other_factors: usize, } impl PartialFactors { pub fn compute(len: usize) -> Self { let power2 = len.trailing_zeros(); let mut other_factors = len >> power2; let mut power3 = 0; while other_factors % 3 == 0 { power3 += 1; other_factors /= 3; } let 
mut power5 = 0; while other_factors % 5 == 0 { power5 += 1; other_factors /= 5; } let mut power7 = 0; while other_factors % 7 == 0 { power7 += 1; other_factors /= 7; } let mut power11 = 0; while other_factors % 11 == 0 { power11 += 1; other_factors /= 11; } Self { power2, power3, power5, power7, power11, other_factors, } } pub fn get_power2(&self) -> u32 { self.power2 } pub fn get_power3(&self) -> u32 { self.power3 } pub fn get_power5(&self) -> u32 { self.power5 } pub fn get_power7(&self) -> u32 { self.power7 } pub fn get_power11(&self) -> u32 { self.power11 } pub fn get_other_factors(&self) -> usize { self.other_factors } pub fn product(&self) -> usize { (self.other_factors * 3usize.pow(self.power3) * 5usize.pow(self.power5) * 7usize.pow(self.power7) * 11usize.pow(self.power11)) << self.power2 } pub fn product_power2power3(&self) -> usize { 3usize.pow(self.power3) << self.power2 } #[allow(unused)] pub fn divide_by(&self, divisor: &PartialFactors) -> Option { let two_divides = self.power2 >= divisor.power2; let three_divides = self.power3 >= divisor.power3; let five_divides = self.power5 >= divisor.power5; let seven_divides = self.power7 >= divisor.power7; let eleven_divides = self.power11 >= divisor.power11; let other_divides = self.other_factors % divisor.other_factors == 0; if two_divides && three_divides && five_divides && seven_divides && eleven_divides && other_divides { Some(Self { power2: self.power2 - divisor.power2, power3: self.power3 - divisor.power3, power5: self.power5 - divisor.power5, power7: self.power7 - divisor.power7, power11: self.power11 - divisor.power11, other_factors: if self.other_factors == divisor.other_factors { 1 } else { self.other_factors / divisor.other_factors }, }) } else { None } } } // cargo bench generate_raders_benchmarks -- --nocapture --ignored #[ignore] #[bench] fn generate_raders_benchmarks(_: &mut test::Bencher) { for len in 10usize..100000 { if miller_rabin(len as u64) { let inner_factors = PartialFactors::compute(len - 1); if inner_factors.get_other_factors() == 1 && inner_factors.get_power11() > 0 { println!("#[bench] fn comparef64_len{:07}_11p{:02}_bluesteins(b: &mut Bencher) {{ bench_planned_bluesteins_f64(b, {}); }}", len, inner_factors.get_power11(), len); println!("#[bench] fn comparef64_len{:07}_11p{:02}_raders(b: &mut Bencher) {{ bench_planned_raders_f64(b, {}); }}", len, inner_factors.get_power11(), len); } } } } fn wrap_fft(fft: impl Fft + 'static) -> Arc> { Arc::new(fft) as Arc> } // passes the given FFT length directly to the FFT planner fn bench_planned_fft_f32(b: &mut Bencher, len: usize) { let mut planner: FftPlanner = FftPlanner::new(); let fft = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); fft.len()]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // passes the given FFT length directly to the FFT planner fn bench_planned_fft_f64(b: &mut Bencher, len: usize) { let mut planner: FftPlanner = FftPlanner::new(); let fft = planner.plan_fft_forward(len); let mut buffer = vec![Complex::zero(); fft.len()]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } /* // Computes the given FFT length using Bluestein's Algorithm, using the planner to plan the inner FFT fn bench_planned_bluesteins_f32(b: &mut Bencher, len: usize) { let mut planner : FftPlannerAvx = FftPlannerAvx::new(false).unwrap(); let fft = 
planner.construct_bluesteins(len); let mut buffer = vec![Complex::zero(); fft.len()]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Computes the given FFT length using Rader's Algorithm, using the planner to plan the inner FFT fn bench_planned_raders_f32(b: &mut Bencher, len: usize) { let mut planner : FftPlannerAvx = FftPlannerAvx::new(false).unwrap(); let fft = planner.construct_raders(len); let mut buffer = vec![Complex::zero(); fft.len()]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Computes the given FFT length using Bluestein's Algorithm, using the planner to plan the inner FFT fn bench_planned_bluesteins_f64(b: &mut Bencher, len: usize) { let mut planner : FftPlannerAvx = FftPlannerAvx::new(false).unwrap(); let fft = planner.construct_bluesteins(len); let mut buffer = vec![Complex::zero(); fft.len()]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } // Computes the given FFT length using Rader's Algorithm, using the planner to plan the inner FFT fn bench_planned_raders_f64(b: &mut Bencher, len: usize) { let mut planner : FftPlannerAvx = FftPlannerAvx::new(false).unwrap(); let fft = planner.construct_raders(len); let mut buffer = vec![Complex::zero(); fft.len()]; let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; b.iter(|| { fft.process_with_scratch(&mut buffer, &mut scratch); }); } #[bench] fn comparef64_len0000023_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 23); } #[bench] fn comparef64_len0000023_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 23); } #[bench] fn comparef64_len0000067_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 67); } #[bench] fn comparef64_len0000067_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 67); } #[bench] fn comparef64_len0000089_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 89); } #[bench] fn comparef64_len0000089_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 89); } #[bench] fn comparef64_len0000199_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 199); } #[bench] fn comparef64_len0000199_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 199); } #[bench] fn comparef64_len0000331_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 331); } #[bench] fn comparef64_len0000331_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 331); } #[bench] fn comparef64_len0000353_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 353); } #[bench] fn comparef64_len0000353_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 353); } #[bench] fn comparef64_len0000397_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 397); } #[bench] fn comparef64_len0000397_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 397); } #[bench] fn comparef64_len0000463_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 463); } #[bench] fn comparef64_len0000463_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 463); } #[bench] fn comparef64_len0000617_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 617); } #[bench] fn comparef64_len0000617_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 617); } #[bench] fn 
comparef64_len0000661_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 661); } #[bench] fn comparef64_len0000661_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 661); } #[bench] fn comparef64_len0000727_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 727); } #[bench] fn comparef64_len0000727_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 727); } #[bench] fn comparef64_len0000881_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 881); } #[bench] fn comparef64_len0000881_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 881); } #[bench] fn comparef64_len0000991_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 991); } #[bench] fn comparef64_len0000991_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 991); } #[bench] fn comparef64_len0001321_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 1321); } #[bench] fn comparef64_len0001321_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 1321); } #[bench] fn comparef64_len0001409_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 1409); } #[bench] fn comparef64_len0001409_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 1409); } #[bench] fn comparef64_len0001453_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 1453); } #[bench] fn comparef64_len0001453_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 1453); } #[bench] fn comparef64_len0001783_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 1783); } #[bench] fn comparef64_len0001783_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 1783); } #[bench] fn comparef64_len0002113_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 2113); } #[bench] fn comparef64_len0002113_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 2113); } #[bench] fn comparef64_len0002179_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 2179); } #[bench] fn comparef64_len0002179_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 2179); } #[bench] fn comparef64_len0002311_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 2311); } #[bench] fn comparef64_len0002311_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 2311); } #[bench] fn comparef64_len0002377_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 2377); } #[bench] fn comparef64_len0002377_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 2377); } #[bench] fn comparef64_len0002663_11p03_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 2663); } #[bench] fn comparef64_len0002663_11p03_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 2663); } #[bench] fn comparef64_len0002971_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 2971); } #[bench] fn comparef64_len0002971_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 2971); } #[bench] fn comparef64_len0003169_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 3169); } #[bench] fn comparef64_len0003169_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 3169); } #[bench] fn comparef64_len0003301_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 3301); } #[bench] fn comparef64_len0003301_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 3301); } #[bench] fn comparef64_len0003389_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 3389); } #[bench] fn 
comparef64_len0003389_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 3389); } #[bench] fn comparef64_len0003631_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 3631); } #[bench] fn comparef64_len0003631_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 3631); } #[bench] fn comparef64_len0003697_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 3697); } #[bench] fn comparef64_len0003697_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 3697); } #[bench] fn comparef64_len0003851_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 3851); } #[bench] fn comparef64_len0003851_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 3851); } #[bench] fn comparef64_len0004159_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 4159); } #[bench] fn comparef64_len0004159_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 4159); } #[bench] fn comparef64_len0004357_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 4357); } #[bench] fn comparef64_len0004357_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 4357); } #[bench] fn comparef64_len0004621_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 4621); } #[bench] fn comparef64_len0004621_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 4621); } #[bench] fn comparef64_len0004951_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 4951); } #[bench] fn comparef64_len0004951_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 4951); } #[bench] fn comparef64_len0005281_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 5281); } #[bench] fn comparef64_len0005281_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 5281); } #[bench] fn comparef64_len0005347_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 5347); } #[bench] fn comparef64_len0005347_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 5347); } #[bench] fn comparef64_len0005501_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 5501); } #[bench] fn comparef64_len0005501_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 5501); } #[bench] fn comparef64_len0006337_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 6337); } #[bench] fn comparef64_len0006337_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 6337); } #[bench] fn comparef64_len0006469_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 6469); } #[bench] fn comparef64_len0006469_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 6469); } #[bench] fn comparef64_len0007129_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 7129); } #[bench] fn comparef64_len0007129_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 7129); } #[bench] fn comparef64_len0007393_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 7393); } #[bench] fn comparef64_len0007393_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 7393); } #[bench] fn comparef64_len0007547_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 7547); } #[bench] fn comparef64_len0007547_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 7547); } #[bench] fn comparef64_len0008317_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 8317); } #[bench] fn comparef64_len0008317_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 8317); } #[bench] fn 
comparef64_len0008713_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 8713); } #[bench] fn comparef64_len0008713_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 8713); } #[bench] fn comparef64_len0009241_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 9241); } #[bench] fn comparef64_len0009241_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 9241); } #[bench] fn comparef64_len0009857_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 9857); } #[bench] fn comparef64_len0009857_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 9857); } #[bench] fn comparef64_len0009901_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 9901); } #[bench] fn comparef64_len0009901_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 9901); } #[bench] fn comparef64_len0010781_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 10781); } #[bench] fn comparef64_len0010781_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 10781); } #[bench] fn comparef64_len0010891_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 10891); } #[bench] fn comparef64_len0010891_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 10891); } #[bench] fn comparef64_len0011551_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 11551); } #[bench] fn comparef64_len0011551_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 11551); } #[bench] fn comparef64_len0011617_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 11617); } #[bench] fn comparef64_len0011617_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 11617); } #[bench] fn comparef64_len0012101_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 12101); } #[bench] fn comparef64_len0012101_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 12101); } #[bench] fn comparef64_len0013553_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 13553); } #[bench] fn comparef64_len0013553_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 13553); } #[bench] fn comparef64_len0013751_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 13751); } #[bench] fn comparef64_len0013751_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 13751); } #[bench] fn comparef64_len0014081_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 14081); } #[bench] fn comparef64_len0014081_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 14081); } #[bench] fn comparef64_len0014851_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 14851); } #[bench] fn comparef64_len0014851_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 14851); } #[bench] fn comparef64_len0015401_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 15401); } #[bench] fn comparef64_len0015401_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 15401); } #[bench] fn comparef64_len0015973_11p03_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 15973); } #[bench] fn comparef64_len0015973_11p03_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 15973); } #[bench] fn comparef64_len0016633_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 16633); } #[bench] fn comparef64_len0016633_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 16633); } #[bench] fn comparef64_len0018481_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 18481); } 
#[bench] fn comparef64_len0018481_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 18481); } #[bench] fn comparef64_len0019009_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 19009); } #[bench] fn comparef64_len0019009_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 19009); } #[bench] fn comparef64_len0019603_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 19603); } #[bench] fn comparef64_len0019603_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 19603); } #[bench] fn comparef64_len0019801_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 19801); } #[bench] fn comparef64_len0019801_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 19801); } #[bench] fn comparef64_len0021121_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 21121); } #[bench] fn comparef64_len0021121_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 21121); } #[bench] fn comparef64_len0022639_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 22639); } #[bench] fn comparef64_len0022639_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 22639); } #[bench] fn comparef64_len0023761_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 23761); } #[bench] fn comparef64_len0023761_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 23761); } #[bench] fn comparef64_len0025411_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 25411); } #[bench] fn comparef64_len0025411_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 25411); } #[bench] fn comparef64_len0025873_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 25873); } #[bench] fn comparef64_len0025873_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 25873); } #[bench] fn comparef64_len0026731_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 26731); } #[bench] fn comparef64_len0026731_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 26731); } #[bench] fn comparef64_len0026951_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 26951); } #[bench] fn comparef64_len0026951_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 26951); } #[bench] fn comparef64_len0028513_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 28513); } #[bench] fn comparef64_len0028513_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 28513); } #[bench] fn comparef64_len0029569_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 29569); } #[bench] fn comparef64_len0029569_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 29569); } #[bench] fn comparef64_len0030493_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 30493); } #[bench] fn comparef64_len0030493_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 30493); } #[bench] fn comparef64_len0030977_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 30977); } #[bench] fn comparef64_len0030977_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 30977); } #[bench] fn comparef64_len0032077_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 32077); } #[bench] fn comparef64_len0032077_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 32077); } #[bench] fn comparef64_len0032341_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 32341); } #[bench] fn comparef64_len0032341_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 
32341); } #[bench] fn comparef64_len0034651_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 34651); } #[bench] fn comparef64_len0034651_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 34651); } #[bench] fn comparef64_len0034849_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 34849); } #[bench] fn comparef64_len0034849_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 34849); } #[bench] fn comparef64_len0035201_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 35201); } #[bench] fn comparef64_len0035201_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 35201); } #[bench] fn comparef64_len0037423_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 37423); } #[bench] fn comparef64_len0037423_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 37423); } #[bench] fn comparef64_len0038501_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 38501); } #[bench] fn comparef64_len0038501_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 38501); } #[bench] fn comparef64_len0047521_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 47521); } #[bench] fn comparef64_len0047521_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 47521); } #[bench] fn comparef64_len0047917_11p03_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 47917); } #[bench] fn comparef64_len0047917_11p03_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 47917); } #[bench] fn comparef64_len0050821_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 50821); } #[bench] fn comparef64_len0050821_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 50821); } #[bench] fn comparef64_len0055001_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 55001); } #[bench] fn comparef64_len0055001_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 55001); } #[bench] fn comparef64_len0055441_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 55441); } #[bench] fn comparef64_len0055441_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 55441); } #[bench] fn comparef64_len0055903_11p03_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 55903); } #[bench] fn comparef64_len0055903_11p03_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 55903); } #[bench] fn comparef64_len0057751_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 57751); } #[bench] fn comparef64_len0057751_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 57751); } #[bench] fn comparef64_len0063361_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 63361); } #[bench] fn comparef64_len0063361_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 63361); } #[bench] fn comparef64_len0064153_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 64153); } #[bench] fn comparef64_len0064153_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 64153); } #[bench] fn comparef64_len0066529_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 66529); } #[bench] fn comparef64_len0066529_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 66529); } #[bench] fn comparef64_len0068993_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 68993); } #[bench] fn comparef64_len0068993_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 68993); } #[bench] fn comparef64_len0069697_11p02_bluesteins(b: &mut Bencher) { 
bench_planned_bluesteins_f64(b, 69697); } #[bench] fn comparef64_len0069697_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 69697); } #[bench] fn comparef64_len0076231_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 76231); } #[bench] fn comparef64_len0076231_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 76231); } #[bench] fn comparef64_len0077617_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 77617); } #[bench] fn comparef64_len0077617_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 77617); } #[bench] fn comparef64_len0079201_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 79201); } #[bench] fn comparef64_len0079201_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 79201); } #[bench] fn comparef64_len0079861_11p03_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 79861); } #[bench] fn comparef64_len0079861_11p03_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 79861); } #[bench] fn comparef64_len0080191_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 80191); } #[bench] fn comparef64_len0080191_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 80191); } #[bench] fn comparef64_len0084481_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 84481); } #[bench] fn comparef64_len0084481_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 84481); } #[bench] fn comparef64_len0084701_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 84701); } #[bench] fn comparef64_len0084701_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 84701); } #[bench] fn comparef64_len0087121_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 87121); } #[bench] fn comparef64_len0087121_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 87121); } #[bench] fn comparef64_len0088001_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 88001); } #[bench] fn comparef64_len0088001_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 88001); } #[bench] fn comparef64_len0089101_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 89101); } #[bench] fn comparef64_len0089101_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 89101); } #[bench] fn comparef64_len0092401_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 92401); } #[bench] fn comparef64_len0092401_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 92401); } #[bench] fn comparef64_len0097021_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 97021); } #[bench] fn comparef64_len0097021_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 97021); } #[bench] fn comparef64_len0098011_11p02_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 98011); } #[bench] fn comparef64_len0098011_11p02_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 98011); } #[bench] fn comparef64_len0098561_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 98561); } #[bench] fn comparef64_len0098561_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 98561); } #[bench] fn comparef64_len0099793_11p01_bluesteins(b: &mut Bencher) { bench_planned_bluesteins_f64(b, 99793); } #[bench] fn comparef64_len0099793_11p01_raders(b: &mut Bencher) { bench_planned_raders_f64(b, 99793); } */ rustfft-6.2.0/build.rs000064400000000000000000000007570072674642500130340ustar 00000000000000extern crate version_check; static MIN_RUSTC: &str = "1.61.0"; fn main() 
{ println!("cargo:rerun-if-changed=build.rs"); match version_check::is_min_version(MIN_RUSTC) { Some(true) => {} Some(false) => panic!( "\n====\nUnsupported rustc version {}\nRustFFT needs at least {}\n====\n", version_check::Version::read().unwrap(), MIN_RUSTC ), None => panic!("Unable to determine rustc version."), }; } rustfft-6.2.0/examples/asmtest.rs000064400000000000000000000025410072674642500152240ustar 00000000000000//! This example is mean to be used for inspecting the generated assembly. //! This can be interesting when working with simd intrinsics. //! //! To use: //! - Mark the function that should be investigated with `#[inline(never)]`. //! - If needed, add any required feature to the function, for example `#[target_feature(enable = "sse4.1")]` //! - Change the code below to use the changed function. //! Currently it is set up to look at the f32 version of the SSE 4-point butterfly. //! It uses the FftPlannerSse to plan a length 4 FFT, that will use the modified butterfly. //! - Ask rustc to output assembly code: //! `cargo rustc --release --features sse --example asmtest -- --emit=asm` //! - This will create a file at `target/release/examples/asmtest-0123456789abcdef.s` (with a random number in the filename). //! - Open this file and search for the function. use rustfft::num_complex::Complex32; //use rustfft::num_complex::Complex64; //use rustfft::FftPlannerScalar; use rustfft::FftPlannerSse; //use rustfft::FftPlannerNeon; fn main() { //let mut planner = FftPlannerScalar::new(); let mut planner = FftPlannerSse::new().unwrap(); //let mut planner = FftPlannerNeon::new().unwrap(); let fft = planner.plan_fft_forward(4); let mut buffer = vec![Complex32::new(0.0, 0.0); 100]; fft.process(&mut buffer); } rustfft-6.2.0/examples/concurrency.rs000064400000000000000000000012360072674642500160760ustar 00000000000000//! Show how to use an `FFT` object from multiple threads use std::sync::Arc; use std::thread; use rustfft::num_complex::Complex32; use rustfft::FftPlanner; fn main() { let mut planner = FftPlanner::new(); let fft = planner.plan_fft_forward(100); let threads: Vec> = (0..2) .map(|_| { let fft_copy = Arc::clone(&fft); thread::spawn(move || { let mut buffer = vec![Complex32::new(0.0, 0.0); 100]; fft_copy.process(&mut buffer); }) }) .collect(); for thread in threads { thread.join().unwrap(); } } rustfft-6.2.0/rustfmt.toml000064400000000000000000000004750072674642500137650ustar 00000000000000unstable_features = true # for "ignore", remove asap ignore = [ "benches/bench_rustfft.rs", "benches/bench_rustfft_sse.rs", "benches/bench_rustfft_neon.rs", "src/sse/sse_prime_butterflies.rs", "src/neon/neon_prime_butterflies.rs", "benches/bench_compare_scalar_neon.rs", ] rustfft-6.2.0/src/algorithm/bluesteins_algorithm.rs000064400000000000000000000215650072674642500207350ustar 00000000000000use std::sync::Arc; use num_complex::Complex; use num_traits::Zero; use crate::array_utils; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::{common::FftNum, twiddles, FftDirection}; use crate::{Direction, Fft, Length}; /// Implementation of Bluestein's Algorithm /// /// This algorithm computes an arbitrary-sized FFT in O(nlogn) time. It does this by converting this size-N FFT into a /// size-M FFT where M >= 2N - 1. /// /// The choice of M is very important for the performance of Bluestein's Algorithm. 
The most obvious choice is the next-largest /// power of two -- but if there's a smaller/faster FFT size that satisfies the `>= 2N - 1` requirement, that will significantly /// improve this algorithm's overall performance. /// /// ~~~ /// // Computes a forward FFT of size 1201, using Bluestein's Algorithm /// use rustfft::algorithm::BluesteinsAlgorithm; /// use rustfft::{Fft, FftPlanner}; /// use rustfft::num_complex::Complex; /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1201]; /// /// // We need to find an inner FFT whose size is greater than 1201*2 - 1. /// // The size 2401 (7^4) satisfies this requirement, while also being relatively fast. /// let mut planner = FftPlanner::new(); /// let inner_fft = planner.plan_fft_forward(2401); /// /// let fft = BluesteinsAlgorithm::new(1201, inner_fft); /// fft.process(&mut buffer); /// ~~~ /// /// Bluesteins's Algorithm is relatively expensive compared to other FFT algorithms. Benchmarking shows that it is up to /// an order of magnitude slower than similar composite sizes. In the example size above of 1201, benchmarking shows /// that it takes 5x more time to compute than computing a FFT of size 1200 via a step of MixedRadix. pub struct BluesteinsAlgorithm { inner_fft: Arc>, inner_fft_multiplier: Box<[Complex]>, twiddles: Box<[Complex]>, len: usize, direction: FftDirection, } impl BluesteinsAlgorithm { /// Creates a FFT instance which will process inputs/outputs of size `len`. `inner_fft.len()` must be >= `len * 2 - 1` /// /// Note that this constructor is quite expensive to run; This algorithm must compute a FFT using `inner_fft` within the /// constructor. This further underlines the fact that Bluesteins Algorithm is more expensive to run than other /// FFT algorithms /// /// # Panics /// Panics if `inner_fft.len() < len * 2 - 1`. pub fn new(len: usize, inner_fft: Arc>) -> Self { let inner_fft_len = inner_fft.len(); assert!(len * 2 - 1 <= inner_fft_len, "Bluestein's algorithm requires inner_fft.len() >= self.len() * 2 - 1. Expected >= {}, got {}", len * 2 - 1, inner_fft_len); // when computing FFTs, we're going to run our inner multiply pairise by some precomputed data, then run an inverse inner FFT. 
We need to precompute that inner data here let inner_fft_scale = T::one() / T::from_usize(inner_fft_len).unwrap(); let direction = inner_fft.fft_direction(); // Compute twiddle factors that we'll run our inner FFT on let mut inner_fft_input = vec![Complex::zero(); inner_fft_len]; twiddles::fill_bluesteins_twiddles( &mut inner_fft_input[..len], direction.opposite_direction(), ); // Scale the computed twiddles and copy them to the end of the array inner_fft_input[0] = inner_fft_input[0] * inner_fft_scale; for i in 1..len { let twiddle = inner_fft_input[i] * inner_fft_scale; inner_fft_input[i] = twiddle; inner_fft_input[inner_fft_len - i] = twiddle; } //Compute the inner fft let mut inner_fft_scratch = vec![Complex::zero(); inner_fft.get_inplace_scratch_len()]; inner_fft.process_with_scratch(&mut inner_fft_input, &mut inner_fft_scratch); // also compute some more mundane twiddle factors to start and end with let mut twiddles = vec![Complex::zero(); len]; twiddles::fill_bluesteins_twiddles(&mut twiddles, direction); Self { inner_fft: inner_fft, inner_fft_multiplier: inner_fft_input.into_boxed_slice(), twiddles: twiddles.into_boxed_slice(), len, direction, } } fn perform_fft_inplace(&self, input: &mut [Complex], scratch: &mut [Complex]) { let (inner_input, inner_scratch) = scratch.split_at_mut(self.inner_fft_multiplier.len()); // Copy the buffer into our inner FFT input. the buffer will only fill part of the FFT input, so zero fill the rest for ((buffer_entry, inner_entry), twiddle) in input .iter() .zip(inner_input.iter_mut()) .zip(self.twiddles.iter()) { *inner_entry = *buffer_entry * *twiddle; } for inner in (&mut inner_input[input.len()..]).iter_mut() { *inner = Complex::zero(); } // run our inner forward FFT self.inner_fft .process_with_scratch(inner_input, inner_scratch); // Multiply our inner FFT output by our precomputed data. Then, conjugate the result to set up for an inverse FFT for (inner, multiplier) in inner_input.iter_mut().zip(self.inner_fft_multiplier.iter()) { *inner = (*inner * *multiplier).conj(); } // inverse FFT. we're computing a forward but we're massaging it into an inverse by conjugating the inputs and outputs self.inner_fft .process_with_scratch(inner_input, inner_scratch); // copy our data back to the buffer, applying twiddle factors again as we go. Also conjugate inner_input to complete the inverse FFT for ((buffer_entry, inner_entry), twiddle) in input .iter_mut() .zip(inner_input.iter()) .zip(self.twiddles.iter()) { *buffer_entry = inner_entry.conj() * twiddle; } } fn perform_fft_out_of_place( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ) { let (inner_input, inner_scratch) = scratch.split_at_mut(self.inner_fft_multiplier.len()); // Copy the buffer into our inner FFT input. the buffer will only fill part of the FFT input, so zero fill the rest for ((buffer_entry, inner_entry), twiddle) in input .iter() .zip(inner_input.iter_mut()) .zip(self.twiddles.iter()) { *inner_entry = *buffer_entry * *twiddle; } for inner in inner_input.iter_mut().skip(input.len()) { *inner = Complex::zero(); } // run our inner forward FFT self.inner_fft .process_with_scratch(inner_input, inner_scratch); // Multiply our inner FFT output by our precomputed data. Then, conjugate the result to set up for an inverse FFT for (inner, multiplier) in inner_input.iter_mut().zip(self.inner_fft_multiplier.iter()) { *inner = (*inner * *multiplier).conj(); } // inverse FFT. 
we're computing a forward but we're massaging it into an inverse by conjugating the inputs and outputs self.inner_fft .process_with_scratch(inner_input, inner_scratch); // copy our data back to the buffer, applying twiddle factors again as we go. Also conjugate inner_input to complete the inverse FFT for ((buffer_entry, inner_entry), twiddle) in output .iter_mut() .zip(inner_input.iter()) .zip(self.twiddles.iter()) { *buffer_entry = inner_entry.conj() * twiddle; } } } boilerplate_fft!( BluesteinsAlgorithm, |this: &BluesteinsAlgorithm<_>| this.len, // FFT len |this: &BluesteinsAlgorithm<_>| this.inner_fft_multiplier.len() + this.inner_fft.get_inplace_scratch_len(), // in-place scratch len |this: &BluesteinsAlgorithm<_>| this.inner_fft_multiplier.len() + this.inner_fft.get_inplace_scratch_len() // out of place scratch len ); #[cfg(test)] mod unit_tests { use super::*; use crate::algorithm::Dft; use crate::test_utils::check_fft_algorithm; use std::sync::Arc; #[test] fn test_bluesteins_scalar() { for &len in &[3, 5, 7, 11, 13] { test_bluesteins_with_length(len, FftDirection::Forward); test_bluesteins_with_length(len, FftDirection::Inverse); } } fn test_bluesteins_with_length(len: usize, direction: FftDirection) { let inner_fft = Arc::new(Dft::new( (len * 2 - 1).checked_next_power_of_two().unwrap(), direction, )); let fft = BluesteinsAlgorithm::new(len, inner_fft); check_fft_algorithm::(&fft, len, direction); } } rustfft-6.2.0/src/algorithm/butterflies.rs000064400000000000000000007332110072674642500170400ustar 00000000000000use num_complex::Complex; use crate::{common::FftNum, FftDirection}; use crate::array_utils::{self, DoubleBuf, LoadStore}; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::twiddles; use crate::{Direction, Fft, Length}; #[allow(unused)] macro_rules! 
boilerplate_fft_butterfly { ($struct_name:ident, $len:expr, $direction_fn:expr) => { impl $struct_name { #[inline(always)] pub(crate) unsafe fn perform_fft_butterfly(&self, buffer: impl LoadStore) { self.perform_fft_contiguous(buffer); } } impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let result = array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { unsafe { self.perform_fft_butterfly(DoubleBuf { input: in_chunk, output: out_chunk, }) }; }, ); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], _scratch: &mut [Complex]) { if buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let result = array_utils::iter_chunks(buffer, self.len(), |chunk| unsafe { self.perform_fft_butterfly(chunk) }); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { 0 } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { $direction_fn(self) } } }; } pub struct Butterfly1 { direction: FftDirection, _phantom: std::marker::PhantomData, } impl Butterfly1 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { Self { direction, _phantom: std::marker::PhantomData, } } } impl Fft for Butterfly1 { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { output.copy_from_slice(&input); } fn process_with_scratch(&self, _buffer: &mut [Complex], _scratch: &mut [Complex]) {} fn get_inplace_scratch_len(&self) -> usize { 0 } fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for Butterfly1 { fn len(&self) -> usize { 1 } } impl Direction for Butterfly1 { fn fft_direction(&self) -> FftDirection { self.direction } } pub struct Butterfly2 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_butterfly!(Butterfly2, 2, |this: &Butterfly2<_>| this.direction); impl Butterfly2 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { Self { direction, _phantom: 

    #[inline(always)]
    unsafe fn perform_fft_strided(left: &mut Complex<T>, right: &mut Complex<T>) {
        let temp = *left + *right;

        *right = *left - *right;
        *left = temp;
    }

    #[inline(always)]
    unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore<T>) {
        let value0 = buffer.load(0);
        let value1 = buffer.load(1);
        buffer.store(value0 + value1, 0);
        buffer.store(value0 - value1, 1);
    }
}

pub struct Butterfly3<T> {
    pub twiddle: Complex<T>,
    direction: FftDirection,
}
boilerplate_fft_butterfly!(Butterfly3, 3, |this: &Butterfly3<_>| this.direction);
impl<T: FftNum> Butterfly3<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        Self {
            twiddle: twiddles::compute_twiddle(1, 3, direction),
            direction,
        }
    }

    #[inline(always)]
    pub fn direction_of(fft: &Butterfly3<T>) -> Self {
        Self {
            twiddle: fft.twiddle.conj(),
            direction: fft.direction.opposite_direction(),
        }
    }

    #[inline(always)]
    unsafe fn perform_fft_strided(
        &self,
        val0: &mut Complex<T>,
        val1: &mut Complex<T>,
        val2: &mut Complex<T>,
    ) {
        let xp = *val1 + *val2;
        let xn = *val1 - *val2;
        let sum = *val0 + xp;

        let temp_a = *val0
            + Complex {
                re: self.twiddle.re * xp.re,
                im: self.twiddle.re * xp.im,
            };
        let temp_b = Complex {
            re: -self.twiddle.im * xn.im,
            im: self.twiddle.im * xn.re,
        };

        *val0 = sum;
        *val1 = temp_a + temp_b;
        *val2 = temp_a - temp_b;
    }

    #[inline(always)]
    unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore<T>) {
        let xp = buffer.load(1) + buffer.load(2);
        let xn = buffer.load(1) - buffer.load(2);
        let sum = buffer.load(0) + xp;

        let temp_a = buffer.load(0)
            + Complex {
                re: self.twiddle.re * xp.re,
                im: self.twiddle.re * xp.im,
            };
        let temp_b = Complex {
            re: -self.twiddle.im * xn.im,
            im: self.twiddle.im * xn.re,
        };

        buffer.store(sum, 0);
        buffer.store(temp_a + temp_b, 1);
        buffer.store(temp_a - temp_b, 2);
    }
}

pub struct Butterfly4<T> {
    direction: FftDirection,
    _phantom: std::marker::PhantomData<T>,
}
boilerplate_fft_butterfly!(Butterfly4, 4, |this: &Butterfly4<_>| this.direction);
impl<T: FftNum> Butterfly4<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        Self {
            direction,
            _phantom: std::marker::PhantomData,
        }
    }

    #[inline(always)]
    unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore<T>) {
        //we're going to hardcode a step of mixed radix
        //aka we're going to do the six step algorithm

        // step 1: transpose, which we're skipping because we're just going to perform non-contiguous FFTs
        let mut value0 = buffer.load(0);
        let mut value1 = buffer.load(1);
        let mut value2 = buffer.load(2);
        let mut value3 = buffer.load(3);

        // step 2: column FFTs
        Butterfly2::perform_fft_strided(&mut value0, &mut value2);
        Butterfly2::perform_fft_strided(&mut value1, &mut value3);

        // step 3: apply twiddle factors (only one in this case, and it's either 0 + i or 0 - i)
        value3 = twiddles::rotate_90(value3, self.direction);

        // step 4: transpose, which we're skipping because the previous FFTs were non-contiguous

        // step 5: row FFTs
        Butterfly2::perform_fft_strided(&mut value0, &mut value1);
        Butterfly2::perform_fft_strided(&mut value2, &mut value3);

        // step 6: transpose by swapping index 1 and 2
        buffer.store(value0, 0);
        buffer.store(value2, 1);
        buffer.store(value1, 2);
        buffer.store(value3, 3);
    }
}

pub struct Butterfly5<T> {
    twiddle1: Complex<T>,
    twiddle2: Complex<T>,
    direction: FftDirection,
}
boilerplate_fft_butterfly!(Butterfly5, 5, |this: &Butterfly5<_>| this.direction);
impl<T: FftNum> Butterfly5<T> {
    pub fn new(direction: FftDirection) -> Self {
        Self {
            twiddle1: twiddles::compute_twiddle(1, 5, direction),
            twiddle2: twiddles::compute_twiddle(2, 5, direction),
            direction,
        }
    }

    #[inline(never)] // refusing to inline this code reduces code size, and doesn't hurt performance
    unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore<T>) {
        // let mut outer = Butterfly2::perform_fft_array([buffer.load(1), buffer.load(4)]);
        // let mut inner = Butterfly2::perform_fft_array([buffer.load(2), buffer.load(3)]);
        // let input0 = buffer.load(0);
        // buffer.store(input0 + outer[0] + inner[0], 0);
        // inner[1] = twiddles::rotate_90(inner[1], true);
        // outer[1] = twiddles::rotate_90(outer[1], true);
        // {
        //     let twiddled1 = outer[0] * self.twiddles[0].re;
        //     let twiddled2 = inner[0] * self.twiddles[1].re;
        //     let twiddled3 = inner[1] * self.twiddles[1].im;
        //     let twiddled4 = outer[1] * self.twiddles[0].im;
        //     let sum12 = twiddled1 + twiddled2;
        //     let sum34 = twiddled4 + twiddled3;
        //     let output1 = sum12 + sum34;
        //     let output4 = sum12 - sum34;
        //     buffer.store(input0 + output1, 1);
        //     buffer.store(input0 + output4, 4);
        // }
        // {
        //     let twiddled1 = outer[0] * self.twiddles[1].re;
        //     let twiddled2 = inner[0] * self.twiddles[0].re;
        //     let twiddled3 = inner[1] * self.twiddles[0].im;
        //     let twiddled4 = outer[1] * self.twiddles[1].im;
        // }
        // Let's do a plain 5-point Dft
        // |X0|   | W0 W0  W0  W0  W0  |   |x0|
        // |X1|   | W0 W1  W2  W3  W4  |   |x1|
        // |X2| = | W0 W2  W4  W6  W8  | * |x2|
        // |X3|   | W0 W3  W6  W9  W12 |   |x3|
        // |X4|   | W0 W4  W8  W12 W16 |   |x4|
        //
        // where Wn = exp(-2*pi*n/5) for a forward transform, and exp(+2*pi*n/5) for an inverse.
        //
        // This can be simplified a bit since exp(-2*pi*n/5) = exp(-2*pi*n/5 + m*2*pi)
        // |X0|   | W0 W0  W0  W0  W0 |   |x0|
        // |X1|   | W0 W1  W2  W3  W4 |   |x1|
        // |X2| = | W0 W2  W4  W1  W3 | * |x2|
        // |X3|   | W0 W3  W1  W4  W2 |   |x3|
        // |X4|   | W0 W4  W3  W2  W1 |   |x4|
        //
        // Next we can use the symmetry that W3 = W2* and W4 = W1* (where * means complex conjugate), and W0 = 1
        // |X0|   | 1  1   1   1   1   |   |x0|
        // |X1|   | 1  W1  W2  W2* W1* |   |x1|
        // |X2| = | 1  W2  W1* W1  W2* | * |x2|
        // |X3|   | 1  W2* W1  W1* W2  |   |x3|
        // |X4|   | 1  W1* W2* W2  W1  |   |x4|
        //
        // Next, we write out the whole expression with real and imaginary parts.
        // X0 = x0 + x1 + x2 + x3 + x4
        // X1 = x0 + (W1.re + j*W1.im)*x1 + (W2.re + j*W2.im)*x2 + (W2.re - j*W2.im)*x3 + (W1.re - j*W1.im)*x4
        // X2 = x0 + (W2.re + j*W2.im)*x1 + (W1.re - j*W1.im)*x2 + (W1.re + j*W1.im)*x3 + (W2.re - j*W2.im)*x4
        // X3 = x0 + (W2.re - j*W2.im)*x1 + (W1.re + j*W1.im)*x2 + (W1.re - j*W1.im)*x3 + (W2.re + j*W2.im)*x4
        // X4 = x0 + (W1.re - j*W1.im)*x1 + (W2.re - j*W2.im)*x2 + (W2.re + j*W2.im)*x3 + (W1.re + j*W1.im)*x4
        //
        // Then we rearrange and sort terms.
        // X0 = x0 + x1 + x2 + x3 + x4
        // X1 = x0 + W1.re*(x1+x4) + W2.re*(x2+x3) + j*(W1.im*(x1-x4) + W2.im*(x2-x3))
        // X2 = x0 + W1.re*(x2+x3) + W2.re*(x1+x4) - j*(W1.im*(x2-x3) - W2.im*(x1-x4))
        // X3 = x0 + W1.re*(x2+x3) + W2.re*(x1+x4) + j*(W1.im*(x2-x3) - W2.im*(x1-x4))
        // X4 = x0 + W1.re*(x1+x4) + W2.re*(x2+x3) - j*(W1.im*(x1-x4) + W2.im*(x2-x3))
        //
        // Now we define x14p=x1+x4 x14n=x1-x4, x23p=x2+x3, x23n=x2-x3
        // X0 = x0 + x1 + x2 + x3 + x4
        // X1 = x0 + W1.re*(x14p) + W2.re*(x23p) + j*(W1.im*(x14n) + W2.im*(x23n))
        // X2 = x0 + W1.re*(x23p) + W2.re*(x14p) - j*(W1.im*(x23n) - W2.im*(x14n))
        // X3 = x0 + W1.re*(x23p) + W2.re*(x14p) + j*(W1.im*(x23n) - W2.im*(x14n))
        // X4 = x0 + W1.re*(x14p) + W2.re*(x23p) - j*(W1.im*(x14n) + W2.im*(x23n))
        //
        // The final step is to write out the real and imaginary parts of x14n etc, and replace them using j*j=-1
        // After this it's easy to remove any repeated calculation of the same values.
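        // Added illustration (approximate values, not part of the original derivation): for a
        // forward transform, the two twiddles computed in `new` are roughly
        //   W1 = exp(-2*pi*i*1/5) ~=  0.3090 - 0.9511i
        //   W2 = exp(-2*pi*i*2/5) ~= -0.8090 - 0.5878i
        // so, for example, X1 ~= x0 + 0.3090*x14p - 0.8090*x23p + j*(-0.9511*x14n - 0.5878*x23n),
        // which is what the b14re_*/b14im_* terms below assemble.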
let x14p = buffer.load(1) + buffer.load(4); let x14n = buffer.load(1) - buffer.load(4); let x23p = buffer.load(2) + buffer.load(3); let x23n = buffer.load(2) - buffer.load(3); let sum = buffer.load(0) + x14p + x23p; let b14re_a = buffer.load(0).re + self.twiddle1.re * x14p.re + self.twiddle2.re * x23p.re; let b14re_b = self.twiddle1.im * x14n.im + self.twiddle2.im * x23n.im; let b23re_a = buffer.load(0).re + self.twiddle2.re * x14p.re + self.twiddle1.re * x23p.re; let b23re_b = self.twiddle2.im * x14n.im + -self.twiddle1.im * x23n.im; let b14im_a = buffer.load(0).im + self.twiddle1.re * x14p.im + self.twiddle2.re * x23p.im; let b14im_b = self.twiddle1.im * x14n.re + self.twiddle2.im * x23n.re; let b23im_a = buffer.load(0).im + self.twiddle2.re * x14p.im + self.twiddle1.re * x23p.im; let b23im_b = self.twiddle2.im * x14n.re + -self.twiddle1.im * x23n.re; let out1re = b14re_a - b14re_b; let out1im = b14im_a + b14im_b; let out2re = b23re_a - b23re_b; let out2im = b23im_a + b23im_b; let out3re = b23re_a + b23re_b; let out3im = b23im_a - b23im_b; let out4re = b14re_a + b14re_b; let out4im = b14im_a - b14im_b; buffer.store(sum, 0); buffer.store( Complex { re: out1re, im: out1im, }, 1, ); buffer.store( Complex { re: out2re, im: out2im, }, 2, ); buffer.store( Complex { re: out3re, im: out3im, }, 3, ); buffer.store( Complex { re: out4re, im: out4im, }, 4, ); } } pub struct Butterfly6 { butterfly3: Butterfly3, } boilerplate_fft_butterfly!(Butterfly6, 6, |this: &Butterfly6<_>| this .butterfly3 .fft_direction()); impl Butterfly6 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { Self { butterfly3: Butterfly3::new(direction), } } #[inline(always)] pub fn direction_of(fft: &Butterfly6) -> Self { Self { butterfly3: Butterfly3::direction_of(&fft.butterfly3), } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { //since GCD(2,3) == 1 we're going to hardcode a step of the Good-Thomas algorithm to avoid twiddle factors // step 1: reorder the input directly into the scratch. normally there's a whole thing to compute this ordering //but thankfully we can just precompute it and hardcode it let mut scratch_a = [buffer.load(0), buffer.load(2), buffer.load(4)]; let mut scratch_b = [buffer.load(3), buffer.load(5), buffer.load(1)]; // step 2: column FFTs self.butterfly3.perform_fft_contiguous(&mut scratch_a); self.butterfly3.perform_fft_contiguous(&mut scratch_b); // step 3: apply twiddle factors -- SKIPPED because good-thomas doesn't have twiddle factors :) // step 4: SKIPPED because the next FFTs will be non-contiguous // step 5: row FFTs Butterfly2::perform_fft_strided(&mut scratch_a[0], &mut scratch_b[0]); Butterfly2::perform_fft_strided(&mut scratch_a[1], &mut scratch_b[1]); Butterfly2::perform_fft_strided(&mut scratch_a[2], &mut scratch_b[2]); // step 6: reorder the result back into the buffer. 
again we would normally have to do an expensive computation // but instead we can precompute and hardcode the ordering // note that we're also rolling a transpose step into this reorder buffer.store(scratch_a[0], 0); buffer.store(scratch_b[1], 1); buffer.store(scratch_a[2], 2); buffer.store(scratch_b[0], 3); buffer.store(scratch_a[1], 4); buffer.store(scratch_b[2], 5); } } pub struct Butterfly7 { twiddle1: Complex, twiddle2: Complex, twiddle3: Complex, direction: FftDirection, } boilerplate_fft_butterfly!(Butterfly7, 7, |this: &Butterfly7<_>| this.direction); impl Butterfly7 { pub fn new(direction: FftDirection) -> Self { Self { twiddle1: twiddles::compute_twiddle(1, 7, direction), twiddle2: twiddles::compute_twiddle(2, 7, direction), twiddle3: twiddles::compute_twiddle(3, 7, direction), direction, } } #[inline(never)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { // let mut outer = Butterfly2::perform_fft_array([buffer.load(1), buffer.load(6)]); // let mut mid = Butterfly2::perform_fft_array([buffer.load(2), buffer.load(5)]); // let mut inner = Butterfly2::perform_fft_array([buffer.load(3), buffer.load(4)]); // let input0 = buffer.load(0); // buffer.store(input0 + outer[0] + mid[0] + inner[0], 0); // inner[1] = twiddles::rotate_90(inner[1], true); // mid[1] = twiddles::rotate_90(mid[1], true); // outer[1] = twiddles::rotate_90(outer[1], true); // { // let twiddled1 = outer[0] * self.twiddles[0].re; // let twiddled2 = mid[0] * self.twiddles[1].re; // let twiddled3 = inner[0] * self.twiddles[2].re; // let twiddled4 = inner[1] * self.twiddles[2].im; // let twiddled5 = mid[1] * self.twiddles[1].im; // let twiddled6 = outer[1] * self.twiddles[0].im; // let sum123 = twiddled1 + twiddled2 + twiddled3; // let sum456 = twiddled4 + twiddled5 + twiddled6; // let output1 = sum123 + sum456; // let output6 = sum123 - sum456; // buffer.store(input0 + output1, 1); // buffer.store(input0 + output6, 6); // } // { // let twiddled1 = outer[0] * self.twiddles[1].re; // let twiddled2 = mid[0] * self.twiddles[2].re; // let twiddled3 = inner[0] * self.twiddles[0].re; // let twiddled4 = inner[1] * self.twiddles[0].im; // let twiddled5 = mid[1] * self.twiddles[2].im; // let twiddled6 = outer[1] * self.twiddles[1].im; // let sum123 = twiddled1 + twiddled2 + twiddled3; // let sum456 = twiddled6 - twiddled4 - twiddled5; // let output2 = sum123 + sum456; // let output5 = sum123 - sum456; // buffer.store(input0 + output2, 2); // buffer.store(input0 + output5, 5); // } // Let's do a plain 7-point Dft // |X0| | W0 W0 W0 W0 W0 W0 W0 | |x0| // |X1| | W0 W1 W2 W3 W4 W5 W6 | |x1| // |X2| | W0 W2 W4 W6 W8 W10 W12 | |x2| // |X3| = | W0 W3 W6 W9 W12 W15 W18 | * |x3| // |X4| | W0 W4 W8 W12 W16 W20 W24 | |x4| // |X5| | W0 W5 W10 W15 W20 W25 W30 | |x4| // |X6| | W0 W6 W12 W18 W24 W30 W36 | |x4| // // where Wn = exp(-2*pi*n/7) for a forward transform, and exp(+2*pi*n/7) for an direction. // // Using the same logic as for the 5-point butterfly, this can be simplified to: // |X0| | 1 1 1 1 1 1 1 | |x0| // |X1| | 1 W1 W2 W3 W3* W2* W1* | |x1| // |X2| | 1 W2 W3* W1* W1 W3 W2* | |x2| // |X3| = | 1 W3 W1* W2 W2* W1 W3* | * |x3| // |X4| | 1 W3* W1 W2* W2 W1* W3 | |x4| // |X5| | 1 W2* W3 W1 W1* W3* W2 | |x5| // |X6| | 1 W1* W2* W3* W3 W2 W1 | |x6| // // From here it's just about eliminating repeated calculations, following the same procedure as for the 5-point butterfly. 
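        // Added illustration (approximate values, not in the original source): for a forward
        // transform of length 7 the twiddles computed in `new` are roughly
        //   W1 = exp(-2*pi*i*1/7) ~=  0.6235 - 0.7818i
        //   W2 = exp(-2*pi*i*2/7) ~= -0.2225 - 0.9749i
        //   W3 = exp(-2*pi*i*3/7) ~= -0.9010 - 0.4339i
        // These are the self.twiddle1/2/3 values that multiply the x16/x25/x34 pair sums and
        // differences in the x*re_a / x*re_b / x*im_a / x*im_b terms below.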
let x16p = buffer.load(1) + buffer.load(6); let x16n = buffer.load(1) - buffer.load(6); let x25p = buffer.load(2) + buffer.load(5); let x25n = buffer.load(2) - buffer.load(5); let x34p = buffer.load(3) + buffer.load(4); let x34n = buffer.load(3) - buffer.load(4); let sum = buffer.load(0) + x16p + x25p + x34p; let x16re_a = buffer.load(0).re + self.twiddle1.re * x16p.re + self.twiddle2.re * x25p.re + self.twiddle3.re * x34p.re; let x16re_b = self.twiddle1.im * x16n.im + self.twiddle2.im * x25n.im + self.twiddle3.im * x34n.im; let x25re_a = buffer.load(0).re + self.twiddle1.re * x34p.re + self.twiddle2.re * x16p.re + self.twiddle3.re * x25p.re; let x25re_b = -self.twiddle1.im * x34n.im + self.twiddle2.im * x16n.im - self.twiddle3.im * x25n.im; let x34re_a = buffer.load(0).re + self.twiddle1.re * x25p.re + self.twiddle2.re * x34p.re + self.twiddle3.re * x16p.re; let x34re_b = -self.twiddle1.im * x25n.im + self.twiddle2.im * x34n.im + self.twiddle3.im * x16n.im; let x16im_a = buffer.load(0).im + self.twiddle1.re * x16p.im + self.twiddle2.re * x25p.im + self.twiddle3.re * x34p.im; let x16im_b = self.twiddle1.im * x16n.re + self.twiddle2.im * x25n.re + self.twiddle3.im * x34n.re; let x25im_a = buffer.load(0).im + self.twiddle1.re * x34p.im + self.twiddle2.re * x16p.im + self.twiddle3.re * x25p.im; let x25im_b = -self.twiddle1.im * x34n.re + self.twiddle2.im * x16n.re - self.twiddle3.im * x25n.re; let x34im_a = buffer.load(0).im + self.twiddle1.re * x25p.im + self.twiddle2.re * x34p.im + self.twiddle3.re * x16p.im; let x34im_b = self.twiddle1.im * x25n.re - self.twiddle2.im * x34n.re - self.twiddle3.im * x16n.re; let out1re = x16re_a - x16re_b; let out1im = x16im_a + x16im_b; let out2re = x25re_a - x25re_b; let out2im = x25im_a + x25im_b; let out3re = x34re_a - x34re_b; let out3im = x34im_a - x34im_b; let out4re = x34re_a + x34re_b; let out4im = x34im_a + x34im_b; let out5re = x25re_a + x25re_b; let out5im = x25im_a - x25im_b; let out6re = x16re_a + x16re_b; let out6im = x16im_a - x16im_b; buffer.store(sum, 0); buffer.store( Complex { re: out1re, im: out1im, }, 1, ); buffer.store( Complex { re: out2re, im: out2im, }, 2, ); buffer.store( Complex { re: out3re, im: out3im, }, 3, ); buffer.store( Complex { re: out4re, im: out4im, }, 4, ); buffer.store( Complex { re: out5re, im: out5im, }, 5, ); buffer.store( Complex { re: out6re, im: out6im, }, 6, ); } } pub struct Butterfly8 { root2: T, direction: FftDirection, } boilerplate_fft_butterfly!(Butterfly8, 8, |this: &Butterfly8<_>| this.direction); impl Butterfly8 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { Self { root2: T::from_f64(0.5f64.sqrt()).unwrap(), direction, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { let butterfly4 = Butterfly4::new(self.direction); //we're going to hardcode a step of mixed radix //aka we're going to do the six step algorithm // step 1: transpose the input into the scratch let mut scratch0 = [ buffer.load(0), buffer.load(2), buffer.load(4), buffer.load(6), ]; let mut scratch1 = [ buffer.load(1), buffer.load(3), buffer.load(5), buffer.load(7), ]; // step 2: column FFTs butterfly4.perform_fft_contiguous(&mut scratch0); butterfly4.perform_fft_contiguous(&mut scratch1); // step 3: apply twiddle factors scratch1[1] = (twiddles::rotate_90(scratch1[1], self.direction) + scratch1[1]) * self.root2; scratch1[2] = twiddles::rotate_90(scratch1[2], self.direction); scratch1[3] = (twiddles::rotate_90(scratch1[3], self.direction) - scratch1[3]) * self.root2; // step 4: 
transpose -- skipped because we're going to do the next FFTs non-contiguously // step 5: row FFTs for i in 0..4 { Butterfly2::perform_fft_strided(&mut scratch0[i], &mut scratch1[i]); } // step 6: copy data to the output. we don't need to transpose, because we skipped the step 4 transpose for i in 0..4 { buffer.store(scratch0[i], i); } for i in 0..4 { buffer.store(scratch1[i], i + 4); } } } pub struct Butterfly9 { butterfly3: Butterfly3, twiddle1: Complex, twiddle2: Complex, twiddle4: Complex, } boilerplate_fft_butterfly!(Butterfly9, 9, |this: &Butterfly9<_>| this .butterfly3 .fft_direction()); impl Butterfly9 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { Self { butterfly3: Butterfly3::new(direction), twiddle1: twiddles::compute_twiddle(1, 9, direction), twiddle2: twiddles::compute_twiddle(2, 9, direction), twiddle4: twiddles::compute_twiddle(4, 9, direction), } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { // algorithm: mixed radix with width=3 and height=3 // step 1: transpose the input into the scratch let mut scratch0 = [buffer.load(0), buffer.load(3), buffer.load(6)]; let mut scratch1 = [buffer.load(1), buffer.load(4), buffer.load(7)]; let mut scratch2 = [buffer.load(2), buffer.load(5), buffer.load(8)]; // step 2: column FFTs self.butterfly3.perform_fft_contiguous(&mut scratch0); self.butterfly3.perform_fft_contiguous(&mut scratch1); self.butterfly3.perform_fft_contiguous(&mut scratch2); // step 3: apply twiddle factors scratch1[1] = scratch1[1] * self.twiddle1; scratch1[2] = scratch1[2] * self.twiddle2; scratch2[1] = scratch2[1] * self.twiddle2; scratch2[2] = scratch2[2] * self.twiddle4; // step 4: SKIPPED because the next FFTs will be non-contiguous // step 5: row FFTs self.butterfly3 .perform_fft_strided(&mut scratch0[0], &mut scratch1[0], &mut scratch2[0]); self.butterfly3 .perform_fft_strided(&mut scratch0[1], &mut scratch1[1], &mut scratch2[1]); self.butterfly3 .perform_fft_strided(&mut scratch0[2], &mut scratch1[2], &mut scratch2[2]); // step 6: copy the result into the output. normally we'd need to do a transpose here, but we can skip it because we skipped the transpose in step 4 buffer.store(scratch0[0], 0); buffer.store(scratch0[1], 1); buffer.store(scratch0[2], 2); buffer.store(scratch1[0], 3); buffer.store(scratch1[1], 4); buffer.store(scratch1[2], 5); buffer.store(scratch2[0], 6); buffer.store(scratch2[1], 7); buffer.store(scratch2[2], 8); } } pub struct Butterfly11 { twiddle1: Complex, twiddle2: Complex, twiddle3: Complex, twiddle4: Complex, twiddle5: Complex, direction: FftDirection, } boilerplate_fft_butterfly!(Butterfly11, 11, |this: &Butterfly11<_>| this.direction); impl Butterfly11 { pub fn new(direction: FftDirection) -> Self { let twiddle1: Complex = twiddles::compute_twiddle(1, 11, direction); let twiddle2: Complex = twiddles::compute_twiddle(2, 11, direction); let twiddle3: Complex = twiddles::compute_twiddle(3, 11, direction); let twiddle4: Complex = twiddles::compute_twiddle(4, 11, direction); let twiddle5: Complex = twiddles::compute_twiddle(5, 11, direction); Self { twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, direction, } } #[inline(never)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { // This function was derived in the same manner as the butterflies for length 3, 5 and 7. // However, instead of doing it by hand the actual code is autogenerated // with the `genbutterflies.py` script in the `tools` directory. 
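        // Added note (an observation about the generated code, not taken from the generator):
        // for an odd length N, with xp_m = x[m] + x[N-m] and xn_m = x[m] - x[N-m] for
        // m = 1..(N-1)/2, each output is assembled as
        //   X[k]   = x[0] + sum_m(W(k*m).re * xp_m) + j * sum_m(W(k*m).im * xn_m)
        //   X[N-k] = the same two sums, but with the sign of the j-term flipped,
        // where W(k*m) = exp(-2*pi*i*k*m/N) for a forward transform. In the code, an exponent
        // r = k*m mod N with r > (N-1)/2 appears as twiddle(N-r) with a negated .im coefficient.
        // The b*_a / b*_b pairs below are the real and imaginary pieces of those two sums.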
let x110p = buffer.load(1) + buffer.load(10); let x110n = buffer.load(1) - buffer.load(10); let x29p = buffer.load(2) + buffer.load(9); let x29n = buffer.load(2) - buffer.load(9); let x38p = buffer.load(3) + buffer.load(8); let x38n = buffer.load(3) - buffer.load(8); let x47p = buffer.load(4) + buffer.load(7); let x47n = buffer.load(4) - buffer.load(7); let x56p = buffer.load(5) + buffer.load(6); let x56n = buffer.load(5) - buffer.load(6); let sum = buffer.load(0) + x110p + x29p + x38p + x47p + x56p; let b110re_a = buffer.load(0).re + self.twiddle1.re * x110p.re + self.twiddle2.re * x29p.re + self.twiddle3.re * x38p.re + self.twiddle4.re * x47p.re + self.twiddle5.re * x56p.re; let b110re_b = self.twiddle1.im * x110n.im + self.twiddle2.im * x29n.im + self.twiddle3.im * x38n.im + self.twiddle4.im * x47n.im + self.twiddle5.im * x56n.im; let b29re_a = buffer.load(0).re + self.twiddle2.re * x110p.re + self.twiddle4.re * x29p.re + self.twiddle5.re * x38p.re + self.twiddle3.re * x47p.re + self.twiddle1.re * x56p.re; let b29re_b = self.twiddle2.im * x110n.im + self.twiddle4.im * x29n.im + -self.twiddle5.im * x38n.im + -self.twiddle3.im * x47n.im + -self.twiddle1.im * x56n.im; let b38re_a = buffer.load(0).re + self.twiddle3.re * x110p.re + self.twiddle5.re * x29p.re + self.twiddle2.re * x38p.re + self.twiddle1.re * x47p.re + self.twiddle4.re * x56p.re; let b38re_b = self.twiddle3.im * x110n.im + -self.twiddle5.im * x29n.im + -self.twiddle2.im * x38n.im + self.twiddle1.im * x47n.im + self.twiddle4.im * x56n.im; let b47re_a = buffer.load(0).re + self.twiddle4.re * x110p.re + self.twiddle3.re * x29p.re + self.twiddle1.re * x38p.re + self.twiddle5.re * x47p.re + self.twiddle2.re * x56p.re; let b47re_b = self.twiddle4.im * x110n.im + -self.twiddle3.im * x29n.im + self.twiddle1.im * x38n.im + self.twiddle5.im * x47n.im + -self.twiddle2.im * x56n.im; let b56re_a = buffer.load(0).re + self.twiddle5.re * x110p.re + self.twiddle1.re * x29p.re + self.twiddle4.re * x38p.re + self.twiddle2.re * x47p.re + self.twiddle3.re * x56p.re; let b56re_b = self.twiddle5.im * x110n.im + -self.twiddle1.im * x29n.im + self.twiddle4.im * x38n.im + -self.twiddle2.im * x47n.im + self.twiddle3.im * x56n.im; let b110im_a = buffer.load(0).im + self.twiddle1.re * x110p.im + self.twiddle2.re * x29p.im + self.twiddle3.re * x38p.im + self.twiddle4.re * x47p.im + self.twiddle5.re * x56p.im; let b110im_b = self.twiddle1.im * x110n.re + self.twiddle2.im * x29n.re + self.twiddle3.im * x38n.re + self.twiddle4.im * x47n.re + self.twiddle5.im * x56n.re; let b29im_a = buffer.load(0).im + self.twiddle2.re * x110p.im + self.twiddle4.re * x29p.im + self.twiddle5.re * x38p.im + self.twiddle3.re * x47p.im + self.twiddle1.re * x56p.im; let b29im_b = self.twiddle2.im * x110n.re + self.twiddle4.im * x29n.re + -self.twiddle5.im * x38n.re + -self.twiddle3.im * x47n.re + -self.twiddle1.im * x56n.re; let b38im_a = buffer.load(0).im + self.twiddle3.re * x110p.im + self.twiddle5.re * x29p.im + self.twiddle2.re * x38p.im + self.twiddle1.re * x47p.im + self.twiddle4.re * x56p.im; let b38im_b = self.twiddle3.im * x110n.re + -self.twiddle5.im * x29n.re + -self.twiddle2.im * x38n.re + self.twiddle1.im * x47n.re + self.twiddle4.im * x56n.re; let b47im_a = buffer.load(0).im + self.twiddle4.re * x110p.im + self.twiddle3.re * x29p.im + self.twiddle1.re * x38p.im + self.twiddle5.re * x47p.im + self.twiddle2.re * x56p.im; let b47im_b = self.twiddle4.im * x110n.re + -self.twiddle3.im * x29n.re + self.twiddle1.im * x38n.re + self.twiddle5.im * x47n.re + 
-self.twiddle2.im * x56n.re; let b56im_a = buffer.load(0).im + self.twiddle5.re * x110p.im + self.twiddle1.re * x29p.im + self.twiddle4.re * x38p.im + self.twiddle2.re * x47p.im + self.twiddle3.re * x56p.im; let b56im_b = self.twiddle5.im * x110n.re + -self.twiddle1.im * x29n.re + self.twiddle4.im * x38n.re + -self.twiddle2.im * x47n.re + self.twiddle3.im * x56n.re; let out1re = b110re_a - b110re_b; let out1im = b110im_a + b110im_b; let out2re = b29re_a - b29re_b; let out2im = b29im_a + b29im_b; let out3re = b38re_a - b38re_b; let out3im = b38im_a + b38im_b; let out4re = b47re_a - b47re_b; let out4im = b47im_a + b47im_b; let out5re = b56re_a - b56re_b; let out5im = b56im_a + b56im_b; let out6re = b56re_a + b56re_b; let out6im = b56im_a - b56im_b; let out7re = b47re_a + b47re_b; let out7im = b47im_a - b47im_b; let out8re = b38re_a + b38re_b; let out8im = b38im_a - b38im_b; let out9re = b29re_a + b29re_b; let out9im = b29im_a - b29im_b; let out10re = b110re_a + b110re_b; let out10im = b110im_a - b110im_b; buffer.store(sum, 0); buffer.store( Complex { re: out1re, im: out1im, }, 1, ); buffer.store( Complex { re: out2re, im: out2im, }, 2, ); buffer.store( Complex { re: out3re, im: out3im, }, 3, ); buffer.store( Complex { re: out4re, im: out4im, }, 4, ); buffer.store( Complex { re: out5re, im: out5im, }, 5, ); buffer.store( Complex { re: out6re, im: out6im, }, 6, ); buffer.store( Complex { re: out7re, im: out7im, }, 7, ); buffer.store( Complex { re: out8re, im: out8im, }, 8, ); buffer.store( Complex { re: out9re, im: out9im, }, 9, ); buffer.store( Complex { re: out10re, im: out10im, }, 10, ); } } pub struct Butterfly13 { twiddle1: Complex, twiddle2: Complex, twiddle3: Complex, twiddle4: Complex, twiddle5: Complex, twiddle6: Complex, direction: FftDirection, } boilerplate_fft_butterfly!(Butterfly13, 13, |this: &Butterfly13<_>| this.direction); impl Butterfly13 { pub fn new(direction: FftDirection) -> Self { let twiddle1: Complex = twiddles::compute_twiddle(1, 13, direction); let twiddle2: Complex = twiddles::compute_twiddle(2, 13, direction); let twiddle3: Complex = twiddles::compute_twiddle(3, 13, direction); let twiddle4: Complex = twiddles::compute_twiddle(4, 13, direction); let twiddle5: Complex = twiddles::compute_twiddle(5, 13, direction); let twiddle6: Complex = twiddles::compute_twiddle(6, 13, direction); Self { twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, twiddle6, direction, } } #[inline(never)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { // This function was derived in the same manner as the butterflies for length 3, 5 and 7. // However, instead of doing it by hand the actual code is autogenerated // with the `genbutterflies.py` script in the `tools` directory. 
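        // Added worked example (for illustration only): for output row k = 2 of this length-13
        // butterfly, the exponents 2*m mod 13 for m = 1..6 are 2, 4, 6, 8, 10, 12, which fold to
        // twiddles 2, 4, 6 and then 5, 3, 1 conjugated. That is the coefficient pattern visible
        // in b211re_a / b211re_b below: twiddle{2,4,6,5,3,1}.re on the pair sums (conjugation
        // leaves the real part unchanged), and +twiddle2.im, +twiddle4.im, +twiddle6.im,
        // -twiddle5.im, -twiddle3.im, -twiddle1.im on the pair differences.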
let x112p = buffer.load(1) + buffer.load(12); let x112n = buffer.load(1) - buffer.load(12); let x211p = buffer.load(2) + buffer.load(11); let x211n = buffer.load(2) - buffer.load(11); let x310p = buffer.load(3) + buffer.load(10); let x310n = buffer.load(3) - buffer.load(10); let x49p = buffer.load(4) + buffer.load(9); let x49n = buffer.load(4) - buffer.load(9); let x58p = buffer.load(5) + buffer.load(8); let x58n = buffer.load(5) - buffer.load(8); let x67p = buffer.load(6) + buffer.load(7); let x67n = buffer.load(6) - buffer.load(7); let sum = buffer.load(0) + x112p + x211p + x310p + x49p + x58p + x67p; let b112re_a = buffer.load(0).re + self.twiddle1.re * x112p.re + self.twiddle2.re * x211p.re + self.twiddle3.re * x310p.re + self.twiddle4.re * x49p.re + self.twiddle5.re * x58p.re + self.twiddle6.re * x67p.re; let b112re_b = self.twiddle1.im * x112n.im + self.twiddle2.im * x211n.im + self.twiddle3.im * x310n.im + self.twiddle4.im * x49n.im + self.twiddle5.im * x58n.im + self.twiddle6.im * x67n.im; let b211re_a = buffer.load(0).re + self.twiddle2.re * x112p.re + self.twiddle4.re * x211p.re + self.twiddle6.re * x310p.re + self.twiddle5.re * x49p.re + self.twiddle3.re * x58p.re + self.twiddle1.re * x67p.re; let b211re_b = self.twiddle2.im * x112n.im + self.twiddle4.im * x211n.im + self.twiddle6.im * x310n.im + -self.twiddle5.im * x49n.im + -self.twiddle3.im * x58n.im + -self.twiddle1.im * x67n.im; let b310re_a = buffer.load(0).re + self.twiddle3.re * x112p.re + self.twiddle6.re * x211p.re + self.twiddle4.re * x310p.re + self.twiddle1.re * x49p.re + self.twiddle2.re * x58p.re + self.twiddle5.re * x67p.re; let b310re_b = self.twiddle3.im * x112n.im + self.twiddle6.im * x211n.im + -self.twiddle4.im * x310n.im + -self.twiddle1.im * x49n.im + self.twiddle2.im * x58n.im + self.twiddle5.im * x67n.im; let b49re_a = buffer.load(0).re + self.twiddle4.re * x112p.re + self.twiddle5.re * x211p.re + self.twiddle1.re * x310p.re + self.twiddle3.re * x49p.re + self.twiddle6.re * x58p.re + self.twiddle2.re * x67p.re; let b49re_b = self.twiddle4.im * x112n.im + -self.twiddle5.im * x211n.im + -self.twiddle1.im * x310n.im + self.twiddle3.im * x49n.im + -self.twiddle6.im * x58n.im + -self.twiddle2.im * x67n.im; let b58re_a = buffer.load(0).re + self.twiddle5.re * x112p.re + self.twiddle3.re * x211p.re + self.twiddle2.re * x310p.re + self.twiddle6.re * x49p.re + self.twiddle1.re * x58p.re + self.twiddle4.re * x67p.re; let b58re_b = self.twiddle5.im * x112n.im + -self.twiddle3.im * x211n.im + self.twiddle2.im * x310n.im + -self.twiddle6.im * x49n.im + -self.twiddle1.im * x58n.im + self.twiddle4.im * x67n.im; let b67re_a = buffer.load(0).re + self.twiddle6.re * x112p.re + self.twiddle1.re * x211p.re + self.twiddle5.re * x310p.re + self.twiddle2.re * x49p.re + self.twiddle4.re * x58p.re + self.twiddle3.re * x67p.re; let b67re_b = self.twiddle6.im * x112n.im + -self.twiddle1.im * x211n.im + self.twiddle5.im * x310n.im + -self.twiddle2.im * x49n.im + self.twiddle4.im * x58n.im + -self.twiddle3.im * x67n.im; let b112im_a = buffer.load(0).im + self.twiddle1.re * x112p.im + self.twiddle2.re * x211p.im + self.twiddle3.re * x310p.im + self.twiddle4.re * x49p.im + self.twiddle5.re * x58p.im + self.twiddle6.re * x67p.im; let b112im_b = self.twiddle1.im * x112n.re + self.twiddle2.im * x211n.re + self.twiddle3.im * x310n.re + self.twiddle4.im * x49n.re + self.twiddle5.im * x58n.re + self.twiddle6.im * x67n.re; let b211im_a = buffer.load(0).im + self.twiddle2.re * x112p.im + self.twiddle4.re * x211p.im + self.twiddle6.re * 
x310p.im + self.twiddle5.re * x49p.im + self.twiddle3.re * x58p.im + self.twiddle1.re * x67p.im; let b211im_b = self.twiddle2.im * x112n.re + self.twiddle4.im * x211n.re + self.twiddle6.im * x310n.re + -self.twiddle5.im * x49n.re + -self.twiddle3.im * x58n.re + -self.twiddle1.im * x67n.re; let b310im_a = buffer.load(0).im + self.twiddle3.re * x112p.im + self.twiddle6.re * x211p.im + self.twiddle4.re * x310p.im + self.twiddle1.re * x49p.im + self.twiddle2.re * x58p.im + self.twiddle5.re * x67p.im; let b310im_b = self.twiddle3.im * x112n.re + self.twiddle6.im * x211n.re + -self.twiddle4.im * x310n.re + -self.twiddle1.im * x49n.re + self.twiddle2.im * x58n.re + self.twiddle5.im * x67n.re; let b49im_a = buffer.load(0).im + self.twiddle4.re * x112p.im + self.twiddle5.re * x211p.im + self.twiddle1.re * x310p.im + self.twiddle3.re * x49p.im + self.twiddle6.re * x58p.im + self.twiddle2.re * x67p.im; let b49im_b = self.twiddle4.im * x112n.re + -self.twiddle5.im * x211n.re + -self.twiddle1.im * x310n.re + self.twiddle3.im * x49n.re + -self.twiddle6.im * x58n.re + -self.twiddle2.im * x67n.re; let b58im_a = buffer.load(0).im + self.twiddle5.re * x112p.im + self.twiddle3.re * x211p.im + self.twiddle2.re * x310p.im + self.twiddle6.re * x49p.im + self.twiddle1.re * x58p.im + self.twiddle4.re * x67p.im; let b58im_b = self.twiddle5.im * x112n.re + -self.twiddle3.im * x211n.re + self.twiddle2.im * x310n.re + -self.twiddle6.im * x49n.re + -self.twiddle1.im * x58n.re + self.twiddle4.im * x67n.re; let b67im_a = buffer.load(0).im + self.twiddle6.re * x112p.im + self.twiddle1.re * x211p.im + self.twiddle5.re * x310p.im + self.twiddle2.re * x49p.im + self.twiddle4.re * x58p.im + self.twiddle3.re * x67p.im; let b67im_b = self.twiddle6.im * x112n.re + -self.twiddle1.im * x211n.re + self.twiddle5.im * x310n.re + -self.twiddle2.im * x49n.re + self.twiddle4.im * x58n.re + -self.twiddle3.im * x67n.re; let out1re = b112re_a - b112re_b; let out1im = b112im_a + b112im_b; let out2re = b211re_a - b211re_b; let out2im = b211im_a + b211im_b; let out3re = b310re_a - b310re_b; let out3im = b310im_a + b310im_b; let out4re = b49re_a - b49re_b; let out4im = b49im_a + b49im_b; let out5re = b58re_a - b58re_b; let out5im = b58im_a + b58im_b; let out6re = b67re_a - b67re_b; let out6im = b67im_a + b67im_b; let out7re = b67re_a + b67re_b; let out7im = b67im_a - b67im_b; let out8re = b58re_a + b58re_b; let out8im = b58im_a - b58im_b; let out9re = b49re_a + b49re_b; let out9im = b49im_a - b49im_b; let out10re = b310re_a + b310re_b; let out10im = b310im_a - b310im_b; let out11re = b211re_a + b211re_b; let out11im = b211im_a - b211im_b; let out12re = b112re_a + b112re_b; let out12im = b112im_a - b112im_b; buffer.store(sum, 0); buffer.store( Complex { re: out1re, im: out1im, }, 1, ); buffer.store( Complex { re: out2re, im: out2im, }, 2, ); buffer.store( Complex { re: out3re, im: out3im, }, 3, ); buffer.store( Complex { re: out4re, im: out4im, }, 4, ); buffer.store( Complex { re: out5re, im: out5im, }, 5, ); buffer.store( Complex { re: out6re, im: out6im, }, 6, ); buffer.store( Complex { re: out7re, im: out7im, }, 7, ); buffer.store( Complex { re: out8re, im: out8im, }, 8, ); buffer.store( Complex { re: out9re, im: out9im, }, 9, ); buffer.store( Complex { re: out10re, im: out10im, }, 10, ); buffer.store( Complex { re: out11re, im: out11im, }, 11, ); buffer.store( Complex { re: out12re, im: out12im, }, 12, ); } } pub struct Butterfly16 { butterfly8: Butterfly8, twiddle1: Complex, twiddle2: Complex, twiddle3: Complex, } 
boilerplate_fft_butterfly!(Butterfly16, 16, |this: &Butterfly16<_>| this .butterfly8 .fft_direction()); impl Butterfly16 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { Self { butterfly8: Butterfly8::new(direction), twiddle1: twiddles::compute_twiddle(1, 16, direction), twiddle2: twiddles::compute_twiddle(2, 16, direction), twiddle3: twiddles::compute_twiddle(3, 16, direction), } } #[inline(never)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { let butterfly4 = Butterfly4::new(self.fft_direction()); // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch let mut scratch_evens = [ buffer.load(0), buffer.load(2), buffer.load(4), buffer.load(6), buffer.load(8), buffer.load(10), buffer.load(12), buffer.load(14), ]; let mut scratch_odds_n1 = [ buffer.load(1), buffer.load(5), buffer.load(9), buffer.load(13), ]; let mut scratch_odds_n3 = [ buffer.load(15), buffer.load(3), buffer.load(7), buffer.load(11), ]; // step 2: column FFTs self.butterfly8.perform_fft_contiguous(&mut scratch_evens); butterfly4.perform_fft_contiguous(&mut scratch_odds_n1); butterfly4.perform_fft_contiguous(&mut scratch_odds_n3); // step 3: apply twiddle factors scratch_odds_n1[1] = scratch_odds_n1[1] * self.twiddle1; scratch_odds_n3[1] = scratch_odds_n3[1] * self.twiddle1.conj(); scratch_odds_n1[2] = scratch_odds_n1[2] * self.twiddle2; scratch_odds_n3[2] = scratch_odds_n3[2] * self.twiddle2.conj(); scratch_odds_n1[3] = scratch_odds_n1[3] * self.twiddle3; scratch_odds_n3[3] = scratch_odds_n3[3] * self.twiddle3.conj(); // step 4: cross FFTs Butterfly2::perform_fft_strided(&mut scratch_odds_n1[0], &mut scratch_odds_n3[0]); Butterfly2::perform_fft_strided(&mut scratch_odds_n1[1], &mut scratch_odds_n3[1]); Butterfly2::perform_fft_strided(&mut scratch_odds_n1[2], &mut scratch_odds_n3[2]); Butterfly2::perform_fft_strided(&mut scratch_odds_n1[3], &mut scratch_odds_n3[3]); // apply the butterfly 4 twiddle factor, which is just a rotation scratch_odds_n3[0] = twiddles::rotate_90(scratch_odds_n3[0], self.fft_direction()); scratch_odds_n3[1] = twiddles::rotate_90(scratch_odds_n3[1], self.fft_direction()); scratch_odds_n3[2] = twiddles::rotate_90(scratch_odds_n3[2], self.fft_direction()); scratch_odds_n3[3] = twiddles::rotate_90(scratch_odds_n3[3], self.fft_direction()); //step 5: copy/add/subtract data back to buffer buffer.store(scratch_evens[0] + scratch_odds_n1[0], 0); buffer.store(scratch_evens[1] + scratch_odds_n1[1], 1); buffer.store(scratch_evens[2] + scratch_odds_n1[2], 2); buffer.store(scratch_evens[3] + scratch_odds_n1[3], 3); buffer.store(scratch_evens[4] + scratch_odds_n3[0], 4); buffer.store(scratch_evens[5] + scratch_odds_n3[1], 5); buffer.store(scratch_evens[6] + scratch_odds_n3[2], 6); buffer.store(scratch_evens[7] + scratch_odds_n3[3], 7); buffer.store(scratch_evens[0] - scratch_odds_n1[0], 8); buffer.store(scratch_evens[1] - scratch_odds_n1[1], 9); buffer.store(scratch_evens[2] - scratch_odds_n1[2], 10); buffer.store(scratch_evens[3] - scratch_odds_n1[3], 11); buffer.store(scratch_evens[4] - scratch_odds_n3[0], 12); buffer.store(scratch_evens[5] - scratch_odds_n3[1], 13); buffer.store(scratch_evens[6] - scratch_odds_n3[2], 14); buffer.store(scratch_evens[7] - scratch_odds_n3[3], 15); } } pub struct Butterfly17 { twiddle1: Complex, twiddle2: Complex, twiddle3: Complex, twiddle4: Complex, twiddle5: Complex, twiddle6: Complex, twiddle7: Complex, twiddle8: Complex, direction: FftDirection, } boilerplate_fft_butterfly!(Butterfly17, 17, 
|this: &Butterfly17<_>| this.direction); impl Butterfly17 { pub fn new(direction: FftDirection) -> Self { let twiddle1: Complex = twiddles::compute_twiddle(1, 17, direction); let twiddle2: Complex = twiddles::compute_twiddle(2, 17, direction); let twiddle3: Complex = twiddles::compute_twiddle(3, 17, direction); let twiddle4: Complex = twiddles::compute_twiddle(4, 17, direction); let twiddle5: Complex = twiddles::compute_twiddle(5, 17, direction); let twiddle6: Complex = twiddles::compute_twiddle(6, 17, direction); let twiddle7: Complex = twiddles::compute_twiddle(7, 17, direction); let twiddle8: Complex = twiddles::compute_twiddle(8, 17, direction); Self { twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, twiddle6, twiddle7, twiddle8, direction, } } #[inline(never)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { // This function was derived in the same manner as the butterflies for length 3, 5 and 7. // However, instead of doing it by hand the actual code is autogenerated // with the `genbutterflies.py` script in the `tools` directory. let x116p = buffer.load(1) + buffer.load(16); let x116n = buffer.load(1) - buffer.load(16); let x215p = buffer.load(2) + buffer.load(15); let x215n = buffer.load(2) - buffer.load(15); let x314p = buffer.load(3) + buffer.load(14); let x314n = buffer.load(3) - buffer.load(14); let x413p = buffer.load(4) + buffer.load(13); let x413n = buffer.load(4) - buffer.load(13); let x512p = buffer.load(5) + buffer.load(12); let x512n = buffer.load(5) - buffer.load(12); let x611p = buffer.load(6) + buffer.load(11); let x611n = buffer.load(6) - buffer.load(11); let x710p = buffer.load(7) + buffer.load(10); let x710n = buffer.load(7) - buffer.load(10); let x89p = buffer.load(8) + buffer.load(9); let x89n = buffer.load(8) - buffer.load(9); let sum = buffer.load(0) + x116p + x215p + x314p + x413p + x512p + x611p + x710p + x89p; let b116re_a = buffer.load(0).re + self.twiddle1.re * x116p.re + self.twiddle2.re * x215p.re + self.twiddle3.re * x314p.re + self.twiddle4.re * x413p.re + self.twiddle5.re * x512p.re + self.twiddle6.re * x611p.re + self.twiddle7.re * x710p.re + self.twiddle8.re * x89p.re; let b116re_b = self.twiddle1.im * x116n.im + self.twiddle2.im * x215n.im + self.twiddle3.im * x314n.im + self.twiddle4.im * x413n.im + self.twiddle5.im * x512n.im + self.twiddle6.im * x611n.im + self.twiddle7.im * x710n.im + self.twiddle8.im * x89n.im; let b215re_a = buffer.load(0).re + self.twiddle2.re * x116p.re + self.twiddle4.re * x215p.re + self.twiddle6.re * x314p.re + self.twiddle8.re * x413p.re + self.twiddle7.re * x512p.re + self.twiddle5.re * x611p.re + self.twiddle3.re * x710p.re + self.twiddle1.re * x89p.re; let b215re_b = self.twiddle2.im * x116n.im + self.twiddle4.im * x215n.im + self.twiddle6.im * x314n.im + self.twiddle8.im * x413n.im + -self.twiddle7.im * x512n.im + -self.twiddle5.im * x611n.im + -self.twiddle3.im * x710n.im + -self.twiddle1.im * x89n.im; let b314re_a = buffer.load(0).re + self.twiddle3.re * x116p.re + self.twiddle6.re * x215p.re + self.twiddle8.re * x314p.re + self.twiddle5.re * x413p.re + self.twiddle2.re * x512p.re + self.twiddle1.re * x611p.re + self.twiddle4.re * x710p.re + self.twiddle7.re * x89p.re; let b314re_b = self.twiddle3.im * x116n.im + self.twiddle6.im * x215n.im + -self.twiddle8.im * x314n.im + -self.twiddle5.im * x413n.im + -self.twiddle2.im * x512n.im + self.twiddle1.im * x611n.im + self.twiddle4.im * x710n.im + self.twiddle7.im * x89n.im; let b413re_a = buffer.load(0).re + self.twiddle4.re * x116p.re + 
self.twiddle8.re * x215p.re + self.twiddle5.re * x314p.re + self.twiddle1.re * x413p.re + self.twiddle3.re * x512p.re + self.twiddle7.re * x611p.re + self.twiddle6.re * x710p.re + self.twiddle2.re * x89p.re; let b413re_b = self.twiddle4.im * x116n.im + self.twiddle8.im * x215n.im + -self.twiddle5.im * x314n.im + -self.twiddle1.im * x413n.im + self.twiddle3.im * x512n.im + self.twiddle7.im * x611n.im + -self.twiddle6.im * x710n.im + -self.twiddle2.im * x89n.im; let b512re_a = buffer.load(0).re + self.twiddle5.re * x116p.re + self.twiddle7.re * x215p.re + self.twiddle2.re * x314p.re + self.twiddle3.re * x413p.re + self.twiddle8.re * x512p.re + self.twiddle4.re * x611p.re + self.twiddle1.re * x710p.re + self.twiddle6.re * x89p.re; let b512re_b = self.twiddle5.im * x116n.im + -self.twiddle7.im * x215n.im + -self.twiddle2.im * x314n.im + self.twiddle3.im * x413n.im + self.twiddle8.im * x512n.im + -self.twiddle4.im * x611n.im + self.twiddle1.im * x710n.im + self.twiddle6.im * x89n.im; let b611re_a = buffer.load(0).re + self.twiddle6.re * x116p.re + self.twiddle5.re * x215p.re + self.twiddle1.re * x314p.re + self.twiddle7.re * x413p.re + self.twiddle4.re * x512p.re + self.twiddle2.re * x611p.re + self.twiddle8.re * x710p.re + self.twiddle3.re * x89p.re; let b611re_b = self.twiddle6.im * x116n.im + -self.twiddle5.im * x215n.im + self.twiddle1.im * x314n.im + self.twiddle7.im * x413n.im + -self.twiddle4.im * x512n.im + self.twiddle2.im * x611n.im + self.twiddle8.im * x710n.im + -self.twiddle3.im * x89n.im; let b710re_a = buffer.load(0).re + self.twiddle7.re * x116p.re + self.twiddle3.re * x215p.re + self.twiddle4.re * x314p.re + self.twiddle6.re * x413p.re + self.twiddle1.re * x512p.re + self.twiddle8.re * x611p.re + self.twiddle2.re * x710p.re + self.twiddle5.re * x89p.re; let b710re_b = self.twiddle7.im * x116n.im + -self.twiddle3.im * x215n.im + self.twiddle4.im * x314n.im + -self.twiddle6.im * x413n.im + self.twiddle1.im * x512n.im + self.twiddle8.im * x611n.im + -self.twiddle2.im * x710n.im + self.twiddle5.im * x89n.im; let b89re_a = buffer.load(0).re + self.twiddle8.re * x116p.re + self.twiddle1.re * x215p.re + self.twiddle7.re * x314p.re + self.twiddle2.re * x413p.re + self.twiddle6.re * x512p.re + self.twiddle3.re * x611p.re + self.twiddle5.re * x710p.re + self.twiddle4.re * x89p.re; let b89re_b = self.twiddle8.im * x116n.im + -self.twiddle1.im * x215n.im + self.twiddle7.im * x314n.im + -self.twiddle2.im * x413n.im + self.twiddle6.im * x512n.im + -self.twiddle3.im * x611n.im + self.twiddle5.im * x710n.im + -self.twiddle4.im * x89n.im; let b116im_a = buffer.load(0).im + self.twiddle1.re * x116p.im + self.twiddle2.re * x215p.im + self.twiddle3.re * x314p.im + self.twiddle4.re * x413p.im + self.twiddle5.re * x512p.im + self.twiddle6.re * x611p.im + self.twiddle7.re * x710p.im + self.twiddle8.re * x89p.im; let b116im_b = self.twiddle1.im * x116n.re + self.twiddle2.im * x215n.re + self.twiddle3.im * x314n.re + self.twiddle4.im * x413n.re + self.twiddle5.im * x512n.re + self.twiddle6.im * x611n.re + self.twiddle7.im * x710n.re + self.twiddle8.im * x89n.re; let b215im_a = buffer.load(0).im + self.twiddle2.re * x116p.im + self.twiddle4.re * x215p.im + self.twiddle6.re * x314p.im + self.twiddle8.re * x413p.im + self.twiddle7.re * x512p.im + self.twiddle5.re * x611p.im + self.twiddle3.re * x710p.im + self.twiddle1.re * x89p.im; let b215im_b = self.twiddle2.im * x116n.re + self.twiddle4.im * x215n.re + self.twiddle6.im * x314n.re + self.twiddle8.im * x413n.re + -self.twiddle7.im * x512n.re + 
-self.twiddle5.im * x611n.re + -self.twiddle3.im * x710n.re + -self.twiddle1.im * x89n.re; let b314im_a = buffer.load(0).im + self.twiddle3.re * x116p.im + self.twiddle6.re * x215p.im + self.twiddle8.re * x314p.im + self.twiddle5.re * x413p.im + self.twiddle2.re * x512p.im + self.twiddle1.re * x611p.im + self.twiddle4.re * x710p.im + self.twiddle7.re * x89p.im; let b314im_b = self.twiddle3.im * x116n.re + self.twiddle6.im * x215n.re + -self.twiddle8.im * x314n.re + -self.twiddle5.im * x413n.re + -self.twiddle2.im * x512n.re + self.twiddle1.im * x611n.re + self.twiddle4.im * x710n.re + self.twiddle7.im * x89n.re; let b413im_a = buffer.load(0).im + self.twiddle4.re * x116p.im + self.twiddle8.re * x215p.im + self.twiddle5.re * x314p.im + self.twiddle1.re * x413p.im + self.twiddle3.re * x512p.im + self.twiddle7.re * x611p.im + self.twiddle6.re * x710p.im + self.twiddle2.re * x89p.im; let b413im_b = self.twiddle4.im * x116n.re + self.twiddle8.im * x215n.re + -self.twiddle5.im * x314n.re + -self.twiddle1.im * x413n.re + self.twiddle3.im * x512n.re + self.twiddle7.im * x611n.re + -self.twiddle6.im * x710n.re + -self.twiddle2.im * x89n.re; let b512im_a = buffer.load(0).im + self.twiddle5.re * x116p.im + self.twiddle7.re * x215p.im + self.twiddle2.re * x314p.im + self.twiddle3.re * x413p.im + self.twiddle8.re * x512p.im + self.twiddle4.re * x611p.im + self.twiddle1.re * x710p.im + self.twiddle6.re * x89p.im; let b512im_b = self.twiddle5.im * x116n.re + -self.twiddle7.im * x215n.re + -self.twiddle2.im * x314n.re + self.twiddle3.im * x413n.re + self.twiddle8.im * x512n.re + -self.twiddle4.im * x611n.re + self.twiddle1.im * x710n.re + self.twiddle6.im * x89n.re; let b611im_a = buffer.load(0).im + self.twiddle6.re * x116p.im + self.twiddle5.re * x215p.im + self.twiddle1.re * x314p.im + self.twiddle7.re * x413p.im + self.twiddle4.re * x512p.im + self.twiddle2.re * x611p.im + self.twiddle8.re * x710p.im + self.twiddle3.re * x89p.im; let b611im_b = self.twiddle6.im * x116n.re + -self.twiddle5.im * x215n.re + self.twiddle1.im * x314n.re + self.twiddle7.im * x413n.re + -self.twiddle4.im * x512n.re + self.twiddle2.im * x611n.re + self.twiddle8.im * x710n.re + -self.twiddle3.im * x89n.re; let b710im_a = buffer.load(0).im + self.twiddle7.re * x116p.im + self.twiddle3.re * x215p.im + self.twiddle4.re * x314p.im + self.twiddle6.re * x413p.im + self.twiddle1.re * x512p.im + self.twiddle8.re * x611p.im + self.twiddle2.re * x710p.im + self.twiddle5.re * x89p.im; let b710im_b = self.twiddle7.im * x116n.re + -self.twiddle3.im * x215n.re + self.twiddle4.im * x314n.re + -self.twiddle6.im * x413n.re + self.twiddle1.im * x512n.re + self.twiddle8.im * x611n.re + -self.twiddle2.im * x710n.re + self.twiddle5.im * x89n.re; let b89im_a = buffer.load(0).im + self.twiddle8.re * x116p.im + self.twiddle1.re * x215p.im + self.twiddle7.re * x314p.im + self.twiddle2.re * x413p.im + self.twiddle6.re * x512p.im + self.twiddle3.re * x611p.im + self.twiddle5.re * x710p.im + self.twiddle4.re * x89p.im; let b89im_b = self.twiddle8.im * x116n.re + -self.twiddle1.im * x215n.re + self.twiddle7.im * x314n.re + -self.twiddle2.im * x413n.re + self.twiddle6.im * x512n.re + -self.twiddle3.im * x611n.re + self.twiddle5.im * x710n.re + -self.twiddle4.im * x89n.re; let out1re = b116re_a - b116re_b; let out1im = b116im_a + b116im_b; let out2re = b215re_a - b215re_b; let out2im = b215im_a + b215im_b; let out3re = b314re_a - b314re_b; let out3im = b314im_a + b314im_b; let out4re = b413re_a - b413re_b; let out4im = b413im_a + b413im_b; let out5re = 
b512re_a - b512re_b; let out5im = b512im_a + b512im_b; let out6re = b611re_a - b611re_b; let out6im = b611im_a + b611im_b; let out7re = b710re_a - b710re_b; let out7im = b710im_a + b710im_b; let out8re = b89re_a - b89re_b; let out8im = b89im_a + b89im_b; let out9re = b89re_a + b89re_b; let out9im = b89im_a - b89im_b; let out10re = b710re_a + b710re_b; let out10im = b710im_a - b710im_b; let out11re = b611re_a + b611re_b; let out11im = b611im_a - b611im_b; let out12re = b512re_a + b512re_b; let out12im = b512im_a - b512im_b; let out13re = b413re_a + b413re_b; let out13im = b413im_a - b413im_b; let out14re = b314re_a + b314re_b; let out14im = b314im_a - b314im_b; let out15re = b215re_a + b215re_b; let out15im = b215im_a - b215im_b; let out16re = b116re_a + b116re_b; let out16im = b116im_a - b116im_b; buffer.store(sum, 0); buffer.store( Complex { re: out1re, im: out1im, }, 1, ); buffer.store( Complex { re: out2re, im: out2im, }, 2, ); buffer.store( Complex { re: out3re, im: out3im, }, 3, ); buffer.store( Complex { re: out4re, im: out4im, }, 4, ); buffer.store( Complex { re: out5re, im: out5im, }, 5, ); buffer.store( Complex { re: out6re, im: out6im, }, 6, ); buffer.store( Complex { re: out7re, im: out7im, }, 7, ); buffer.store( Complex { re: out8re, im: out8im, }, 8, ); buffer.store( Complex { re: out9re, im: out9im, }, 9, ); buffer.store( Complex { re: out10re, im: out10im, }, 10, ); buffer.store( Complex { re: out11re, im: out11im, }, 11, ); buffer.store( Complex { re: out12re, im: out12im, }, 12, ); buffer.store( Complex { re: out13re, im: out13im, }, 13, ); buffer.store( Complex { re: out14re, im: out14im, }, 14, ); buffer.store( Complex { re: out15re, im: out15im, }, 15, ); buffer.store( Complex { re: out16re, im: out16im, }, 16, ); } } pub struct Butterfly19 { twiddle1: Complex, twiddle2: Complex, twiddle3: Complex, twiddle4: Complex, twiddle5: Complex, twiddle6: Complex, twiddle7: Complex, twiddle8: Complex, twiddle9: Complex, direction: FftDirection, } boilerplate_fft_butterfly!(Butterfly19, 19, |this: &Butterfly19<_>| this.direction); impl Butterfly19 { pub fn new(direction: FftDirection) -> Self { let twiddle1: Complex = twiddles::compute_twiddle(1, 19, direction); let twiddle2: Complex = twiddles::compute_twiddle(2, 19, direction); let twiddle3: Complex = twiddles::compute_twiddle(3, 19, direction); let twiddle4: Complex = twiddles::compute_twiddle(4, 19, direction); let twiddle5: Complex = twiddles::compute_twiddle(5, 19, direction); let twiddle6: Complex = twiddles::compute_twiddle(6, 19, direction); let twiddle7: Complex = twiddles::compute_twiddle(7, 19, direction); let twiddle8: Complex = twiddles::compute_twiddle(8, 19, direction); let twiddle9: Complex = twiddles::compute_twiddle(9, 19, direction); Self { twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, twiddle6, twiddle7, twiddle8, twiddle9, direction, } } #[inline(never)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { // This function was derived in the same manner as the butterflies for length 3, 5 and 7. // However, instead of doing it by hand the actual code is autogenerated // with the `genbutterflies.py` script in the `tools` directory. 
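        // Added note on the output pairing (an observation, not from the code generator): since
        // W^((19-k)*m) is the conjugate of W^(k*m), each output pair (X[k], X[19-k]) shares one
        // set of b*_a / b*_b intermediates; X[k] uses a - b for the real part and a + b for the
        // imaginary part, while X[19-k] uses a + b and a - b. That is why only (19-1)/2 = 9 sets
        // of intermediates are computed below.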
let x118p = buffer.load(1) + buffer.load(18); let x118n = buffer.load(1) - buffer.load(18); let x217p = buffer.load(2) + buffer.load(17); let x217n = buffer.load(2) - buffer.load(17); let x316p = buffer.load(3) + buffer.load(16); let x316n = buffer.load(3) - buffer.load(16); let x415p = buffer.load(4) + buffer.load(15); let x415n = buffer.load(4) - buffer.load(15); let x514p = buffer.load(5) + buffer.load(14); let x514n = buffer.load(5) - buffer.load(14); let x613p = buffer.load(6) + buffer.load(13); let x613n = buffer.load(6) - buffer.load(13); let x712p = buffer.load(7) + buffer.load(12); let x712n = buffer.load(7) - buffer.load(12); let x811p = buffer.load(8) + buffer.load(11); let x811n = buffer.load(8) - buffer.load(11); let x910p = buffer.load(9) + buffer.load(10); let x910n = buffer.load(9) - buffer.load(10); let sum = buffer.load(0) + x118p + x217p + x316p + x415p + x514p + x613p + x712p + x811p + x910p; let b118re_a = buffer.load(0).re + self.twiddle1.re * x118p.re + self.twiddle2.re * x217p.re + self.twiddle3.re * x316p.re + self.twiddle4.re * x415p.re + self.twiddle5.re * x514p.re + self.twiddle6.re * x613p.re + self.twiddle7.re * x712p.re + self.twiddle8.re * x811p.re + self.twiddle9.re * x910p.re; let b118re_b = self.twiddle1.im * x118n.im + self.twiddle2.im * x217n.im + self.twiddle3.im * x316n.im + self.twiddle4.im * x415n.im + self.twiddle5.im * x514n.im + self.twiddle6.im * x613n.im + self.twiddle7.im * x712n.im + self.twiddle8.im * x811n.im + self.twiddle9.im * x910n.im; let b217re_a = buffer.load(0).re + self.twiddle2.re * x118p.re + self.twiddle4.re * x217p.re + self.twiddle6.re * x316p.re + self.twiddle8.re * x415p.re + self.twiddle9.re * x514p.re + self.twiddle7.re * x613p.re + self.twiddle5.re * x712p.re + self.twiddle3.re * x811p.re + self.twiddle1.re * x910p.re; let b217re_b = self.twiddle2.im * x118n.im + self.twiddle4.im * x217n.im + self.twiddle6.im * x316n.im + self.twiddle8.im * x415n.im + -self.twiddle9.im * x514n.im + -self.twiddle7.im * x613n.im + -self.twiddle5.im * x712n.im + -self.twiddle3.im * x811n.im + -self.twiddle1.im * x910n.im; let b316re_a = buffer.load(0).re + self.twiddle3.re * x118p.re + self.twiddle6.re * x217p.re + self.twiddle9.re * x316p.re + self.twiddle7.re * x415p.re + self.twiddle4.re * x514p.re + self.twiddle1.re * x613p.re + self.twiddle2.re * x712p.re + self.twiddle5.re * x811p.re + self.twiddle8.re * x910p.re; let b316re_b = self.twiddle3.im * x118n.im + self.twiddle6.im * x217n.im + self.twiddle9.im * x316n.im + -self.twiddle7.im * x415n.im + -self.twiddle4.im * x514n.im + -self.twiddle1.im * x613n.im + self.twiddle2.im * x712n.im + self.twiddle5.im * x811n.im + self.twiddle8.im * x910n.im; let b415re_a = buffer.load(0).re + self.twiddle4.re * x118p.re + self.twiddle8.re * x217p.re + self.twiddle7.re * x316p.re + self.twiddle3.re * x415p.re + self.twiddle1.re * x514p.re + self.twiddle5.re * x613p.re + self.twiddle9.re * x712p.re + self.twiddle6.re * x811p.re + self.twiddle2.re * x910p.re; let b415re_b = self.twiddle4.im * x118n.im + self.twiddle8.im * x217n.im + -self.twiddle7.im * x316n.im + -self.twiddle3.im * x415n.im + self.twiddle1.im * x514n.im + self.twiddle5.im * x613n.im + self.twiddle9.im * x712n.im + -self.twiddle6.im * x811n.im + -self.twiddle2.im * x910n.im; let b514re_a = buffer.load(0).re + self.twiddle5.re * x118p.re + self.twiddle9.re * x217p.re + self.twiddle4.re * x316p.re + self.twiddle1.re * x415p.re + self.twiddle6.re * x514p.re + self.twiddle8.re * x613p.re + self.twiddle3.re * x712p.re + self.twiddle2.re * 
x811p.re + self.twiddle7.re * x910p.re; let b514re_b = self.twiddle5.im * x118n.im + -self.twiddle9.im * x217n.im + -self.twiddle4.im * x316n.im + self.twiddle1.im * x415n.im + self.twiddle6.im * x514n.im + -self.twiddle8.im * x613n.im + -self.twiddle3.im * x712n.im + self.twiddle2.im * x811n.im + self.twiddle7.im * x910n.im; let b613re_a = buffer.load(0).re + self.twiddle6.re * x118p.re + self.twiddle7.re * x217p.re + self.twiddle1.re * x316p.re + self.twiddle5.re * x415p.re + self.twiddle8.re * x514p.re + self.twiddle2.re * x613p.re + self.twiddle4.re * x712p.re + self.twiddle9.re * x811p.re + self.twiddle3.re * x910p.re; let b613re_b = self.twiddle6.im * x118n.im + -self.twiddle7.im * x217n.im + -self.twiddle1.im * x316n.im + self.twiddle5.im * x415n.im + -self.twiddle8.im * x514n.im + -self.twiddle2.im * x613n.im + self.twiddle4.im * x712n.im + -self.twiddle9.im * x811n.im + -self.twiddle3.im * x910n.im; let b712re_a = buffer.load(0).re + self.twiddle7.re * x118p.re + self.twiddle5.re * x217p.re + self.twiddle2.re * x316p.re + self.twiddle9.re * x415p.re + self.twiddle3.re * x514p.re + self.twiddle4.re * x613p.re + self.twiddle8.re * x712p.re + self.twiddle1.re * x811p.re + self.twiddle6.re * x910p.re; let b712re_b = self.twiddle7.im * x118n.im + -self.twiddle5.im * x217n.im + self.twiddle2.im * x316n.im + self.twiddle9.im * x415n.im + -self.twiddle3.im * x514n.im + self.twiddle4.im * x613n.im + -self.twiddle8.im * x712n.im + -self.twiddle1.im * x811n.im + self.twiddle6.im * x910n.im; let b811re_a = buffer.load(0).re + self.twiddle8.re * x118p.re + self.twiddle3.re * x217p.re + self.twiddle5.re * x316p.re + self.twiddle6.re * x415p.re + self.twiddle2.re * x514p.re + self.twiddle9.re * x613p.re + self.twiddle1.re * x712p.re + self.twiddle7.re * x811p.re + self.twiddle4.re * x910p.re; let b811re_b = self.twiddle8.im * x118n.im + -self.twiddle3.im * x217n.im + self.twiddle5.im * x316n.im + -self.twiddle6.im * x415n.im + self.twiddle2.im * x514n.im + -self.twiddle9.im * x613n.im + -self.twiddle1.im * x712n.im + self.twiddle7.im * x811n.im + -self.twiddle4.im * x910n.im; let b910re_a = buffer.load(0).re + self.twiddle9.re * x118p.re + self.twiddle1.re * x217p.re + self.twiddle8.re * x316p.re + self.twiddle2.re * x415p.re + self.twiddle7.re * x514p.re + self.twiddle3.re * x613p.re + self.twiddle6.re * x712p.re + self.twiddle4.re * x811p.re + self.twiddle5.re * x910p.re; let b910re_b = self.twiddle9.im * x118n.im + -self.twiddle1.im * x217n.im + self.twiddle8.im * x316n.im + -self.twiddle2.im * x415n.im + self.twiddle7.im * x514n.im + -self.twiddle3.im * x613n.im + self.twiddle6.im * x712n.im + -self.twiddle4.im * x811n.im + self.twiddle5.im * x910n.im; let b118im_a = buffer.load(0).im + self.twiddle1.re * x118p.im + self.twiddle2.re * x217p.im + self.twiddle3.re * x316p.im + self.twiddle4.re * x415p.im + self.twiddle5.re * x514p.im + self.twiddle6.re * x613p.im + self.twiddle7.re * x712p.im + self.twiddle8.re * x811p.im + self.twiddle9.re * x910p.im; let b118im_b = self.twiddle1.im * x118n.re + self.twiddle2.im * x217n.re + self.twiddle3.im * x316n.re + self.twiddle4.im * x415n.re + self.twiddle5.im * x514n.re + self.twiddle6.im * x613n.re + self.twiddle7.im * x712n.re + self.twiddle8.im * x811n.re + self.twiddle9.im * x910n.re; let b217im_a = buffer.load(0).im + self.twiddle2.re * x118p.im + self.twiddle4.re * x217p.im + self.twiddle6.re * x316p.im + self.twiddle8.re * x415p.im + self.twiddle9.re * x514p.im + self.twiddle7.re * x613p.im + self.twiddle5.re * x712p.im + self.twiddle3.re * 
x811p.im + self.twiddle1.re * x910p.im; let b217im_b = self.twiddle2.im * x118n.re + self.twiddle4.im * x217n.re + self.twiddle6.im * x316n.re + self.twiddle8.im * x415n.re + -self.twiddle9.im * x514n.re + -self.twiddle7.im * x613n.re + -self.twiddle5.im * x712n.re + -self.twiddle3.im * x811n.re + -self.twiddle1.im * x910n.re; let b316im_a = buffer.load(0).im + self.twiddle3.re * x118p.im + self.twiddle6.re * x217p.im + self.twiddle9.re * x316p.im + self.twiddle7.re * x415p.im + self.twiddle4.re * x514p.im + self.twiddle1.re * x613p.im + self.twiddle2.re * x712p.im + self.twiddle5.re * x811p.im + self.twiddle8.re * x910p.im; let b316im_b = self.twiddle3.im * x118n.re + self.twiddle6.im * x217n.re + self.twiddle9.im * x316n.re + -self.twiddle7.im * x415n.re + -self.twiddle4.im * x514n.re + -self.twiddle1.im * x613n.re + self.twiddle2.im * x712n.re + self.twiddle5.im * x811n.re + self.twiddle8.im * x910n.re; let b415im_a = buffer.load(0).im + self.twiddle4.re * x118p.im + self.twiddle8.re * x217p.im + self.twiddle7.re * x316p.im + self.twiddle3.re * x415p.im + self.twiddle1.re * x514p.im + self.twiddle5.re * x613p.im + self.twiddle9.re * x712p.im + self.twiddle6.re * x811p.im + self.twiddle2.re * x910p.im; let b415im_b = self.twiddle4.im * x118n.re + self.twiddle8.im * x217n.re + -self.twiddle7.im * x316n.re + -self.twiddle3.im * x415n.re + self.twiddle1.im * x514n.re + self.twiddle5.im * x613n.re + self.twiddle9.im * x712n.re + -self.twiddle6.im * x811n.re + -self.twiddle2.im * x910n.re; let b514im_a = buffer.load(0).im + self.twiddle5.re * x118p.im + self.twiddle9.re * x217p.im + self.twiddle4.re * x316p.im + self.twiddle1.re * x415p.im + self.twiddle6.re * x514p.im + self.twiddle8.re * x613p.im + self.twiddle3.re * x712p.im + self.twiddle2.re * x811p.im + self.twiddle7.re * x910p.im; let b514im_b = self.twiddle5.im * x118n.re + -self.twiddle9.im * x217n.re + -self.twiddle4.im * x316n.re + self.twiddle1.im * x415n.re + self.twiddle6.im * x514n.re + -self.twiddle8.im * x613n.re + -self.twiddle3.im * x712n.re + self.twiddle2.im * x811n.re + self.twiddle7.im * x910n.re; let b613im_a = buffer.load(0).im + self.twiddle6.re * x118p.im + self.twiddle7.re * x217p.im + self.twiddle1.re * x316p.im + self.twiddle5.re * x415p.im + self.twiddle8.re * x514p.im + self.twiddle2.re * x613p.im + self.twiddle4.re * x712p.im + self.twiddle9.re * x811p.im + self.twiddle3.re * x910p.im; let b613im_b = self.twiddle6.im * x118n.re + -self.twiddle7.im * x217n.re + -self.twiddle1.im * x316n.re + self.twiddle5.im * x415n.re + -self.twiddle8.im * x514n.re + -self.twiddle2.im * x613n.re + self.twiddle4.im * x712n.re + -self.twiddle9.im * x811n.re + -self.twiddle3.im * x910n.re; let b712im_a = buffer.load(0).im + self.twiddle7.re * x118p.im + self.twiddle5.re * x217p.im + self.twiddle2.re * x316p.im + self.twiddle9.re * x415p.im + self.twiddle3.re * x514p.im + self.twiddle4.re * x613p.im + self.twiddle8.re * x712p.im + self.twiddle1.re * x811p.im + self.twiddle6.re * x910p.im; let b712im_b = self.twiddle7.im * x118n.re + -self.twiddle5.im * x217n.re + self.twiddle2.im * x316n.re + self.twiddle9.im * x415n.re + -self.twiddle3.im * x514n.re + self.twiddle4.im * x613n.re + -self.twiddle8.im * x712n.re + -self.twiddle1.im * x811n.re + self.twiddle6.im * x910n.re; let b811im_a = buffer.load(0).im + self.twiddle8.re * x118p.im + self.twiddle3.re * x217p.im + self.twiddle5.re * x316p.im + self.twiddle6.re * x415p.im + self.twiddle2.re * x514p.im + self.twiddle9.re * x613p.im + self.twiddle1.re * x712p.im + self.twiddle7.re * 
x811p.im + self.twiddle4.re * x910p.im; let b811im_b = self.twiddle8.im * x118n.re + -self.twiddle3.im * x217n.re + self.twiddle5.im * x316n.re + -self.twiddle6.im * x415n.re + self.twiddle2.im * x514n.re + -self.twiddle9.im * x613n.re + -self.twiddle1.im * x712n.re + self.twiddle7.im * x811n.re + -self.twiddle4.im * x910n.re; let b910im_a = buffer.load(0).im + self.twiddle9.re * x118p.im + self.twiddle1.re * x217p.im + self.twiddle8.re * x316p.im + self.twiddle2.re * x415p.im + self.twiddle7.re * x514p.im + self.twiddle3.re * x613p.im + self.twiddle6.re * x712p.im + self.twiddle4.re * x811p.im + self.twiddle5.re * x910p.im; let b910im_b = self.twiddle9.im * x118n.re + -self.twiddle1.im * x217n.re + self.twiddle8.im * x316n.re + -self.twiddle2.im * x415n.re + self.twiddle7.im * x514n.re + -self.twiddle3.im * x613n.re + self.twiddle6.im * x712n.re + -self.twiddle4.im * x811n.re + self.twiddle5.im * x910n.re; let out1re = b118re_a - b118re_b; let out1im = b118im_a + b118im_b; let out2re = b217re_a - b217re_b; let out2im = b217im_a + b217im_b; let out3re = b316re_a - b316re_b; let out3im = b316im_a + b316im_b; let out4re = b415re_a - b415re_b; let out4im = b415im_a + b415im_b; let out5re = b514re_a - b514re_b; let out5im = b514im_a + b514im_b; let out6re = b613re_a - b613re_b; let out6im = b613im_a + b613im_b; let out7re = b712re_a - b712re_b; let out7im = b712im_a + b712im_b; let out8re = b811re_a - b811re_b; let out8im = b811im_a + b811im_b; let out9re = b910re_a - b910re_b; let out9im = b910im_a + b910im_b; let out10re = b910re_a + b910re_b; let out10im = b910im_a - b910im_b; let out11re = b811re_a + b811re_b; let out11im = b811im_a - b811im_b; let out12re = b712re_a + b712re_b; let out12im = b712im_a - b712im_b; let out13re = b613re_a + b613re_b; let out13im = b613im_a - b613im_b; let out14re = b514re_a + b514re_b; let out14im = b514im_a - b514im_b; let out15re = b415re_a + b415re_b; let out15im = b415im_a - b415im_b; let out16re = b316re_a + b316re_b; let out16im = b316im_a - b316im_b; let out17re = b217re_a + b217re_b; let out17im = b217im_a - b217im_b; let out18re = b118re_a + b118re_b; let out18im = b118im_a - b118im_b; buffer.store(sum, 0); buffer.store( Complex { re: out1re, im: out1im, }, 1, ); buffer.store( Complex { re: out2re, im: out2im, }, 2, ); buffer.store( Complex { re: out3re, im: out3im, }, 3, ); buffer.store( Complex { re: out4re, im: out4im, }, 4, ); buffer.store( Complex { re: out5re, im: out5im, }, 5, ); buffer.store( Complex { re: out6re, im: out6im, }, 6, ); buffer.store( Complex { re: out7re, im: out7im, }, 7, ); buffer.store( Complex { re: out8re, im: out8im, }, 8, ); buffer.store( Complex { re: out9re, im: out9im, }, 9, ); buffer.store( Complex { re: out10re, im: out10im, }, 10, ); buffer.store( Complex { re: out11re, im: out11im, }, 11, ); buffer.store( Complex { re: out12re, im: out12im, }, 12, ); buffer.store( Complex { re: out13re, im: out13im, }, 13, ); buffer.store( Complex { re: out14re, im: out14im, }, 14, ); buffer.store( Complex { re: out15re, im: out15im, }, 15, ); buffer.store( Complex { re: out16re, im: out16im, }, 16, ); buffer.store( Complex { re: out17re, im: out17im, }, 17, ); buffer.store( Complex { re: out18re, im: out18im, }, 18, ); } } pub struct Butterfly23 { twiddle1: Complex, twiddle2: Complex, twiddle3: Complex, twiddle4: Complex, twiddle5: Complex, twiddle6: Complex, twiddle7: Complex, twiddle8: Complex, twiddle9: Complex, twiddle10: Complex, twiddle11: Complex, direction: FftDirection, } boilerplate_fft_butterfly!(Butterfly23, 23, 
|this: &Butterfly23<_>| this.direction); impl Butterfly23 { pub fn new(direction: FftDirection) -> Self { let twiddle1: Complex = twiddles::compute_twiddle(1, 23, direction); let twiddle2: Complex = twiddles::compute_twiddle(2, 23, direction); let twiddle3: Complex = twiddles::compute_twiddle(3, 23, direction); let twiddle4: Complex = twiddles::compute_twiddle(4, 23, direction); let twiddle5: Complex = twiddles::compute_twiddle(5, 23, direction); let twiddle6: Complex = twiddles::compute_twiddle(6, 23, direction); let twiddle7: Complex = twiddles::compute_twiddle(7, 23, direction); let twiddle8: Complex = twiddles::compute_twiddle(8, 23, direction); let twiddle9: Complex = twiddles::compute_twiddle(9, 23, direction); let twiddle10: Complex = twiddles::compute_twiddle(10, 23, direction); let twiddle11: Complex = twiddles::compute_twiddle(11, 23, direction); Self { twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, twiddle6, twiddle7, twiddle8, twiddle9, twiddle10, twiddle11, direction, } } #[inline(never)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { // This function was derived in the same manner as the butterflies for length 3, 5 and 7. // However, instead of doing it by hand the actual code is autogenerated // with the `genbutterflies.py` script in the `tools` directory. let x122p = buffer.load(1) + buffer.load(22); let x122n = buffer.load(1) - buffer.load(22); let x221p = buffer.load(2) + buffer.load(21); let x221n = buffer.load(2) - buffer.load(21); let x320p = buffer.load(3) + buffer.load(20); let x320n = buffer.load(3) - buffer.load(20); let x419p = buffer.load(4) + buffer.load(19); let x419n = buffer.load(4) - buffer.load(19); let x518p = buffer.load(5) + buffer.load(18); let x518n = buffer.load(5) - buffer.load(18); let x617p = buffer.load(6) + buffer.load(17); let x617n = buffer.load(6) - buffer.load(17); let x716p = buffer.load(7) + buffer.load(16); let x716n = buffer.load(7) - buffer.load(16); let x815p = buffer.load(8) + buffer.load(15); let x815n = buffer.load(8) - buffer.load(15); let x914p = buffer.load(9) + buffer.load(14); let x914n = buffer.load(9) - buffer.load(14); let x1013p = buffer.load(10) + buffer.load(13); let x1013n = buffer.load(10) - buffer.load(13); let x1112p = buffer.load(11) + buffer.load(12); let x1112n = buffer.load(11) - buffer.load(12); let sum = buffer.load(0) + x122p + x221p + x320p + x419p + x518p + x617p + x716p + x815p + x914p + x1013p + x1112p; let b122re_a = buffer.load(0).re + self.twiddle1.re * x122p.re + self.twiddle2.re * x221p.re + self.twiddle3.re * x320p.re + self.twiddle4.re * x419p.re + self.twiddle5.re * x518p.re + self.twiddle6.re * x617p.re + self.twiddle7.re * x716p.re + self.twiddle8.re * x815p.re + self.twiddle9.re * x914p.re + self.twiddle10.re * x1013p.re + self.twiddle11.re * x1112p.re; let b122re_b = self.twiddle1.im * x122n.im + self.twiddle2.im * x221n.im + self.twiddle3.im * x320n.im + self.twiddle4.im * x419n.im + self.twiddle5.im * x518n.im + self.twiddle6.im * x617n.im + self.twiddle7.im * x716n.im + self.twiddle8.im * x815n.im + self.twiddle9.im * x914n.im + self.twiddle10.im * x1013n.im + self.twiddle11.im * x1112n.im; let b221re_a = buffer.load(0).re + self.twiddle2.re * x122p.re + self.twiddle4.re * x221p.re + self.twiddle6.re * x320p.re + self.twiddle8.re * x419p.re + self.twiddle10.re * x518p.re + self.twiddle11.re * x617p.re + self.twiddle9.re * x716p.re + self.twiddle7.re * x815p.re + self.twiddle5.re * x914p.re + self.twiddle3.re * x1013p.re + self.twiddle1.re * x1112p.re; let b221re_b = 
self.twiddle2.im * x122n.im + self.twiddle4.im * x221n.im + self.twiddle6.im * x320n.im + self.twiddle8.im * x419n.im + self.twiddle10.im * x518n.im + -self.twiddle11.im * x617n.im + -self.twiddle9.im * x716n.im + -self.twiddle7.im * x815n.im + -self.twiddle5.im * x914n.im + -self.twiddle3.im * x1013n.im + -self.twiddle1.im * x1112n.im; let b320re_a = buffer.load(0).re + self.twiddle3.re * x122p.re + self.twiddle6.re * x221p.re + self.twiddle9.re * x320p.re + self.twiddle11.re * x419p.re + self.twiddle8.re * x518p.re + self.twiddle5.re * x617p.re + self.twiddle2.re * x716p.re + self.twiddle1.re * x815p.re + self.twiddle4.re * x914p.re + self.twiddle7.re * x1013p.re + self.twiddle10.re * x1112p.re; let b320re_b = self.twiddle3.im * x122n.im + self.twiddle6.im * x221n.im + self.twiddle9.im * x320n.im + -self.twiddle11.im * x419n.im + -self.twiddle8.im * x518n.im + -self.twiddle5.im * x617n.im + -self.twiddle2.im * x716n.im + self.twiddle1.im * x815n.im + self.twiddle4.im * x914n.im + self.twiddle7.im * x1013n.im + self.twiddle10.im * x1112n.im; let b419re_a = buffer.load(0).re + self.twiddle4.re * x122p.re + self.twiddle8.re * x221p.re + self.twiddle11.re * x320p.re + self.twiddle7.re * x419p.re + self.twiddle3.re * x518p.re + self.twiddle1.re * x617p.re + self.twiddle5.re * x716p.re + self.twiddle9.re * x815p.re + self.twiddle10.re * x914p.re + self.twiddle6.re * x1013p.re + self.twiddle2.re * x1112p.re; let b419re_b = self.twiddle4.im * x122n.im + self.twiddle8.im * x221n.im + -self.twiddle11.im * x320n.im + -self.twiddle7.im * x419n.im + -self.twiddle3.im * x518n.im + self.twiddle1.im * x617n.im + self.twiddle5.im * x716n.im + self.twiddle9.im * x815n.im + -self.twiddle10.im * x914n.im + -self.twiddle6.im * x1013n.im + -self.twiddle2.im * x1112n.im; let b518re_a = buffer.load(0).re + self.twiddle5.re * x122p.re + self.twiddle10.re * x221p.re + self.twiddle8.re * x320p.re + self.twiddle3.re * x419p.re + self.twiddle2.re * x518p.re + self.twiddle7.re * x617p.re + self.twiddle11.re * x716p.re + self.twiddle6.re * x815p.re + self.twiddle1.re * x914p.re + self.twiddle4.re * x1013p.re + self.twiddle9.re * x1112p.re; let b518re_b = self.twiddle5.im * x122n.im + self.twiddle10.im * x221n.im + -self.twiddle8.im * x320n.im + -self.twiddle3.im * x419n.im + self.twiddle2.im * x518n.im + self.twiddle7.im * x617n.im + -self.twiddle11.im * x716n.im + -self.twiddle6.im * x815n.im + -self.twiddle1.im * x914n.im + self.twiddle4.im * x1013n.im + self.twiddle9.im * x1112n.im; let b617re_a = buffer.load(0).re + self.twiddle6.re * x122p.re + self.twiddle11.re * x221p.re + self.twiddle5.re * x320p.re + self.twiddle1.re * x419p.re + self.twiddle7.re * x518p.re + self.twiddle10.re * x617p.re + self.twiddle4.re * x716p.re + self.twiddle2.re * x815p.re + self.twiddle8.re * x914p.re + self.twiddle9.re * x1013p.re + self.twiddle3.re * x1112p.re; let b617re_b = self.twiddle6.im * x122n.im + -self.twiddle11.im * x221n.im + -self.twiddle5.im * x320n.im + self.twiddle1.im * x419n.im + self.twiddle7.im * x518n.im + -self.twiddle10.im * x617n.im + -self.twiddle4.im * x716n.im + self.twiddle2.im * x815n.im + self.twiddle8.im * x914n.im + -self.twiddle9.im * x1013n.im + -self.twiddle3.im * x1112n.im; let b716re_a = buffer.load(0).re + self.twiddle7.re * x122p.re + self.twiddle9.re * x221p.re + self.twiddle2.re * x320p.re + self.twiddle5.re * x419p.re + self.twiddle11.re * x518p.re + self.twiddle4.re * x617p.re + self.twiddle3.re * x716p.re + self.twiddle10.re * x815p.re + self.twiddle6.re * x914p.re + self.twiddle1.re * 
x1013p.re + self.twiddle8.re * x1112p.re; let b716re_b = self.twiddle7.im * x122n.im + -self.twiddle9.im * x221n.im + -self.twiddle2.im * x320n.im + self.twiddle5.im * x419n.im + -self.twiddle11.im * x518n.im + -self.twiddle4.im * x617n.im + self.twiddle3.im * x716n.im + self.twiddle10.im * x815n.im + -self.twiddle6.im * x914n.im + self.twiddle1.im * x1013n.im + self.twiddle8.im * x1112n.im; let b815re_a = buffer.load(0).re + self.twiddle8.re * x122p.re + self.twiddle7.re * x221p.re + self.twiddle1.re * x320p.re + self.twiddle9.re * x419p.re + self.twiddle6.re * x518p.re + self.twiddle2.re * x617p.re + self.twiddle10.re * x716p.re + self.twiddle5.re * x815p.re + self.twiddle3.re * x914p.re + self.twiddle11.re * x1013p.re + self.twiddle4.re * x1112p.re; let b815re_b = self.twiddle8.im * x122n.im + -self.twiddle7.im * x221n.im + self.twiddle1.im * x320n.im + self.twiddle9.im * x419n.im + -self.twiddle6.im * x518n.im + self.twiddle2.im * x617n.im + self.twiddle10.im * x716n.im + -self.twiddle5.im * x815n.im + self.twiddle3.im * x914n.im + self.twiddle11.im * x1013n.im + -self.twiddle4.im * x1112n.im; let b914re_a = buffer.load(0).re + self.twiddle9.re * x122p.re + self.twiddle5.re * x221p.re + self.twiddle4.re * x320p.re + self.twiddle10.re * x419p.re + self.twiddle1.re * x518p.re + self.twiddle8.re * x617p.re + self.twiddle6.re * x716p.re + self.twiddle3.re * x815p.re + self.twiddle11.re * x914p.re + self.twiddle2.re * x1013p.re + self.twiddle7.re * x1112p.re; let b914re_b = self.twiddle9.im * x122n.im + -self.twiddle5.im * x221n.im + self.twiddle4.im * x320n.im + -self.twiddle10.im * x419n.im + -self.twiddle1.im * x518n.im + self.twiddle8.im * x617n.im + -self.twiddle6.im * x716n.im + self.twiddle3.im * x815n.im + -self.twiddle11.im * x914n.im + -self.twiddle2.im * x1013n.im + self.twiddle7.im * x1112n.im; let b1013re_a = buffer.load(0).re + self.twiddle10.re * x122p.re + self.twiddle3.re * x221p.re + self.twiddle7.re * x320p.re + self.twiddle6.re * x419p.re + self.twiddle4.re * x518p.re + self.twiddle9.re * x617p.re + self.twiddle1.re * x716p.re + self.twiddle11.re * x815p.re + self.twiddle2.re * x914p.re + self.twiddle8.re * x1013p.re + self.twiddle5.re * x1112p.re; let b1013re_b = self.twiddle10.im * x122n.im + -self.twiddle3.im * x221n.im + self.twiddle7.im * x320n.im + -self.twiddle6.im * x419n.im + self.twiddle4.im * x518n.im + -self.twiddle9.im * x617n.im + self.twiddle1.im * x716n.im + self.twiddle11.im * x815n.im + -self.twiddle2.im * x914n.im + self.twiddle8.im * x1013n.im + -self.twiddle5.im * x1112n.im; let b1112re_a = buffer.load(0).re + self.twiddle11.re * x122p.re + self.twiddle1.re * x221p.re + self.twiddle10.re * x320p.re + self.twiddle2.re * x419p.re + self.twiddle9.re * x518p.re + self.twiddle3.re * x617p.re + self.twiddle8.re * x716p.re + self.twiddle4.re * x815p.re + self.twiddle7.re * x914p.re + self.twiddle5.re * x1013p.re + self.twiddle6.re * x1112p.re; let b1112re_b = self.twiddle11.im * x122n.im + -self.twiddle1.im * x221n.im + self.twiddle10.im * x320n.im + -self.twiddle2.im * x419n.im + self.twiddle9.im * x518n.im + -self.twiddle3.im * x617n.im + self.twiddle8.im * x716n.im + -self.twiddle4.im * x815n.im + self.twiddle7.im * x914n.im + -self.twiddle5.im * x1013n.im + self.twiddle6.im * x1112n.im; let b122im_a = buffer.load(0).im + self.twiddle1.re * x122p.im + self.twiddle2.re * x221p.im + self.twiddle3.re * x320p.im + self.twiddle4.re * x419p.im + self.twiddle5.re * x518p.im + self.twiddle6.re * x617p.im + self.twiddle7.re * x716p.im + self.twiddle8.re * 
x815p.im + self.twiddle9.re * x914p.im + self.twiddle10.re * x1013p.im + self.twiddle11.re * x1112p.im; let b122im_b = self.twiddle1.im * x122n.re + self.twiddle2.im * x221n.re + self.twiddle3.im * x320n.re + self.twiddle4.im * x419n.re + self.twiddle5.im * x518n.re + self.twiddle6.im * x617n.re + self.twiddle7.im * x716n.re + self.twiddle8.im * x815n.re + self.twiddle9.im * x914n.re + self.twiddle10.im * x1013n.re + self.twiddle11.im * x1112n.re; let b221im_a = buffer.load(0).im + self.twiddle2.re * x122p.im + self.twiddle4.re * x221p.im + self.twiddle6.re * x320p.im + self.twiddle8.re * x419p.im + self.twiddle10.re * x518p.im + self.twiddle11.re * x617p.im + self.twiddle9.re * x716p.im + self.twiddle7.re * x815p.im + self.twiddle5.re * x914p.im + self.twiddle3.re * x1013p.im + self.twiddle1.re * x1112p.im; let b221im_b = self.twiddle2.im * x122n.re + self.twiddle4.im * x221n.re + self.twiddle6.im * x320n.re + self.twiddle8.im * x419n.re + self.twiddle10.im * x518n.re + -self.twiddle11.im * x617n.re + -self.twiddle9.im * x716n.re + -self.twiddle7.im * x815n.re + -self.twiddle5.im * x914n.re + -self.twiddle3.im * x1013n.re + -self.twiddle1.im * x1112n.re; let b320im_a = buffer.load(0).im + self.twiddle3.re * x122p.im + self.twiddle6.re * x221p.im + self.twiddle9.re * x320p.im + self.twiddle11.re * x419p.im + self.twiddle8.re * x518p.im + self.twiddle5.re * x617p.im + self.twiddle2.re * x716p.im + self.twiddle1.re * x815p.im + self.twiddle4.re * x914p.im + self.twiddle7.re * x1013p.im + self.twiddle10.re * x1112p.im; let b320im_b = self.twiddle3.im * x122n.re + self.twiddle6.im * x221n.re + self.twiddle9.im * x320n.re + -self.twiddle11.im * x419n.re + -self.twiddle8.im * x518n.re + -self.twiddle5.im * x617n.re + -self.twiddle2.im * x716n.re + self.twiddle1.im * x815n.re + self.twiddle4.im * x914n.re + self.twiddle7.im * x1013n.re + self.twiddle10.im * x1112n.re; let b419im_a = buffer.load(0).im + self.twiddle4.re * x122p.im + self.twiddle8.re * x221p.im + self.twiddle11.re * x320p.im + self.twiddle7.re * x419p.im + self.twiddle3.re * x518p.im + self.twiddle1.re * x617p.im + self.twiddle5.re * x716p.im + self.twiddle9.re * x815p.im + self.twiddle10.re * x914p.im + self.twiddle6.re * x1013p.im + self.twiddle2.re * x1112p.im; let b419im_b = self.twiddle4.im * x122n.re + self.twiddle8.im * x221n.re + -self.twiddle11.im * x320n.re + -self.twiddle7.im * x419n.re + -self.twiddle3.im * x518n.re + self.twiddle1.im * x617n.re + self.twiddle5.im * x716n.re + self.twiddle9.im * x815n.re + -self.twiddle10.im * x914n.re + -self.twiddle6.im * x1013n.re + -self.twiddle2.im * x1112n.re; let b518im_a = buffer.load(0).im + self.twiddle5.re * x122p.im + self.twiddle10.re * x221p.im + self.twiddle8.re * x320p.im + self.twiddle3.re * x419p.im + self.twiddle2.re * x518p.im + self.twiddle7.re * x617p.im + self.twiddle11.re * x716p.im + self.twiddle6.re * x815p.im + self.twiddle1.re * x914p.im + self.twiddle4.re * x1013p.im + self.twiddle9.re * x1112p.im; let b518im_b = self.twiddle5.im * x122n.re + self.twiddle10.im * x221n.re + -self.twiddle8.im * x320n.re + -self.twiddle3.im * x419n.re + self.twiddle2.im * x518n.re + self.twiddle7.im * x617n.re + -self.twiddle11.im * x716n.re + -self.twiddle6.im * x815n.re + -self.twiddle1.im * x914n.re + self.twiddle4.im * x1013n.re + self.twiddle9.im * x1112n.re; let b617im_a = buffer.load(0).im + self.twiddle6.re * x122p.im + self.twiddle11.re * x221p.im + self.twiddle5.re * x320p.im + self.twiddle1.re * x419p.im + self.twiddle7.re * x518p.im + self.twiddle10.re * x617p.im + 
self.twiddle4.re * x716p.im + self.twiddle2.re * x815p.im + self.twiddle8.re * x914p.im + self.twiddle9.re * x1013p.im + self.twiddle3.re * x1112p.im; let b617im_b = self.twiddle6.im * x122n.re + -self.twiddle11.im * x221n.re + -self.twiddle5.im * x320n.re + self.twiddle1.im * x419n.re + self.twiddle7.im * x518n.re + -self.twiddle10.im * x617n.re + -self.twiddle4.im * x716n.re + self.twiddle2.im * x815n.re + self.twiddle8.im * x914n.re + -self.twiddle9.im * x1013n.re + -self.twiddle3.im * x1112n.re; let b716im_a = buffer.load(0).im + self.twiddle7.re * x122p.im + self.twiddle9.re * x221p.im + self.twiddle2.re * x320p.im + self.twiddle5.re * x419p.im + self.twiddle11.re * x518p.im + self.twiddle4.re * x617p.im + self.twiddle3.re * x716p.im + self.twiddle10.re * x815p.im + self.twiddle6.re * x914p.im + self.twiddle1.re * x1013p.im + self.twiddle8.re * x1112p.im; let b716im_b = self.twiddle7.im * x122n.re + -self.twiddle9.im * x221n.re + -self.twiddle2.im * x320n.re + self.twiddle5.im * x419n.re + -self.twiddle11.im * x518n.re + -self.twiddle4.im * x617n.re + self.twiddle3.im * x716n.re + self.twiddle10.im * x815n.re + -self.twiddle6.im * x914n.re + self.twiddle1.im * x1013n.re + self.twiddle8.im * x1112n.re; let b815im_a = buffer.load(0).im + self.twiddle8.re * x122p.im + self.twiddle7.re * x221p.im + self.twiddle1.re * x320p.im + self.twiddle9.re * x419p.im + self.twiddle6.re * x518p.im + self.twiddle2.re * x617p.im + self.twiddle10.re * x716p.im + self.twiddle5.re * x815p.im + self.twiddle3.re * x914p.im + self.twiddle11.re * x1013p.im + self.twiddle4.re * x1112p.im; let b815im_b = self.twiddle8.im * x122n.re + -self.twiddle7.im * x221n.re + self.twiddle1.im * x320n.re + self.twiddle9.im * x419n.re + -self.twiddle6.im * x518n.re + self.twiddle2.im * x617n.re + self.twiddle10.im * x716n.re + -self.twiddle5.im * x815n.re + self.twiddle3.im * x914n.re + self.twiddle11.im * x1013n.re + -self.twiddle4.im * x1112n.re; let b914im_a = buffer.load(0).im + self.twiddle9.re * x122p.im + self.twiddle5.re * x221p.im + self.twiddle4.re * x320p.im + self.twiddle10.re * x419p.im + self.twiddle1.re * x518p.im + self.twiddle8.re * x617p.im + self.twiddle6.re * x716p.im + self.twiddle3.re * x815p.im + self.twiddle11.re * x914p.im + self.twiddle2.re * x1013p.im + self.twiddle7.re * x1112p.im; let b914im_b = self.twiddle9.im * x122n.re + -self.twiddle5.im * x221n.re + self.twiddle4.im * x320n.re + -self.twiddle10.im * x419n.re + -self.twiddle1.im * x518n.re + self.twiddle8.im * x617n.re + -self.twiddle6.im * x716n.re + self.twiddle3.im * x815n.re + -self.twiddle11.im * x914n.re + -self.twiddle2.im * x1013n.re + self.twiddle7.im * x1112n.re; let b1013im_a = buffer.load(0).im + self.twiddle10.re * x122p.im + self.twiddle3.re * x221p.im + self.twiddle7.re * x320p.im + self.twiddle6.re * x419p.im + self.twiddle4.re * x518p.im + self.twiddle9.re * x617p.im + self.twiddle1.re * x716p.im + self.twiddle11.re * x815p.im + self.twiddle2.re * x914p.im + self.twiddle8.re * x1013p.im + self.twiddle5.re * x1112p.im; let b1013im_b = self.twiddle10.im * x122n.re + -self.twiddle3.im * x221n.re + self.twiddle7.im * x320n.re + -self.twiddle6.im * x419n.re + self.twiddle4.im * x518n.re + -self.twiddle9.im * x617n.re + self.twiddle1.im * x716n.re + self.twiddle11.im * x815n.re + -self.twiddle2.im * x914n.re + self.twiddle8.im * x1013n.re + -self.twiddle5.im * x1112n.re; let b1112im_a = buffer.load(0).im + self.twiddle11.re * x122p.im + self.twiddle1.re * x221p.im + self.twiddle10.re * x320p.im + self.twiddle2.re * x419p.im + 
self.twiddle9.re * x518p.im + self.twiddle3.re * x617p.im + self.twiddle8.re * x716p.im + self.twiddle4.re * x815p.im + self.twiddle7.re * x914p.im + self.twiddle5.re * x1013p.im + self.twiddle6.re * x1112p.im; let b1112im_b = self.twiddle11.im * x122n.re + -self.twiddle1.im * x221n.re + self.twiddle10.im * x320n.re + -self.twiddle2.im * x419n.re + self.twiddle9.im * x518n.re + -self.twiddle3.im * x617n.re + self.twiddle8.im * x716n.re + -self.twiddle4.im * x815n.re + self.twiddle7.im * x914n.re + -self.twiddle5.im * x1013n.re + self.twiddle6.im * x1112n.re; let out1re = b122re_a - b122re_b; let out1im = b122im_a + b122im_b; let out2re = b221re_a - b221re_b; let out2im = b221im_a + b221im_b; let out3re = b320re_a - b320re_b; let out3im = b320im_a + b320im_b; let out4re = b419re_a - b419re_b; let out4im = b419im_a + b419im_b; let out5re = b518re_a - b518re_b; let out5im = b518im_a + b518im_b; let out6re = b617re_a - b617re_b; let out6im = b617im_a + b617im_b; let out7re = b716re_a - b716re_b; let out7im = b716im_a + b716im_b; let out8re = b815re_a - b815re_b; let out8im = b815im_a + b815im_b; let out9re = b914re_a - b914re_b; let out9im = b914im_a + b914im_b; let out10re = b1013re_a - b1013re_b; let out10im = b1013im_a + b1013im_b; let out11re = b1112re_a - b1112re_b; let out11im = b1112im_a + b1112im_b; let out12re = b1112re_a + b1112re_b; let out12im = b1112im_a - b1112im_b; let out13re = b1013re_a + b1013re_b; let out13im = b1013im_a - b1013im_b; let out14re = b914re_a + b914re_b; let out14im = b914im_a - b914im_b; let out15re = b815re_a + b815re_b; let out15im = b815im_a - b815im_b; let out16re = b716re_a + b716re_b; let out16im = b716im_a - b716im_b; let out17re = b617re_a + b617re_b; let out17im = b617im_a - b617im_b; let out18re = b518re_a + b518re_b; let out18im = b518im_a - b518im_b; let out19re = b419re_a + b419re_b; let out19im = b419im_a - b419im_b; let out20re = b320re_a + b320re_b; let out20im = b320im_a - b320im_b; let out21re = b221re_a + b221re_b; let out21im = b221im_a - b221im_b; let out22re = b122re_a + b122re_b; let out22im = b122im_a - b122im_b; buffer.store(sum, 0); buffer.store( Complex { re: out1re, im: out1im, }, 1, ); buffer.store( Complex { re: out2re, im: out2im, }, 2, ); buffer.store( Complex { re: out3re, im: out3im, }, 3, ); buffer.store( Complex { re: out4re, im: out4im, }, 4, ); buffer.store( Complex { re: out5re, im: out5im, }, 5, ); buffer.store( Complex { re: out6re, im: out6im, }, 6, ); buffer.store( Complex { re: out7re, im: out7im, }, 7, ); buffer.store( Complex { re: out8re, im: out8im, }, 8, ); buffer.store( Complex { re: out9re, im: out9im, }, 9, ); buffer.store( Complex { re: out10re, im: out10im, }, 10, ); buffer.store( Complex { re: out11re, im: out11im, }, 11, ); buffer.store( Complex { re: out12re, im: out12im, }, 12, ); buffer.store( Complex { re: out13re, im: out13im, }, 13, ); buffer.store( Complex { re: out14re, im: out14im, }, 14, ); buffer.store( Complex { re: out15re, im: out15im, }, 15, ); buffer.store( Complex { re: out16re, im: out16im, }, 16, ); buffer.store( Complex { re: out17re, im: out17im, }, 17, ); buffer.store( Complex { re: out18re, im: out18im, }, 18, ); buffer.store( Complex { re: out19re, im: out19im, }, 19, ); buffer.store( Complex { re: out20re, im: out20im, }, 20, ); buffer.store( Complex { re: out21re, im: out21im, }, 21, ); buffer.store( Complex { re: out22re, im: out22im, }, 22, ); } } pub struct Butterfly27 { butterfly9: Butterfly9, twiddles: [Complex; 12], } boilerplate_fft_butterfly!(Butterfly27, 27, |this: 
&Butterfly27<_>| this
    .butterfly9
    .fft_direction());
impl<T: FftNum> Butterfly27<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        Self {
            butterfly9: Butterfly9::new(direction),
            twiddles: [
                twiddles::compute_twiddle(1, 27, direction),
                twiddles::compute_twiddle(2, 27, direction),
                twiddles::compute_twiddle(3, 27, direction),
                twiddles::compute_twiddle(4, 27, direction),
                twiddles::compute_twiddle(5, 27, direction),
                twiddles::compute_twiddle(6, 27, direction),
                twiddles::compute_twiddle(7, 27, direction),
                twiddles::compute_twiddle(8, 27, direction),
                twiddles::compute_twiddle(10, 27, direction),
                twiddles::compute_twiddle(12, 27, direction),
                twiddles::compute_twiddle(14, 27, direction),
                twiddles::compute_twiddle(16, 27, direction),
            ],
        }
    }

    #[inline(always)]
    unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore<T>) {
        // algorithm: mixed radix with width=9 and height=3

        // step 1: transpose the input into the scratch
        let mut scratch0 = [
            buffer.load(0),
            buffer.load(3),
            buffer.load(6),
            buffer.load(9),
            buffer.load(12),
            buffer.load(15),
            buffer.load(18),
            buffer.load(21),
            buffer.load(24),
        ];
        let mut scratch1 = [
            buffer.load(1 + 0),
            buffer.load(1 + 3),
            buffer.load(1 + 6),
            buffer.load(1 + 9),
            buffer.load(1 + 12),
            buffer.load(1 + 15),
            buffer.load(1 + 18),
            buffer.load(1 + 21),
            buffer.load(1 + 24),
        ];
        let mut scratch2 = [
            buffer.load(2 + 0),
            buffer.load(2 + 3),
            buffer.load(2 + 6),
            buffer.load(2 + 9),
            buffer.load(2 + 12),
            buffer.load(2 + 15),
            buffer.load(2 + 18),
            buffer.load(2 + 21),
            buffer.load(2 + 24),
        ];

        // step 2: column FFTs
        self.butterfly9.perform_fft_contiguous(&mut scratch0);
        self.butterfly9.perform_fft_contiguous(&mut scratch1);
        self.butterfly9.perform_fft_contiguous(&mut scratch2);

        // step 3: apply twiddle factors
        scratch1[1] = scratch1[1] * self.twiddles[0];
        scratch1[2] = scratch1[2] * self.twiddles[1];
        scratch1[3] = scratch1[3] * self.twiddles[2];
        scratch1[4] = scratch1[4] * self.twiddles[3];
        scratch1[5] = scratch1[5] * self.twiddles[4];
        scratch1[6] = scratch1[6] * self.twiddles[5];
        scratch1[7] = scratch1[7] * self.twiddles[6];
        scratch1[8] = scratch1[8] * self.twiddles[7];
        scratch2[1] = scratch2[1] * self.twiddles[1];
        scratch2[2] = scratch2[2] * self.twiddles[3];
        scratch2[3] = scratch2[3] * self.twiddles[5];
        scratch2[4] = scratch2[4] * self.twiddles[7];
        scratch2[5] = scratch2[5] * self.twiddles[8];
        scratch2[6] = scratch2[6] * self.twiddles[9];
        scratch2[7] = scratch2[7] * self.twiddles[10];
        scratch2[8] = scratch2[8] * self.twiddles[11];

        // step 4: SKIPPED because the next FFTs will be non-contiguous

        // step 5: row FFTs
        self.butterfly9.butterfly3.perform_fft_strided(
            &mut scratch0[0],
            &mut scratch1[0],
            &mut scratch2[0],
        );
        self.butterfly9.butterfly3.perform_fft_strided(
            &mut scratch0[1],
            &mut scratch1[1],
            &mut scratch2[1],
        );
        self.butterfly9.butterfly3.perform_fft_strided(
            &mut scratch0[2],
            &mut scratch1[2],
            &mut scratch2[2],
        );
        self.butterfly9.butterfly3.perform_fft_strided(
            &mut scratch0[3],
            &mut scratch1[3],
            &mut scratch2[3],
        );
        self.butterfly9.butterfly3.perform_fft_strided(
            &mut scratch0[4],
            &mut scratch1[4],
            &mut scratch2[4],
        );
        self.butterfly9.butterfly3.perform_fft_strided(
            &mut scratch0[5],
            &mut scratch1[5],
            &mut scratch2[5],
        );
        self.butterfly9.butterfly3.perform_fft_strided(
            &mut scratch0[6],
            &mut scratch1[6],
            &mut scratch2[6],
        );
        self.butterfly9.butterfly3.perform_fft_strided(
            &mut scratch0[7],
            &mut scratch1[7],
            &mut scratch2[7],
        );
        self.butterfly9.butterfly3.perform_fft_strided(
            &mut scratch0[8],
            &mut scratch1[8],
            &mut scratch2[8],
        );

        // step 6: copy the result into the output. Normally we'd need to do a transpose here,
        // but we can skip it because we skipped the transpose in step 4.
        buffer.store(scratch0[0], 0);
        buffer.store(scratch0[1], 1);
        buffer.store(scratch0[2], 2);
        buffer.store(scratch0[3], 3);
        buffer.store(scratch0[4], 4);
        buffer.store(scratch0[5], 5);
        buffer.store(scratch0[6], 6);
        buffer.store(scratch0[7], 7);
        buffer.store(scratch0[8], 8);
        buffer.store(scratch1[0], 9 + 0);
        buffer.store(scratch1[1], 9 + 1);
        buffer.store(scratch1[2], 9 + 2);
        buffer.store(scratch1[3], 9 + 3);
        buffer.store(scratch1[4], 9 + 4);
        buffer.store(scratch1[5], 9 + 5);
        buffer.store(scratch1[6], 9 + 6);
        buffer.store(scratch1[7], 9 + 7);
        buffer.store(scratch1[8], 9 + 8);
        buffer.store(scratch2[0], 18 + 0);
        buffer.store(scratch2[1], 18 + 1);
        buffer.store(scratch2[2], 18 + 2);
        buffer.store(scratch2[3], 18 + 3);
        buffer.store(scratch2[4], 18 + 4);
        buffer.store(scratch2[5], 18 + 5);
        buffer.store(scratch2[6], 18 + 6);
        buffer.store(scratch2[7], 18 + 7);
        buffer.store(scratch2[8], 18 + 8);
    }
}

pub struct Butterfly29<T> {
    twiddle1: Complex<T>,
    twiddle2: Complex<T>,
    twiddle3: Complex<T>,
    twiddle4: Complex<T>,
    twiddle5: Complex<T>,
    twiddle6: Complex<T>,
    twiddle7: Complex<T>,
    twiddle8: Complex<T>,
    twiddle9: Complex<T>,
    twiddle10: Complex<T>,
    twiddle11: Complex<T>,
    twiddle12: Complex<T>,
    twiddle13: Complex<T>,
    twiddle14: Complex<T>,
    direction: FftDirection,
}
boilerplate_fft_butterfly!(Butterfly29, 29, |this: &Butterfly29<_>| this.direction);
impl<T: FftNum> Butterfly29<T> {
    pub fn new(direction: FftDirection) -> Self {
        let twiddle1: Complex<T> = twiddles::compute_twiddle(1, 29, direction);
        let twiddle2: Complex<T> = twiddles::compute_twiddle(2, 29, direction);
        let twiddle3: Complex<T> = twiddles::compute_twiddle(3, 29, direction);
        let twiddle4: Complex<T> = twiddles::compute_twiddle(4, 29, direction);
        let twiddle5: Complex<T> = twiddles::compute_twiddle(5, 29, direction);
        let twiddle6: Complex<T> = twiddles::compute_twiddle(6, 29, direction);
        let twiddle7: Complex<T> = twiddles::compute_twiddle(7, 29, direction);
        let twiddle8: Complex<T> = twiddles::compute_twiddle(8, 29, direction);
        let twiddle9: Complex<T> = twiddles::compute_twiddle(9, 29, direction);
        let twiddle10: Complex<T> = twiddles::compute_twiddle(10, 29, direction);
        let twiddle11: Complex<T> = twiddles::compute_twiddle(11, 29, direction);
        let twiddle12: Complex<T> = twiddles::compute_twiddle(12, 29, direction);
        let twiddle13: Complex<T> = twiddles::compute_twiddle(13, 29, direction);
        let twiddle14: Complex<T> = twiddles::compute_twiddle(14, 29, direction);
        Self {
            twiddle1,
            twiddle2,
            twiddle3,
            twiddle4,
            twiddle5,
            twiddle6,
            twiddle7,
            twiddle8,
            twiddle9,
            twiddle10,
            twiddle11,
            twiddle12,
            twiddle13,
            twiddle14,
            direction,
        }
    }

    #[inline(never)]
    unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore<T>) {
        // This function was derived in the same manner as the butterflies for length 3, 5 and 7.
        // However, instead of doing it by hand the actual code is autogenerated
        // with the `genbutterflies.py` script in the `tools` directory.
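        //
        // How to read the generated code below: the inputs are first folded into sums and
        // differences of the mirrored pairs (x[j], x[29 - j]), named `x{j}{29-j}p` and
        // `x{j}{29-j}n`. For each output pair (k, 29 - k), the `b..._a` terms accumulate
        // x[0] plus the pair sums scaled by the twiddles' real parts, and the `b..._b` terms
        // accumulate products of the pair differences with the twiddles' imaginary parts.
        // Because cosine is even and sine is odd, outputs k and 29 - k share those partial
        // sums: for example, out1 = (b128re_a - b128re_b) + i*(b128im_a + b128im_b), while
        // out28 = (b128re_a + b128re_b) + i*(b128im_a - b128im_b). Only twiddle1..twiddle14
        // are stored, so a product whose twiddle index (j * k mod 29) falls in the upper half
        // of the circle appears as the mirrored twiddle with a negated imaginary part.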
let x128p = buffer.load(1) + buffer.load(28); let x128n = buffer.load(1) - buffer.load(28); let x227p = buffer.load(2) + buffer.load(27); let x227n = buffer.load(2) - buffer.load(27); let x326p = buffer.load(3) + buffer.load(26); let x326n = buffer.load(3) - buffer.load(26); let x425p = buffer.load(4) + buffer.load(25); let x425n = buffer.load(4) - buffer.load(25); let x524p = buffer.load(5) + buffer.load(24); let x524n = buffer.load(5) - buffer.load(24); let x623p = buffer.load(6) + buffer.load(23); let x623n = buffer.load(6) - buffer.load(23); let x722p = buffer.load(7) + buffer.load(22); let x722n = buffer.load(7) - buffer.load(22); let x821p = buffer.load(8) + buffer.load(21); let x821n = buffer.load(8) - buffer.load(21); let x920p = buffer.load(9) + buffer.load(20); let x920n = buffer.load(9) - buffer.load(20); let x1019p = buffer.load(10) + buffer.load(19); let x1019n = buffer.load(10) - buffer.load(19); let x1118p = buffer.load(11) + buffer.load(18); let x1118n = buffer.load(11) - buffer.load(18); let x1217p = buffer.load(12) + buffer.load(17); let x1217n = buffer.load(12) - buffer.load(17); let x1316p = buffer.load(13) + buffer.load(16); let x1316n = buffer.load(13) - buffer.load(16); let x1415p = buffer.load(14) + buffer.load(15); let x1415n = buffer.load(14) - buffer.load(15); let sum = buffer.load(0) + x128p + x227p + x326p + x425p + x524p + x623p + x722p + x821p + x920p + x1019p + x1118p + x1217p + x1316p + x1415p; let b128re_a = buffer.load(0).re + self.twiddle1.re * x128p.re + self.twiddle2.re * x227p.re + self.twiddle3.re * x326p.re + self.twiddle4.re * x425p.re + self.twiddle5.re * x524p.re + self.twiddle6.re * x623p.re + self.twiddle7.re * x722p.re + self.twiddle8.re * x821p.re + self.twiddle9.re * x920p.re + self.twiddle10.re * x1019p.re + self.twiddle11.re * x1118p.re + self.twiddle12.re * x1217p.re + self.twiddle13.re * x1316p.re + self.twiddle14.re * x1415p.re; let b128re_b = self.twiddle1.im * x128n.im + self.twiddle2.im * x227n.im + self.twiddle3.im * x326n.im + self.twiddle4.im * x425n.im + self.twiddle5.im * x524n.im + self.twiddle6.im * x623n.im + self.twiddle7.im * x722n.im + self.twiddle8.im * x821n.im + self.twiddle9.im * x920n.im + self.twiddle10.im * x1019n.im + self.twiddle11.im * x1118n.im + self.twiddle12.im * x1217n.im + self.twiddle13.im * x1316n.im + self.twiddle14.im * x1415n.im; let b227re_a = buffer.load(0).re + self.twiddle2.re * x128p.re + self.twiddle4.re * x227p.re + self.twiddle6.re * x326p.re + self.twiddle8.re * x425p.re + self.twiddle10.re * x524p.re + self.twiddle12.re * x623p.re + self.twiddle14.re * x722p.re + self.twiddle13.re * x821p.re + self.twiddle11.re * x920p.re + self.twiddle9.re * x1019p.re + self.twiddle7.re * x1118p.re + self.twiddle5.re * x1217p.re + self.twiddle3.re * x1316p.re + self.twiddle1.re * x1415p.re; let b227re_b = self.twiddle2.im * x128n.im + self.twiddle4.im * x227n.im + self.twiddle6.im * x326n.im + self.twiddle8.im * x425n.im + self.twiddle10.im * x524n.im + self.twiddle12.im * x623n.im + self.twiddle14.im * x722n.im + -self.twiddle13.im * x821n.im + -self.twiddle11.im * x920n.im + -self.twiddle9.im * x1019n.im + -self.twiddle7.im * x1118n.im + -self.twiddle5.im * x1217n.im + -self.twiddle3.im * x1316n.im + -self.twiddle1.im * x1415n.im; let b326re_a = buffer.load(0).re + self.twiddle3.re * x128p.re + self.twiddle6.re * x227p.re + self.twiddle9.re * x326p.re + self.twiddle12.re * x425p.re + self.twiddle14.re * x524p.re + self.twiddle11.re * x623p.re + self.twiddle8.re * x722p.re + self.twiddle5.re * x821p.re + 
self.twiddle2.re * x920p.re + self.twiddle1.re * x1019p.re + self.twiddle4.re * x1118p.re + self.twiddle7.re * x1217p.re + self.twiddle10.re * x1316p.re + self.twiddle13.re * x1415p.re; let b326re_b = self.twiddle3.im * x128n.im + self.twiddle6.im * x227n.im + self.twiddle9.im * x326n.im + self.twiddle12.im * x425n.im + -self.twiddle14.im * x524n.im + -self.twiddle11.im * x623n.im + -self.twiddle8.im * x722n.im + -self.twiddle5.im * x821n.im + -self.twiddle2.im * x920n.im + self.twiddle1.im * x1019n.im + self.twiddle4.im * x1118n.im + self.twiddle7.im * x1217n.im + self.twiddle10.im * x1316n.im + self.twiddle13.im * x1415n.im; let b425re_a = buffer.load(0).re + self.twiddle4.re * x128p.re + self.twiddle8.re * x227p.re + self.twiddle12.re * x326p.re + self.twiddle13.re * x425p.re + self.twiddle9.re * x524p.re + self.twiddle5.re * x623p.re + self.twiddle1.re * x722p.re + self.twiddle3.re * x821p.re + self.twiddle7.re * x920p.re + self.twiddle11.re * x1019p.re + self.twiddle14.re * x1118p.re + self.twiddle10.re * x1217p.re + self.twiddle6.re * x1316p.re + self.twiddle2.re * x1415p.re; let b425re_b = self.twiddle4.im * x128n.im + self.twiddle8.im * x227n.im + self.twiddle12.im * x326n.im + -self.twiddle13.im * x425n.im + -self.twiddle9.im * x524n.im + -self.twiddle5.im * x623n.im + -self.twiddle1.im * x722n.im + self.twiddle3.im * x821n.im + self.twiddle7.im * x920n.im + self.twiddle11.im * x1019n.im + -self.twiddle14.im * x1118n.im + -self.twiddle10.im * x1217n.im + -self.twiddle6.im * x1316n.im + -self.twiddle2.im * x1415n.im; let b524re_a = buffer.load(0).re + self.twiddle5.re * x128p.re + self.twiddle10.re * x227p.re + self.twiddle14.re * x326p.re + self.twiddle9.re * x425p.re + self.twiddle4.re * x524p.re + self.twiddle1.re * x623p.re + self.twiddle6.re * x722p.re + self.twiddle11.re * x821p.re + self.twiddle13.re * x920p.re + self.twiddle8.re * x1019p.re + self.twiddle3.re * x1118p.re + self.twiddle2.re * x1217p.re + self.twiddle7.re * x1316p.re + self.twiddle12.re * x1415p.re; let b524re_b = self.twiddle5.im * x128n.im + self.twiddle10.im * x227n.im + -self.twiddle14.im * x326n.im + -self.twiddle9.im * x425n.im + -self.twiddle4.im * x524n.im + self.twiddle1.im * x623n.im + self.twiddle6.im * x722n.im + self.twiddle11.im * x821n.im + -self.twiddle13.im * x920n.im + -self.twiddle8.im * x1019n.im + -self.twiddle3.im * x1118n.im + self.twiddle2.im * x1217n.im + self.twiddle7.im * x1316n.im + self.twiddle12.im * x1415n.im; let b623re_a = buffer.load(0).re + self.twiddle6.re * x128p.re + self.twiddle12.re * x227p.re + self.twiddle11.re * x326p.re + self.twiddle5.re * x425p.re + self.twiddle1.re * x524p.re + self.twiddle7.re * x623p.re + self.twiddle13.re * x722p.re + self.twiddle10.re * x821p.re + self.twiddle4.re * x920p.re + self.twiddle2.re * x1019p.re + self.twiddle8.re * x1118p.re + self.twiddle14.re * x1217p.re + self.twiddle9.re * x1316p.re + self.twiddle3.re * x1415p.re; let b623re_b = self.twiddle6.im * x128n.im + self.twiddle12.im * x227n.im + -self.twiddle11.im * x326n.im + -self.twiddle5.im * x425n.im + self.twiddle1.im * x524n.im + self.twiddle7.im * x623n.im + self.twiddle13.im * x722n.im + -self.twiddle10.im * x821n.im + -self.twiddle4.im * x920n.im + self.twiddle2.im * x1019n.im + self.twiddle8.im * x1118n.im + self.twiddle14.im * x1217n.im + -self.twiddle9.im * x1316n.im + -self.twiddle3.im * x1415n.im; let b722re_a = buffer.load(0).re + self.twiddle7.re * x128p.re + self.twiddle14.re * x227p.re + self.twiddle8.re * x326p.re + self.twiddle1.re * x425p.re + self.twiddle6.re * 
x524p.re + self.twiddle13.re * x623p.re + self.twiddle9.re * x722p.re + self.twiddle2.re * x821p.re + self.twiddle5.re * x920p.re + self.twiddle12.re * x1019p.re + self.twiddle10.re * x1118p.re + self.twiddle3.re * x1217p.re + self.twiddle4.re * x1316p.re + self.twiddle11.re * x1415p.re; let b722re_b = self.twiddle7.im * x128n.im + self.twiddle14.im * x227n.im + -self.twiddle8.im * x326n.im + -self.twiddle1.im * x425n.im + self.twiddle6.im * x524n.im + self.twiddle13.im * x623n.im + -self.twiddle9.im * x722n.im + -self.twiddle2.im * x821n.im + self.twiddle5.im * x920n.im + self.twiddle12.im * x1019n.im + -self.twiddle10.im * x1118n.im + -self.twiddle3.im * x1217n.im + self.twiddle4.im * x1316n.im + self.twiddle11.im * x1415n.im; let b821re_a = buffer.load(0).re + self.twiddle8.re * x128p.re + self.twiddle13.re * x227p.re + self.twiddle5.re * x326p.re + self.twiddle3.re * x425p.re + self.twiddle11.re * x524p.re + self.twiddle10.re * x623p.re + self.twiddle2.re * x722p.re + self.twiddle6.re * x821p.re + self.twiddle14.re * x920p.re + self.twiddle7.re * x1019p.re + self.twiddle1.re * x1118p.re + self.twiddle9.re * x1217p.re + self.twiddle12.re * x1316p.re + self.twiddle4.re * x1415p.re; let b821re_b = self.twiddle8.im * x128n.im + -self.twiddle13.im * x227n.im + -self.twiddle5.im * x326n.im + self.twiddle3.im * x425n.im + self.twiddle11.im * x524n.im + -self.twiddle10.im * x623n.im + -self.twiddle2.im * x722n.im + self.twiddle6.im * x821n.im + self.twiddle14.im * x920n.im + -self.twiddle7.im * x1019n.im + self.twiddle1.im * x1118n.im + self.twiddle9.im * x1217n.im + -self.twiddle12.im * x1316n.im + -self.twiddle4.im * x1415n.im; let b920re_a = buffer.load(0).re + self.twiddle9.re * x128p.re + self.twiddle11.re * x227p.re + self.twiddle2.re * x326p.re + self.twiddle7.re * x425p.re + self.twiddle13.re * x524p.re + self.twiddle4.re * x623p.re + self.twiddle5.re * x722p.re + self.twiddle14.re * x821p.re + self.twiddle6.re * x920p.re + self.twiddle3.re * x1019p.re + self.twiddle12.re * x1118p.re + self.twiddle8.re * x1217p.re + self.twiddle1.re * x1316p.re + self.twiddle10.re * x1415p.re; let b920re_b = self.twiddle9.im * x128n.im + -self.twiddle11.im * x227n.im + -self.twiddle2.im * x326n.im + self.twiddle7.im * x425n.im + -self.twiddle13.im * x524n.im + -self.twiddle4.im * x623n.im + self.twiddle5.im * x722n.im + self.twiddle14.im * x821n.im + -self.twiddle6.im * x920n.im + self.twiddle3.im * x1019n.im + self.twiddle12.im * x1118n.im + -self.twiddle8.im * x1217n.im + self.twiddle1.im * x1316n.im + self.twiddle10.im * x1415n.im; let b1019re_a = buffer.load(0).re + self.twiddle10.re * x128p.re + self.twiddle9.re * x227p.re + self.twiddle1.re * x326p.re + self.twiddle11.re * x425p.re + self.twiddle8.re * x524p.re + self.twiddle2.re * x623p.re + self.twiddle12.re * x722p.re + self.twiddle7.re * x821p.re + self.twiddle3.re * x920p.re + self.twiddle13.re * x1019p.re + self.twiddle6.re * x1118p.re + self.twiddle4.re * x1217p.re + self.twiddle14.re * x1316p.re + self.twiddle5.re * x1415p.re; let b1019re_b = self.twiddle10.im * x128n.im + -self.twiddle9.im * x227n.im + self.twiddle1.im * x326n.im + self.twiddle11.im * x425n.im + -self.twiddle8.im * x524n.im + self.twiddle2.im * x623n.im + self.twiddle12.im * x722n.im + -self.twiddle7.im * x821n.im + self.twiddle3.im * x920n.im + self.twiddle13.im * x1019n.im + -self.twiddle6.im * x1118n.im + self.twiddle4.im * x1217n.im + self.twiddle14.im * x1316n.im + -self.twiddle5.im * x1415n.im; let b1118re_a = buffer.load(0).re + self.twiddle11.re * x128p.re + 
self.twiddle7.re * x227p.re + self.twiddle4.re * x326p.re + self.twiddle14.re * x425p.re + self.twiddle3.re * x524p.re + self.twiddle8.re * x623p.re + self.twiddle10.re * x722p.re + self.twiddle1.re * x821p.re + self.twiddle12.re * x920p.re + self.twiddle6.re * x1019p.re + self.twiddle5.re * x1118p.re + self.twiddle13.re * x1217p.re + self.twiddle2.re * x1316p.re + self.twiddle9.re * x1415p.re; let b1118re_b = self.twiddle11.im * x128n.im + -self.twiddle7.im * x227n.im + self.twiddle4.im * x326n.im + -self.twiddle14.im * x425n.im + -self.twiddle3.im * x524n.im + self.twiddle8.im * x623n.im + -self.twiddle10.im * x722n.im + self.twiddle1.im * x821n.im + self.twiddle12.im * x920n.im + -self.twiddle6.im * x1019n.im + self.twiddle5.im * x1118n.im + -self.twiddle13.im * x1217n.im + -self.twiddle2.im * x1316n.im + self.twiddle9.im * x1415n.im; let b1217re_a = buffer.load(0).re + self.twiddle12.re * x128p.re + self.twiddle5.re * x227p.re + self.twiddle7.re * x326p.re + self.twiddle10.re * x425p.re + self.twiddle2.re * x524p.re + self.twiddle14.re * x623p.re + self.twiddle3.re * x722p.re + self.twiddle9.re * x821p.re + self.twiddle8.re * x920p.re + self.twiddle4.re * x1019p.re + self.twiddle13.re * x1118p.re + self.twiddle1.re * x1217p.re + self.twiddle11.re * x1316p.re + self.twiddle6.re * x1415p.re; let b1217re_b = self.twiddle12.im * x128n.im + -self.twiddle5.im * x227n.im + self.twiddle7.im * x326n.im + -self.twiddle10.im * x425n.im + self.twiddle2.im * x524n.im + self.twiddle14.im * x623n.im + -self.twiddle3.im * x722n.im + self.twiddle9.im * x821n.im + -self.twiddle8.im * x920n.im + self.twiddle4.im * x1019n.im + -self.twiddle13.im * x1118n.im + -self.twiddle1.im * x1217n.im + self.twiddle11.im * x1316n.im + -self.twiddle6.im * x1415n.im; let b1316re_a = buffer.load(0).re + self.twiddle13.re * x128p.re + self.twiddle3.re * x227p.re + self.twiddle10.re * x326p.re + self.twiddle6.re * x425p.re + self.twiddle7.re * x524p.re + self.twiddle9.re * x623p.re + self.twiddle4.re * x722p.re + self.twiddle12.re * x821p.re + self.twiddle1.re * x920p.re + self.twiddle14.re * x1019p.re + self.twiddle2.re * x1118p.re + self.twiddle11.re * x1217p.re + self.twiddle5.re * x1316p.re + self.twiddle8.re * x1415p.re; let b1316re_b = self.twiddle13.im * x128n.im + -self.twiddle3.im * x227n.im + self.twiddle10.im * x326n.im + -self.twiddle6.im * x425n.im + self.twiddle7.im * x524n.im + -self.twiddle9.im * x623n.im + self.twiddle4.im * x722n.im + -self.twiddle12.im * x821n.im + self.twiddle1.im * x920n.im + self.twiddle14.im * x1019n.im + -self.twiddle2.im * x1118n.im + self.twiddle11.im * x1217n.im + -self.twiddle5.im * x1316n.im + self.twiddle8.im * x1415n.im; let b1415re_a = buffer.load(0).re + self.twiddle14.re * x128p.re + self.twiddle1.re * x227p.re + self.twiddle13.re * x326p.re + self.twiddle2.re * x425p.re + self.twiddle12.re * x524p.re + self.twiddle3.re * x623p.re + self.twiddle11.re * x722p.re + self.twiddle4.re * x821p.re + self.twiddle10.re * x920p.re + self.twiddle5.re * x1019p.re + self.twiddle9.re * x1118p.re + self.twiddle6.re * x1217p.re + self.twiddle8.re * x1316p.re + self.twiddle7.re * x1415p.re; let b1415re_b = self.twiddle14.im * x128n.im + -self.twiddle1.im * x227n.im + self.twiddle13.im * x326n.im + -self.twiddle2.im * x425n.im + self.twiddle12.im * x524n.im + -self.twiddle3.im * x623n.im + self.twiddle11.im * x722n.im + -self.twiddle4.im * x821n.im + self.twiddle10.im * x920n.im + -self.twiddle5.im * x1019n.im + self.twiddle9.im * x1118n.im + -self.twiddle6.im * x1217n.im + self.twiddle8.im 
* x1316n.im + -self.twiddle7.im * x1415n.im; let b128im_a = buffer.load(0).im + self.twiddle1.re * x128p.im + self.twiddle2.re * x227p.im + self.twiddle3.re * x326p.im + self.twiddle4.re * x425p.im + self.twiddle5.re * x524p.im + self.twiddle6.re * x623p.im + self.twiddle7.re * x722p.im + self.twiddle8.re * x821p.im + self.twiddle9.re * x920p.im + self.twiddle10.re * x1019p.im + self.twiddle11.re * x1118p.im + self.twiddle12.re * x1217p.im + self.twiddle13.re * x1316p.im + self.twiddle14.re * x1415p.im; let b128im_b = self.twiddle1.im * x128n.re + self.twiddle2.im * x227n.re + self.twiddle3.im * x326n.re + self.twiddle4.im * x425n.re + self.twiddle5.im * x524n.re + self.twiddle6.im * x623n.re + self.twiddle7.im * x722n.re + self.twiddle8.im * x821n.re + self.twiddle9.im * x920n.re + self.twiddle10.im * x1019n.re + self.twiddle11.im * x1118n.re + self.twiddle12.im * x1217n.re + self.twiddle13.im * x1316n.re + self.twiddle14.im * x1415n.re; let b227im_a = buffer.load(0).im + self.twiddle2.re * x128p.im + self.twiddle4.re * x227p.im + self.twiddle6.re * x326p.im + self.twiddle8.re * x425p.im + self.twiddle10.re * x524p.im + self.twiddle12.re * x623p.im + self.twiddle14.re * x722p.im + self.twiddle13.re * x821p.im + self.twiddle11.re * x920p.im + self.twiddle9.re * x1019p.im + self.twiddle7.re * x1118p.im + self.twiddle5.re * x1217p.im + self.twiddle3.re * x1316p.im + self.twiddle1.re * x1415p.im; let b227im_b = self.twiddle2.im * x128n.re + self.twiddle4.im * x227n.re + self.twiddle6.im * x326n.re + self.twiddle8.im * x425n.re + self.twiddle10.im * x524n.re + self.twiddle12.im * x623n.re + self.twiddle14.im * x722n.re + -self.twiddle13.im * x821n.re + -self.twiddle11.im * x920n.re + -self.twiddle9.im * x1019n.re + -self.twiddle7.im * x1118n.re + -self.twiddle5.im * x1217n.re + -self.twiddle3.im * x1316n.re + -self.twiddle1.im * x1415n.re; let b326im_a = buffer.load(0).im + self.twiddle3.re * x128p.im + self.twiddle6.re * x227p.im + self.twiddle9.re * x326p.im + self.twiddle12.re * x425p.im + self.twiddle14.re * x524p.im + self.twiddle11.re * x623p.im + self.twiddle8.re * x722p.im + self.twiddle5.re * x821p.im + self.twiddle2.re * x920p.im + self.twiddle1.re * x1019p.im + self.twiddle4.re * x1118p.im + self.twiddle7.re * x1217p.im + self.twiddle10.re * x1316p.im + self.twiddle13.re * x1415p.im; let b326im_b = self.twiddle3.im * x128n.re + self.twiddle6.im * x227n.re + self.twiddle9.im * x326n.re + self.twiddle12.im * x425n.re + -self.twiddle14.im * x524n.re + -self.twiddle11.im * x623n.re + -self.twiddle8.im * x722n.re + -self.twiddle5.im * x821n.re + -self.twiddle2.im * x920n.re + self.twiddle1.im * x1019n.re + self.twiddle4.im * x1118n.re + self.twiddle7.im * x1217n.re + self.twiddle10.im * x1316n.re + self.twiddle13.im * x1415n.re; let b425im_a = buffer.load(0).im + self.twiddle4.re * x128p.im + self.twiddle8.re * x227p.im + self.twiddle12.re * x326p.im + self.twiddle13.re * x425p.im + self.twiddle9.re * x524p.im + self.twiddle5.re * x623p.im + self.twiddle1.re * x722p.im + self.twiddle3.re * x821p.im + self.twiddle7.re * x920p.im + self.twiddle11.re * x1019p.im + self.twiddle14.re * x1118p.im + self.twiddle10.re * x1217p.im + self.twiddle6.re * x1316p.im + self.twiddle2.re * x1415p.im; let b425im_b = self.twiddle4.im * x128n.re + self.twiddle8.im * x227n.re + self.twiddle12.im * x326n.re + -self.twiddle13.im * x425n.re + -self.twiddle9.im * x524n.re + -self.twiddle5.im * x623n.re + -self.twiddle1.im * x722n.re + self.twiddle3.im * x821n.re + self.twiddle7.im * x920n.re + self.twiddle11.im 
* x1019n.re + -self.twiddle14.im * x1118n.re + -self.twiddle10.im * x1217n.re + -self.twiddle6.im * x1316n.re + -self.twiddle2.im * x1415n.re; let b524im_a = buffer.load(0).im + self.twiddle5.re * x128p.im + self.twiddle10.re * x227p.im + self.twiddle14.re * x326p.im + self.twiddle9.re * x425p.im + self.twiddle4.re * x524p.im + self.twiddle1.re * x623p.im + self.twiddle6.re * x722p.im + self.twiddle11.re * x821p.im + self.twiddle13.re * x920p.im + self.twiddle8.re * x1019p.im + self.twiddle3.re * x1118p.im + self.twiddle2.re * x1217p.im + self.twiddle7.re * x1316p.im + self.twiddle12.re * x1415p.im; let b524im_b = self.twiddle5.im * x128n.re + self.twiddle10.im * x227n.re + -self.twiddle14.im * x326n.re + -self.twiddle9.im * x425n.re + -self.twiddle4.im * x524n.re + self.twiddle1.im * x623n.re + self.twiddle6.im * x722n.re + self.twiddle11.im * x821n.re + -self.twiddle13.im * x920n.re + -self.twiddle8.im * x1019n.re + -self.twiddle3.im * x1118n.re + self.twiddle2.im * x1217n.re + self.twiddle7.im * x1316n.re + self.twiddle12.im * x1415n.re; let b623im_a = buffer.load(0).im + self.twiddle6.re * x128p.im + self.twiddle12.re * x227p.im + self.twiddle11.re * x326p.im + self.twiddle5.re * x425p.im + self.twiddle1.re * x524p.im + self.twiddle7.re * x623p.im + self.twiddle13.re * x722p.im + self.twiddle10.re * x821p.im + self.twiddle4.re * x920p.im + self.twiddle2.re * x1019p.im + self.twiddle8.re * x1118p.im + self.twiddle14.re * x1217p.im + self.twiddle9.re * x1316p.im + self.twiddle3.re * x1415p.im; let b623im_b = self.twiddle6.im * x128n.re + self.twiddle12.im * x227n.re + -self.twiddle11.im * x326n.re + -self.twiddle5.im * x425n.re + self.twiddle1.im * x524n.re + self.twiddle7.im * x623n.re + self.twiddle13.im * x722n.re + -self.twiddle10.im * x821n.re + -self.twiddle4.im * x920n.re + self.twiddle2.im * x1019n.re + self.twiddle8.im * x1118n.re + self.twiddle14.im * x1217n.re + -self.twiddle9.im * x1316n.re + -self.twiddle3.im * x1415n.re; let b722im_a = buffer.load(0).im + self.twiddle7.re * x128p.im + self.twiddle14.re * x227p.im + self.twiddle8.re * x326p.im + self.twiddle1.re * x425p.im + self.twiddle6.re * x524p.im + self.twiddle13.re * x623p.im + self.twiddle9.re * x722p.im + self.twiddle2.re * x821p.im + self.twiddle5.re * x920p.im + self.twiddle12.re * x1019p.im + self.twiddle10.re * x1118p.im + self.twiddle3.re * x1217p.im + self.twiddle4.re * x1316p.im + self.twiddle11.re * x1415p.im; let b722im_b = self.twiddle7.im * x128n.re + self.twiddle14.im * x227n.re + -self.twiddle8.im * x326n.re + -self.twiddle1.im * x425n.re + self.twiddle6.im * x524n.re + self.twiddle13.im * x623n.re + -self.twiddle9.im * x722n.re + -self.twiddle2.im * x821n.re + self.twiddle5.im * x920n.re + self.twiddle12.im * x1019n.re + -self.twiddle10.im * x1118n.re + -self.twiddle3.im * x1217n.re + self.twiddle4.im * x1316n.re + self.twiddle11.im * x1415n.re; let b821im_a = buffer.load(0).im + self.twiddle8.re * x128p.im + self.twiddle13.re * x227p.im + self.twiddle5.re * x326p.im + self.twiddle3.re * x425p.im + self.twiddle11.re * x524p.im + self.twiddle10.re * x623p.im + self.twiddle2.re * x722p.im + self.twiddle6.re * x821p.im + self.twiddle14.re * x920p.im + self.twiddle7.re * x1019p.im + self.twiddle1.re * x1118p.im + self.twiddle9.re * x1217p.im + self.twiddle12.re * x1316p.im + self.twiddle4.re * x1415p.im; let b821im_b = self.twiddle8.im * x128n.re + -self.twiddle13.im * x227n.re + -self.twiddle5.im * x326n.re + self.twiddle3.im * x425n.re + self.twiddle11.im * x524n.re + -self.twiddle10.im * x623n.re + 
-self.twiddle2.im * x722n.re + self.twiddle6.im * x821n.re + self.twiddle14.im * x920n.re + -self.twiddle7.im * x1019n.re + self.twiddle1.im * x1118n.re + self.twiddle9.im * x1217n.re + -self.twiddle12.im * x1316n.re + -self.twiddle4.im * x1415n.re; let b920im_a = buffer.load(0).im + self.twiddle9.re * x128p.im + self.twiddle11.re * x227p.im + self.twiddle2.re * x326p.im + self.twiddle7.re * x425p.im + self.twiddle13.re * x524p.im + self.twiddle4.re * x623p.im + self.twiddle5.re * x722p.im + self.twiddle14.re * x821p.im + self.twiddle6.re * x920p.im + self.twiddle3.re * x1019p.im + self.twiddle12.re * x1118p.im + self.twiddle8.re * x1217p.im + self.twiddle1.re * x1316p.im + self.twiddle10.re * x1415p.im; let b920im_b = self.twiddle9.im * x128n.re + -self.twiddle11.im * x227n.re + -self.twiddle2.im * x326n.re + self.twiddle7.im * x425n.re + -self.twiddle13.im * x524n.re + -self.twiddle4.im * x623n.re + self.twiddle5.im * x722n.re + self.twiddle14.im * x821n.re + -self.twiddle6.im * x920n.re + self.twiddle3.im * x1019n.re + self.twiddle12.im * x1118n.re + -self.twiddle8.im * x1217n.re + self.twiddle1.im * x1316n.re + self.twiddle10.im * x1415n.re; let b1019im_a = buffer.load(0).im + self.twiddle10.re * x128p.im + self.twiddle9.re * x227p.im + self.twiddle1.re * x326p.im + self.twiddle11.re * x425p.im + self.twiddle8.re * x524p.im + self.twiddle2.re * x623p.im + self.twiddle12.re * x722p.im + self.twiddle7.re * x821p.im + self.twiddle3.re * x920p.im + self.twiddle13.re * x1019p.im + self.twiddle6.re * x1118p.im + self.twiddle4.re * x1217p.im + self.twiddle14.re * x1316p.im + self.twiddle5.re * x1415p.im; let b1019im_b = self.twiddle10.im * x128n.re + -self.twiddle9.im * x227n.re + self.twiddle1.im * x326n.re + self.twiddle11.im * x425n.re + -self.twiddle8.im * x524n.re + self.twiddle2.im * x623n.re + self.twiddle12.im * x722n.re + -self.twiddle7.im * x821n.re + self.twiddle3.im * x920n.re + self.twiddle13.im * x1019n.re + -self.twiddle6.im * x1118n.re + self.twiddle4.im * x1217n.re + self.twiddle14.im * x1316n.re + -self.twiddle5.im * x1415n.re; let b1118im_a = buffer.load(0).im + self.twiddle11.re * x128p.im + self.twiddle7.re * x227p.im + self.twiddle4.re * x326p.im + self.twiddle14.re * x425p.im + self.twiddle3.re * x524p.im + self.twiddle8.re * x623p.im + self.twiddle10.re * x722p.im + self.twiddle1.re * x821p.im + self.twiddle12.re * x920p.im + self.twiddle6.re * x1019p.im + self.twiddle5.re * x1118p.im + self.twiddle13.re * x1217p.im + self.twiddle2.re * x1316p.im + self.twiddle9.re * x1415p.im; let b1118im_b = self.twiddle11.im * x128n.re + -self.twiddle7.im * x227n.re + self.twiddle4.im * x326n.re + -self.twiddle14.im * x425n.re + -self.twiddle3.im * x524n.re + self.twiddle8.im * x623n.re + -self.twiddle10.im * x722n.re + self.twiddle1.im * x821n.re + self.twiddle12.im * x920n.re + -self.twiddle6.im * x1019n.re + self.twiddle5.im * x1118n.re + -self.twiddle13.im * x1217n.re + -self.twiddle2.im * x1316n.re + self.twiddle9.im * x1415n.re; let b1217im_a = buffer.load(0).im + self.twiddle12.re * x128p.im + self.twiddle5.re * x227p.im + self.twiddle7.re * x326p.im + self.twiddle10.re * x425p.im + self.twiddle2.re * x524p.im + self.twiddle14.re * x623p.im + self.twiddle3.re * x722p.im + self.twiddle9.re * x821p.im + self.twiddle8.re * x920p.im + self.twiddle4.re * x1019p.im + self.twiddle13.re * x1118p.im + self.twiddle1.re * x1217p.im + self.twiddle11.re * x1316p.im + self.twiddle6.re * x1415p.im; let b1217im_b = self.twiddle12.im * x128n.re + -self.twiddle5.im * x227n.re + 
self.twiddle7.im * x326n.re + -self.twiddle10.im * x425n.re + self.twiddle2.im * x524n.re + self.twiddle14.im * x623n.re + -self.twiddle3.im * x722n.re + self.twiddle9.im * x821n.re + -self.twiddle8.im * x920n.re + self.twiddle4.im * x1019n.re + -self.twiddle13.im * x1118n.re + -self.twiddle1.im * x1217n.re + self.twiddle11.im * x1316n.re + -self.twiddle6.im * x1415n.re; let b1316im_a = buffer.load(0).im + self.twiddle13.re * x128p.im + self.twiddle3.re * x227p.im + self.twiddle10.re * x326p.im + self.twiddle6.re * x425p.im + self.twiddle7.re * x524p.im + self.twiddle9.re * x623p.im + self.twiddle4.re * x722p.im + self.twiddle12.re * x821p.im + self.twiddle1.re * x920p.im + self.twiddle14.re * x1019p.im + self.twiddle2.re * x1118p.im + self.twiddle11.re * x1217p.im + self.twiddle5.re * x1316p.im + self.twiddle8.re * x1415p.im; let b1316im_b = self.twiddle13.im * x128n.re + -self.twiddle3.im * x227n.re + self.twiddle10.im * x326n.re + -self.twiddle6.im * x425n.re + self.twiddle7.im * x524n.re + -self.twiddle9.im * x623n.re + self.twiddle4.im * x722n.re + -self.twiddle12.im * x821n.re + self.twiddle1.im * x920n.re + self.twiddle14.im * x1019n.re + -self.twiddle2.im * x1118n.re + self.twiddle11.im * x1217n.re + -self.twiddle5.im * x1316n.re + self.twiddle8.im * x1415n.re; let b1415im_a = buffer.load(0).im + self.twiddle14.re * x128p.im + self.twiddle1.re * x227p.im + self.twiddle13.re * x326p.im + self.twiddle2.re * x425p.im + self.twiddle12.re * x524p.im + self.twiddle3.re * x623p.im + self.twiddle11.re * x722p.im + self.twiddle4.re * x821p.im + self.twiddle10.re * x920p.im + self.twiddle5.re * x1019p.im + self.twiddle9.re * x1118p.im + self.twiddle6.re * x1217p.im + self.twiddle8.re * x1316p.im + self.twiddle7.re * x1415p.im; let b1415im_b = self.twiddle14.im * x128n.re + -self.twiddle1.im * x227n.re + self.twiddle13.im * x326n.re + -self.twiddle2.im * x425n.re + self.twiddle12.im * x524n.re + -self.twiddle3.im * x623n.re + self.twiddle11.im * x722n.re + -self.twiddle4.im * x821n.re + self.twiddle10.im * x920n.re + -self.twiddle5.im * x1019n.re + self.twiddle9.im * x1118n.re + -self.twiddle6.im * x1217n.re + self.twiddle8.im * x1316n.re + -self.twiddle7.im * x1415n.re; let out1re = b128re_a - b128re_b; let out1im = b128im_a + b128im_b; let out2re = b227re_a - b227re_b; let out2im = b227im_a + b227im_b; let out3re = b326re_a - b326re_b; let out3im = b326im_a + b326im_b; let out4re = b425re_a - b425re_b; let out4im = b425im_a + b425im_b; let out5re = b524re_a - b524re_b; let out5im = b524im_a + b524im_b; let out6re = b623re_a - b623re_b; let out6im = b623im_a + b623im_b; let out7re = b722re_a - b722re_b; let out7im = b722im_a + b722im_b; let out8re = b821re_a - b821re_b; let out8im = b821im_a + b821im_b; let out9re = b920re_a - b920re_b; let out9im = b920im_a + b920im_b; let out10re = b1019re_a - b1019re_b; let out10im = b1019im_a + b1019im_b; let out11re = b1118re_a - b1118re_b; let out11im = b1118im_a + b1118im_b; let out12re = b1217re_a - b1217re_b; let out12im = b1217im_a + b1217im_b; let out13re = b1316re_a - b1316re_b; let out13im = b1316im_a + b1316im_b; let out14re = b1415re_a - b1415re_b; let out14im = b1415im_a + b1415im_b; let out15re = b1415re_a + b1415re_b; let out15im = b1415im_a - b1415im_b; let out16re = b1316re_a + b1316re_b; let out16im = b1316im_a - b1316im_b; let out17re = b1217re_a + b1217re_b; let out17im = b1217im_a - b1217im_b; let out18re = b1118re_a + b1118re_b; let out18im = b1118im_a - b1118im_b; let out19re = b1019re_a + b1019re_b; let out19im = b1019im_a - 
b1019im_b; let out20re = b920re_a + b920re_b; let out20im = b920im_a - b920im_b; let out21re = b821re_a + b821re_b; let out21im = b821im_a - b821im_b; let out22re = b722re_a + b722re_b; let out22im = b722im_a - b722im_b; let out23re = b623re_a + b623re_b; let out23im = b623im_a - b623im_b; let out24re = b524re_a + b524re_b; let out24im = b524im_a - b524im_b; let out25re = b425re_a + b425re_b; let out25im = b425im_a - b425im_b; let out26re = b326re_a + b326re_b; let out26im = b326im_a - b326im_b; let out27re = b227re_a + b227re_b; let out27im = b227im_a - b227im_b; let out28re = b128re_a + b128re_b; let out28im = b128im_a - b128im_b; buffer.store(sum, 0); buffer.store( Complex { re: out1re, im: out1im, }, 1, ); buffer.store( Complex { re: out2re, im: out2im, }, 2, ); buffer.store( Complex { re: out3re, im: out3im, }, 3, ); buffer.store( Complex { re: out4re, im: out4im, }, 4, ); buffer.store( Complex { re: out5re, im: out5im, }, 5, ); buffer.store( Complex { re: out6re, im: out6im, }, 6, ); buffer.store( Complex { re: out7re, im: out7im, }, 7, ); buffer.store( Complex { re: out8re, im: out8im, }, 8, ); buffer.store( Complex { re: out9re, im: out9im, }, 9, ); buffer.store( Complex { re: out10re, im: out10im, }, 10, ); buffer.store( Complex { re: out11re, im: out11im, }, 11, ); buffer.store( Complex { re: out12re, im: out12im, }, 12, ); buffer.store( Complex { re: out13re, im: out13im, }, 13, ); buffer.store( Complex { re: out14re, im: out14im, }, 14, ); buffer.store( Complex { re: out15re, im: out15im, }, 15, ); buffer.store( Complex { re: out16re, im: out16im, }, 16, ); buffer.store( Complex { re: out17re, im: out17im, }, 17, ); buffer.store( Complex { re: out18re, im: out18im, }, 18, ); buffer.store( Complex { re: out19re, im: out19im, }, 19, ); buffer.store( Complex { re: out20re, im: out20im, }, 20, ); buffer.store( Complex { re: out21re, im: out21im, }, 21, ); buffer.store( Complex { re: out22re, im: out22im, }, 22, ); buffer.store( Complex { re: out23re, im: out23im, }, 23, ); buffer.store( Complex { re: out24re, im: out24im, }, 24, ); buffer.store( Complex { re: out25re, im: out25im, }, 25, ); buffer.store( Complex { re: out26re, im: out26im, }, 26, ); buffer.store( Complex { re: out27re, im: out27im, }, 27, ); buffer.store( Complex { re: out28re, im: out28im, }, 28, ); } } pub struct Butterfly31 { twiddle1: Complex, twiddle2: Complex, twiddle3: Complex, twiddle4: Complex, twiddle5: Complex, twiddle6: Complex, twiddle7: Complex, twiddle8: Complex, twiddle9: Complex, twiddle10: Complex, twiddle11: Complex, twiddle12: Complex, twiddle13: Complex, twiddle14: Complex, twiddle15: Complex, direction: FftDirection, } boilerplate_fft_butterfly!(Butterfly31, 31, |this: &Butterfly31<_>| this.direction); impl Butterfly31 { pub fn new(direction: FftDirection) -> Self { let twiddle1: Complex = twiddles::compute_twiddle(1, 31, direction); let twiddle2: Complex = twiddles::compute_twiddle(2, 31, direction); let twiddle3: Complex = twiddles::compute_twiddle(3, 31, direction); let twiddle4: Complex = twiddles::compute_twiddle(4, 31, direction); let twiddle5: Complex = twiddles::compute_twiddle(5, 31, direction); let twiddle6: Complex = twiddles::compute_twiddle(6, 31, direction); let twiddle7: Complex = twiddles::compute_twiddle(7, 31, direction); let twiddle8: Complex = twiddles::compute_twiddle(8, 31, direction); let twiddle9: Complex = twiddles::compute_twiddle(9, 31, direction); let twiddle10: Complex = twiddles::compute_twiddle(10, 31, direction); let twiddle11: Complex = 
twiddles::compute_twiddle(11, 31, direction); let twiddle12: Complex = twiddles::compute_twiddle(12, 31, direction); let twiddle13: Complex = twiddles::compute_twiddle(13, 31, direction); let twiddle14: Complex = twiddles::compute_twiddle(14, 31, direction); let twiddle15: Complex = twiddles::compute_twiddle(15, 31, direction); Self { twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, twiddle6, twiddle7, twiddle8, twiddle9, twiddle10, twiddle11, twiddle12, twiddle13, twiddle14, twiddle15, direction, } } #[inline(never)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { // This function was derived in the same manner as the butterflies for length 3, 5 and 7. // However, instead of doing it by hand the actual code is autogenerated // with the `genbutterflies.py` script in the `tools` directory. let x130p = buffer.load(1) + buffer.load(30); let x130n = buffer.load(1) - buffer.load(30); let x229p = buffer.load(2) + buffer.load(29); let x229n = buffer.load(2) - buffer.load(29); let x328p = buffer.load(3) + buffer.load(28); let x328n = buffer.load(3) - buffer.load(28); let x427p = buffer.load(4) + buffer.load(27); let x427n = buffer.load(4) - buffer.load(27); let x526p = buffer.load(5) + buffer.load(26); let x526n = buffer.load(5) - buffer.load(26); let x625p = buffer.load(6) + buffer.load(25); let x625n = buffer.load(6) - buffer.load(25); let x724p = buffer.load(7) + buffer.load(24); let x724n = buffer.load(7) - buffer.load(24); let x823p = buffer.load(8) + buffer.load(23); let x823n = buffer.load(8) - buffer.load(23); let x922p = buffer.load(9) + buffer.load(22); let x922n = buffer.load(9) - buffer.load(22); let x1021p = buffer.load(10) + buffer.load(21); let x1021n = buffer.load(10) - buffer.load(21); let x1120p = buffer.load(11) + buffer.load(20); let x1120n = buffer.load(11) - buffer.load(20); let x1219p = buffer.load(12) + buffer.load(19); let x1219n = buffer.load(12) - buffer.load(19); let x1318p = buffer.load(13) + buffer.load(18); let x1318n = buffer.load(13) - buffer.load(18); let x1417p = buffer.load(14) + buffer.load(17); let x1417n = buffer.load(14) - buffer.load(17); let x1516p = buffer.load(15) + buffer.load(16); let x1516n = buffer.load(15) - buffer.load(16); let sum = buffer.load(0) + x130p + x229p + x328p + x427p + x526p + x625p + x724p + x823p + x922p + x1021p + x1120p + x1219p + x1318p + x1417p + x1516p; let b130re_a = buffer.load(0).re + self.twiddle1.re * x130p.re + self.twiddle2.re * x229p.re + self.twiddle3.re * x328p.re + self.twiddle4.re * x427p.re + self.twiddle5.re * x526p.re + self.twiddle6.re * x625p.re + self.twiddle7.re * x724p.re + self.twiddle8.re * x823p.re + self.twiddle9.re * x922p.re + self.twiddle10.re * x1021p.re + self.twiddle11.re * x1120p.re + self.twiddle12.re * x1219p.re + self.twiddle13.re * x1318p.re + self.twiddle14.re * x1417p.re + self.twiddle15.re * x1516p.re; let b130re_b = self.twiddle1.im * x130n.im + self.twiddle2.im * x229n.im + self.twiddle3.im * x328n.im + self.twiddle4.im * x427n.im + self.twiddle5.im * x526n.im + self.twiddle6.im * x625n.im + self.twiddle7.im * x724n.im + self.twiddle8.im * x823n.im + self.twiddle9.im * x922n.im + self.twiddle10.im * x1021n.im + self.twiddle11.im * x1120n.im + self.twiddle12.im * x1219n.im + self.twiddle13.im * x1318n.im + self.twiddle14.im * x1417n.im + self.twiddle15.im * x1516n.im; let b229re_a = buffer.load(0).re + self.twiddle2.re * x130p.re + self.twiddle4.re * x229p.re + self.twiddle6.re * x328p.re + self.twiddle8.re * x427p.re + self.twiddle10.re * x526p.re + self.twiddle12.re 
* x625p.re + self.twiddle14.re * x724p.re + self.twiddle15.re * x823p.re + self.twiddle13.re * x922p.re + self.twiddle11.re * x1021p.re + self.twiddle9.re * x1120p.re + self.twiddle7.re * x1219p.re + self.twiddle5.re * x1318p.re + self.twiddle3.re * x1417p.re + self.twiddle1.re * x1516p.re; let b229re_b = self.twiddle2.im * x130n.im + self.twiddle4.im * x229n.im + self.twiddle6.im * x328n.im + self.twiddle8.im * x427n.im + self.twiddle10.im * x526n.im + self.twiddle12.im * x625n.im + self.twiddle14.im * x724n.im + -self.twiddle15.im * x823n.im + -self.twiddle13.im * x922n.im + -self.twiddle11.im * x1021n.im + -self.twiddle9.im * x1120n.im + -self.twiddle7.im * x1219n.im + -self.twiddle5.im * x1318n.im + -self.twiddle3.im * x1417n.im + -self.twiddle1.im * x1516n.im; let b328re_a = buffer.load(0).re + self.twiddle3.re * x130p.re + self.twiddle6.re * x229p.re + self.twiddle9.re * x328p.re + self.twiddle12.re * x427p.re + self.twiddle15.re * x526p.re + self.twiddle13.re * x625p.re + self.twiddle10.re * x724p.re + self.twiddle7.re * x823p.re + self.twiddle4.re * x922p.re + self.twiddle1.re * x1021p.re + self.twiddle2.re * x1120p.re + self.twiddle5.re * x1219p.re + self.twiddle8.re * x1318p.re + self.twiddle11.re * x1417p.re + self.twiddle14.re * x1516p.re; let b328re_b = self.twiddle3.im * x130n.im + self.twiddle6.im * x229n.im + self.twiddle9.im * x328n.im + self.twiddle12.im * x427n.im + self.twiddle15.im * x526n.im + -self.twiddle13.im * x625n.im + -self.twiddle10.im * x724n.im + -self.twiddle7.im * x823n.im + -self.twiddle4.im * x922n.im + -self.twiddle1.im * x1021n.im + self.twiddle2.im * x1120n.im + self.twiddle5.im * x1219n.im + self.twiddle8.im * x1318n.im + self.twiddle11.im * x1417n.im + self.twiddle14.im * x1516n.im; let b427re_a = buffer.load(0).re + self.twiddle4.re * x130p.re + self.twiddle8.re * x229p.re + self.twiddle12.re * x328p.re + self.twiddle15.re * x427p.re + self.twiddle11.re * x526p.re + self.twiddle7.re * x625p.re + self.twiddle3.re * x724p.re + self.twiddle1.re * x823p.re + self.twiddle5.re * x922p.re + self.twiddle9.re * x1021p.re + self.twiddle13.re * x1120p.re + self.twiddle14.re * x1219p.re + self.twiddle10.re * x1318p.re + self.twiddle6.re * x1417p.re + self.twiddle2.re * x1516p.re; let b427re_b = self.twiddle4.im * x130n.im + self.twiddle8.im * x229n.im + self.twiddle12.im * x328n.im + -self.twiddle15.im * x427n.im + -self.twiddle11.im * x526n.im + -self.twiddle7.im * x625n.im + -self.twiddle3.im * x724n.im + self.twiddle1.im * x823n.im + self.twiddle5.im * x922n.im + self.twiddle9.im * x1021n.im + self.twiddle13.im * x1120n.im + -self.twiddle14.im * x1219n.im + -self.twiddle10.im * x1318n.im + -self.twiddle6.im * x1417n.im + -self.twiddle2.im * x1516n.im; let b526re_a = buffer.load(0).re + self.twiddle5.re * x130p.re + self.twiddle10.re * x229p.re + self.twiddle15.re * x328p.re + self.twiddle11.re * x427p.re + self.twiddle6.re * x526p.re + self.twiddle1.re * x625p.re + self.twiddle4.re * x724p.re + self.twiddle9.re * x823p.re + self.twiddle14.re * x922p.re + self.twiddle12.re * x1021p.re + self.twiddle7.re * x1120p.re + self.twiddle2.re * x1219p.re + self.twiddle3.re * x1318p.re + self.twiddle8.re * x1417p.re + self.twiddle13.re * x1516p.re; let b526re_b = self.twiddle5.im * x130n.im + self.twiddle10.im * x229n.im + self.twiddle15.im * x328n.im + -self.twiddle11.im * x427n.im + -self.twiddle6.im * x526n.im + -self.twiddle1.im * x625n.im + self.twiddle4.im * x724n.im + self.twiddle9.im * x823n.im + self.twiddle14.im * x922n.im + -self.twiddle12.im * x1021n.im + 
-self.twiddle7.im * x1120n.im + -self.twiddle2.im * x1219n.im + self.twiddle3.im * x1318n.im + self.twiddle8.im * x1417n.im + self.twiddle13.im * x1516n.im; let b625re_a = buffer.load(0).re + self.twiddle6.re * x130p.re + self.twiddle12.re * x229p.re + self.twiddle13.re * x328p.re + self.twiddle7.re * x427p.re + self.twiddle1.re * x526p.re + self.twiddle5.re * x625p.re + self.twiddle11.re * x724p.re + self.twiddle14.re * x823p.re + self.twiddle8.re * x922p.re + self.twiddle2.re * x1021p.re + self.twiddle4.re * x1120p.re + self.twiddle10.re * x1219p.re + self.twiddle15.re * x1318p.re + self.twiddle9.re * x1417p.re + self.twiddle3.re * x1516p.re; let b625re_b = self.twiddle6.im * x130n.im + self.twiddle12.im * x229n.im + -self.twiddle13.im * x328n.im + -self.twiddle7.im * x427n.im + -self.twiddle1.im * x526n.im + self.twiddle5.im * x625n.im + self.twiddle11.im * x724n.im + -self.twiddle14.im * x823n.im + -self.twiddle8.im * x922n.im + -self.twiddle2.im * x1021n.im + self.twiddle4.im * x1120n.im + self.twiddle10.im * x1219n.im + -self.twiddle15.im * x1318n.im + -self.twiddle9.im * x1417n.im + -self.twiddle3.im * x1516n.im; let b724re_a = buffer.load(0).re + self.twiddle7.re * x130p.re + self.twiddle14.re * x229p.re + self.twiddle10.re * x328p.re + self.twiddle3.re * x427p.re + self.twiddle4.re * x526p.re + self.twiddle11.re * x625p.re + self.twiddle13.re * x724p.re + self.twiddle6.re * x823p.re + self.twiddle1.re * x922p.re + self.twiddle8.re * x1021p.re + self.twiddle15.re * x1120p.re + self.twiddle9.re * x1219p.re + self.twiddle2.re * x1318p.re + self.twiddle5.re * x1417p.re + self.twiddle12.re * x1516p.re; let b724re_b = self.twiddle7.im * x130n.im + self.twiddle14.im * x229n.im + -self.twiddle10.im * x328n.im + -self.twiddle3.im * x427n.im + self.twiddle4.im * x526n.im + self.twiddle11.im * x625n.im + -self.twiddle13.im * x724n.im + -self.twiddle6.im * x823n.im + self.twiddle1.im * x922n.im + self.twiddle8.im * x1021n.im + self.twiddle15.im * x1120n.im + -self.twiddle9.im * x1219n.im + -self.twiddle2.im * x1318n.im + self.twiddle5.im * x1417n.im + self.twiddle12.im * x1516n.im; let b823re_a = buffer.load(0).re + self.twiddle8.re * x130p.re + self.twiddle15.re * x229p.re + self.twiddle7.re * x328p.re + self.twiddle1.re * x427p.re + self.twiddle9.re * x526p.re + self.twiddle14.re * x625p.re + self.twiddle6.re * x724p.re + self.twiddle2.re * x823p.re + self.twiddle10.re * x922p.re + self.twiddle13.re * x1021p.re + self.twiddle5.re * x1120p.re + self.twiddle3.re * x1219p.re + self.twiddle11.re * x1318p.re + self.twiddle12.re * x1417p.re + self.twiddle4.re * x1516p.re; let b823re_b = self.twiddle8.im * x130n.im + -self.twiddle15.im * x229n.im + -self.twiddle7.im * x328n.im + self.twiddle1.im * x427n.im + self.twiddle9.im * x526n.im + -self.twiddle14.im * x625n.im + -self.twiddle6.im * x724n.im + self.twiddle2.im * x823n.im + self.twiddle10.im * x922n.im + -self.twiddle13.im * x1021n.im + -self.twiddle5.im * x1120n.im + self.twiddle3.im * x1219n.im + self.twiddle11.im * x1318n.im + -self.twiddle12.im * x1417n.im + -self.twiddle4.im * x1516n.im; let b922re_a = buffer.load(0).re + self.twiddle9.re * x130p.re + self.twiddle13.re * x229p.re + self.twiddle4.re * x328p.re + self.twiddle5.re * x427p.re + self.twiddle14.re * x526p.re + self.twiddle8.re * x625p.re + self.twiddle1.re * x724p.re + self.twiddle10.re * x823p.re + self.twiddle12.re * x922p.re + self.twiddle3.re * x1021p.re + self.twiddle6.re * x1120p.re + self.twiddle15.re * x1219p.re + self.twiddle7.re * x1318p.re + self.twiddle2.re * 
x1417p.re + self.twiddle11.re * x1516p.re; let b922re_b = self.twiddle9.im * x130n.im + -self.twiddle13.im * x229n.im + -self.twiddle4.im * x328n.im + self.twiddle5.im * x427n.im + self.twiddle14.im * x526n.im + -self.twiddle8.im * x625n.im + self.twiddle1.im * x724n.im + self.twiddle10.im * x823n.im + -self.twiddle12.im * x922n.im + -self.twiddle3.im * x1021n.im + self.twiddle6.im * x1120n.im + self.twiddle15.im * x1219n.im + -self.twiddle7.im * x1318n.im + self.twiddle2.im * x1417n.im + self.twiddle11.im * x1516n.im; let b1021re_a = buffer.load(0).re + self.twiddle10.re * x130p.re + self.twiddle11.re * x229p.re + self.twiddle1.re * x328p.re + self.twiddle9.re * x427p.re + self.twiddle12.re * x526p.re + self.twiddle2.re * x625p.re + self.twiddle8.re * x724p.re + self.twiddle13.re * x823p.re + self.twiddle3.re * x922p.re + self.twiddle7.re * x1021p.re + self.twiddle14.re * x1120p.re + self.twiddle4.re * x1219p.re + self.twiddle6.re * x1318p.re + self.twiddle15.re * x1417p.re + self.twiddle5.re * x1516p.re; let b1021re_b = self.twiddle10.im * x130n.im + -self.twiddle11.im * x229n.im + -self.twiddle1.im * x328n.im + self.twiddle9.im * x427n.im + -self.twiddle12.im * x526n.im + -self.twiddle2.im * x625n.im + self.twiddle8.im * x724n.im + -self.twiddle13.im * x823n.im + -self.twiddle3.im * x922n.im + self.twiddle7.im * x1021n.im + -self.twiddle14.im * x1120n.im + -self.twiddle4.im * x1219n.im + self.twiddle6.im * x1318n.im + -self.twiddle15.im * x1417n.im + -self.twiddle5.im * x1516n.im; let b1120re_a = buffer.load(0).re + self.twiddle11.re * x130p.re + self.twiddle9.re * x229p.re + self.twiddle2.re * x328p.re + self.twiddle13.re * x427p.re + self.twiddle7.re * x526p.re + self.twiddle4.re * x625p.re + self.twiddle15.re * x724p.re + self.twiddle5.re * x823p.re + self.twiddle6.re * x922p.re + self.twiddle14.re * x1021p.re + self.twiddle3.re * x1120p.re + self.twiddle8.re * x1219p.re + self.twiddle12.re * x1318p.re + self.twiddle1.re * x1417p.re + self.twiddle10.re * x1516p.re; let b1120re_b = self.twiddle11.im * x130n.im + -self.twiddle9.im * x229n.im + self.twiddle2.im * x328n.im + self.twiddle13.im * x427n.im + -self.twiddle7.im * x526n.im + self.twiddle4.im * x625n.im + self.twiddle15.im * x724n.im + -self.twiddle5.im * x823n.im + self.twiddle6.im * x922n.im + -self.twiddle14.im * x1021n.im + -self.twiddle3.im * x1120n.im + self.twiddle8.im * x1219n.im + -self.twiddle12.im * x1318n.im + -self.twiddle1.im * x1417n.im + self.twiddle10.im * x1516n.im; let b1219re_a = buffer.load(0).re + self.twiddle12.re * x130p.re + self.twiddle7.re * x229p.re + self.twiddle5.re * x328p.re + self.twiddle14.re * x427p.re + self.twiddle2.re * x526p.re + self.twiddle10.re * x625p.re + self.twiddle9.re * x724p.re + self.twiddle3.re * x823p.re + self.twiddle15.re * x922p.re + self.twiddle4.re * x1021p.re + self.twiddle8.re * x1120p.re + self.twiddle11.re * x1219p.re + self.twiddle1.re * x1318p.re + self.twiddle13.re * x1417p.re + self.twiddle6.re * x1516p.re; let b1219re_b = self.twiddle12.im * x130n.im + -self.twiddle7.im * x229n.im + self.twiddle5.im * x328n.im + -self.twiddle14.im * x427n.im + -self.twiddle2.im * x526n.im + self.twiddle10.im * x625n.im + -self.twiddle9.im * x724n.im + self.twiddle3.im * x823n.im + self.twiddle15.im * x922n.im + -self.twiddle4.im * x1021n.im + self.twiddle8.im * x1120n.im + -self.twiddle11.im * x1219n.im + self.twiddle1.im * x1318n.im + self.twiddle13.im * x1417n.im + -self.twiddle6.im * x1516n.im; let b1318re_a = buffer.load(0).re + self.twiddle13.re * x130p.re + self.twiddle5.re 
* x229p.re + self.twiddle8.re * x328p.re + self.twiddle10.re * x427p.re + self.twiddle3.re * x526p.re + self.twiddle15.re * x625p.re + self.twiddle2.re * x724p.re + self.twiddle11.re * x823p.re + self.twiddle7.re * x922p.re + self.twiddle6.re * x1021p.re + self.twiddle12.re * x1120p.re + self.twiddle1.re * x1219p.re + self.twiddle14.re * x1318p.re + self.twiddle4.re * x1417p.re + self.twiddle9.re * x1516p.re; let b1318re_b = self.twiddle13.im * x130n.im + -self.twiddle5.im * x229n.im + self.twiddle8.im * x328n.im + -self.twiddle10.im * x427n.im + self.twiddle3.im * x526n.im + -self.twiddle15.im * x625n.im + -self.twiddle2.im * x724n.im + self.twiddle11.im * x823n.im + -self.twiddle7.im * x922n.im + self.twiddle6.im * x1021n.im + -self.twiddle12.im * x1120n.im + self.twiddle1.im * x1219n.im + self.twiddle14.im * x1318n.im + -self.twiddle4.im * x1417n.im + self.twiddle9.im * x1516n.im; let b1417re_a = buffer.load(0).re + self.twiddle14.re * x130p.re + self.twiddle3.re * x229p.re + self.twiddle11.re * x328p.re + self.twiddle6.re * x427p.re + self.twiddle8.re * x526p.re + self.twiddle9.re * x625p.re + self.twiddle5.re * x724p.re + self.twiddle12.re * x823p.re + self.twiddle2.re * x922p.re + self.twiddle15.re * x1021p.re + self.twiddle1.re * x1120p.re + self.twiddle13.re * x1219p.re + self.twiddle4.re * x1318p.re + self.twiddle10.re * x1417p.re + self.twiddle7.re * x1516p.re; let b1417re_b = self.twiddle14.im * x130n.im + -self.twiddle3.im * x229n.im + self.twiddle11.im * x328n.im + -self.twiddle6.im * x427n.im + self.twiddle8.im * x526n.im + -self.twiddle9.im * x625n.im + self.twiddle5.im * x724n.im + -self.twiddle12.im * x823n.im + self.twiddle2.im * x922n.im + -self.twiddle15.im * x1021n.im + -self.twiddle1.im * x1120n.im + self.twiddle13.im * x1219n.im + -self.twiddle4.im * x1318n.im + self.twiddle10.im * x1417n.im + -self.twiddle7.im * x1516n.im; let b1516re_a = buffer.load(0).re + self.twiddle15.re * x130p.re + self.twiddle1.re * x229p.re + self.twiddle14.re * x328p.re + self.twiddle2.re * x427p.re + self.twiddle13.re * x526p.re + self.twiddle3.re * x625p.re + self.twiddle12.re * x724p.re + self.twiddle4.re * x823p.re + self.twiddle11.re * x922p.re + self.twiddle5.re * x1021p.re + self.twiddle10.re * x1120p.re + self.twiddle6.re * x1219p.re + self.twiddle9.re * x1318p.re + self.twiddle7.re * x1417p.re + self.twiddle8.re * x1516p.re; let b1516re_b = self.twiddle15.im * x130n.im + -self.twiddle1.im * x229n.im + self.twiddle14.im * x328n.im + -self.twiddle2.im * x427n.im + self.twiddle13.im * x526n.im + -self.twiddle3.im * x625n.im + self.twiddle12.im * x724n.im + -self.twiddle4.im * x823n.im + self.twiddle11.im * x922n.im + -self.twiddle5.im * x1021n.im + self.twiddle10.im * x1120n.im + -self.twiddle6.im * x1219n.im + self.twiddle9.im * x1318n.im + -self.twiddle7.im * x1417n.im + self.twiddle8.im * x1516n.im; let b130im_a = buffer.load(0).im + self.twiddle1.re * x130p.im + self.twiddle2.re * x229p.im + self.twiddle3.re * x328p.im + self.twiddle4.re * x427p.im + self.twiddle5.re * x526p.im + self.twiddle6.re * x625p.im + self.twiddle7.re * x724p.im + self.twiddle8.re * x823p.im + self.twiddle9.re * x922p.im + self.twiddle10.re * x1021p.im + self.twiddle11.re * x1120p.im + self.twiddle12.re * x1219p.im + self.twiddle13.re * x1318p.im + self.twiddle14.re * x1417p.im + self.twiddle15.re * x1516p.im; let b130im_b = self.twiddle1.im * x130n.re + self.twiddle2.im * x229n.re + self.twiddle3.im * x328n.re + self.twiddle4.im * x427n.re + self.twiddle5.im * x526n.re + self.twiddle6.im * x625n.re + 
self.twiddle7.im * x724n.re + self.twiddle8.im * x823n.re + self.twiddle9.im * x922n.re + self.twiddle10.im * x1021n.re + self.twiddle11.im * x1120n.re + self.twiddle12.im * x1219n.re + self.twiddle13.im * x1318n.re + self.twiddle14.im * x1417n.re + self.twiddle15.im * x1516n.re; let b229im_a = buffer.load(0).im + self.twiddle2.re * x130p.im + self.twiddle4.re * x229p.im + self.twiddle6.re * x328p.im + self.twiddle8.re * x427p.im + self.twiddle10.re * x526p.im + self.twiddle12.re * x625p.im + self.twiddle14.re * x724p.im + self.twiddle15.re * x823p.im + self.twiddle13.re * x922p.im + self.twiddle11.re * x1021p.im + self.twiddle9.re * x1120p.im + self.twiddle7.re * x1219p.im + self.twiddle5.re * x1318p.im + self.twiddle3.re * x1417p.im + self.twiddle1.re * x1516p.im; let b229im_b = self.twiddle2.im * x130n.re + self.twiddle4.im * x229n.re + self.twiddle6.im * x328n.re + self.twiddle8.im * x427n.re + self.twiddle10.im * x526n.re + self.twiddle12.im * x625n.re + self.twiddle14.im * x724n.re + -self.twiddle15.im * x823n.re + -self.twiddle13.im * x922n.re + -self.twiddle11.im * x1021n.re + -self.twiddle9.im * x1120n.re + -self.twiddle7.im * x1219n.re + -self.twiddle5.im * x1318n.re + -self.twiddle3.im * x1417n.re + -self.twiddle1.im * x1516n.re; let b328im_a = buffer.load(0).im + self.twiddle3.re * x130p.im + self.twiddle6.re * x229p.im + self.twiddle9.re * x328p.im + self.twiddle12.re * x427p.im + self.twiddle15.re * x526p.im + self.twiddle13.re * x625p.im + self.twiddle10.re * x724p.im + self.twiddle7.re * x823p.im + self.twiddle4.re * x922p.im + self.twiddle1.re * x1021p.im + self.twiddle2.re * x1120p.im + self.twiddle5.re * x1219p.im + self.twiddle8.re * x1318p.im + self.twiddle11.re * x1417p.im + self.twiddle14.re * x1516p.im; let b328im_b = self.twiddle3.im * x130n.re + self.twiddle6.im * x229n.re + self.twiddle9.im * x328n.re + self.twiddle12.im * x427n.re + self.twiddle15.im * x526n.re + -self.twiddle13.im * x625n.re + -self.twiddle10.im * x724n.re + -self.twiddle7.im * x823n.re + -self.twiddle4.im * x922n.re + -self.twiddle1.im * x1021n.re + self.twiddle2.im * x1120n.re + self.twiddle5.im * x1219n.re + self.twiddle8.im * x1318n.re + self.twiddle11.im * x1417n.re + self.twiddle14.im * x1516n.re; let b427im_a = buffer.load(0).im + self.twiddle4.re * x130p.im + self.twiddle8.re * x229p.im + self.twiddle12.re * x328p.im + self.twiddle15.re * x427p.im + self.twiddle11.re * x526p.im + self.twiddle7.re * x625p.im + self.twiddle3.re * x724p.im + self.twiddle1.re * x823p.im + self.twiddle5.re * x922p.im + self.twiddle9.re * x1021p.im + self.twiddle13.re * x1120p.im + self.twiddle14.re * x1219p.im + self.twiddle10.re * x1318p.im + self.twiddle6.re * x1417p.im + self.twiddle2.re * x1516p.im; let b427im_b = self.twiddle4.im * x130n.re + self.twiddle8.im * x229n.re + self.twiddle12.im * x328n.re + -self.twiddle15.im * x427n.re + -self.twiddle11.im * x526n.re + -self.twiddle7.im * x625n.re + -self.twiddle3.im * x724n.re + self.twiddle1.im * x823n.re + self.twiddle5.im * x922n.re + self.twiddle9.im * x1021n.re + self.twiddle13.im * x1120n.re + -self.twiddle14.im * x1219n.re + -self.twiddle10.im * x1318n.re + -self.twiddle6.im * x1417n.re + -self.twiddle2.im * x1516n.re; let b526im_a = buffer.load(0).im + self.twiddle5.re * x130p.im + self.twiddle10.re * x229p.im + self.twiddle15.re * x328p.im + self.twiddle11.re * x427p.im + self.twiddle6.re * x526p.im + self.twiddle1.re * x625p.im + self.twiddle4.re * x724p.im + self.twiddle9.re * x823p.im + self.twiddle14.re * x922p.im + self.twiddle12.re * 
x1021p.im + self.twiddle7.re * x1120p.im + self.twiddle2.re * x1219p.im + self.twiddle3.re * x1318p.im + self.twiddle8.re * x1417p.im + self.twiddle13.re * x1516p.im; let b526im_b = self.twiddle5.im * x130n.re + self.twiddle10.im * x229n.re + self.twiddle15.im * x328n.re + -self.twiddle11.im * x427n.re + -self.twiddle6.im * x526n.re + -self.twiddle1.im * x625n.re + self.twiddle4.im * x724n.re + self.twiddle9.im * x823n.re + self.twiddle14.im * x922n.re + -self.twiddle12.im * x1021n.re + -self.twiddle7.im * x1120n.re + -self.twiddle2.im * x1219n.re + self.twiddle3.im * x1318n.re + self.twiddle8.im * x1417n.re + self.twiddle13.im * x1516n.re; let b625im_a = buffer.load(0).im + self.twiddle6.re * x130p.im + self.twiddle12.re * x229p.im + self.twiddle13.re * x328p.im + self.twiddle7.re * x427p.im + self.twiddle1.re * x526p.im + self.twiddle5.re * x625p.im + self.twiddle11.re * x724p.im + self.twiddle14.re * x823p.im + self.twiddle8.re * x922p.im + self.twiddle2.re * x1021p.im + self.twiddle4.re * x1120p.im + self.twiddle10.re * x1219p.im + self.twiddle15.re * x1318p.im + self.twiddle9.re * x1417p.im + self.twiddle3.re * x1516p.im; let b625im_b = self.twiddle6.im * x130n.re + self.twiddle12.im * x229n.re + -self.twiddle13.im * x328n.re + -self.twiddle7.im * x427n.re + -self.twiddle1.im * x526n.re + self.twiddle5.im * x625n.re + self.twiddle11.im * x724n.re + -self.twiddle14.im * x823n.re + -self.twiddle8.im * x922n.re + -self.twiddle2.im * x1021n.re + self.twiddle4.im * x1120n.re + self.twiddle10.im * x1219n.re + -self.twiddle15.im * x1318n.re + -self.twiddle9.im * x1417n.re + -self.twiddle3.im * x1516n.re; let b724im_a = buffer.load(0).im + self.twiddle7.re * x130p.im + self.twiddle14.re * x229p.im + self.twiddle10.re * x328p.im + self.twiddle3.re * x427p.im + self.twiddle4.re * x526p.im + self.twiddle11.re * x625p.im + self.twiddle13.re * x724p.im + self.twiddle6.re * x823p.im + self.twiddle1.re * x922p.im + self.twiddle8.re * x1021p.im + self.twiddle15.re * x1120p.im + self.twiddle9.re * x1219p.im + self.twiddle2.re * x1318p.im + self.twiddle5.re * x1417p.im + self.twiddle12.re * x1516p.im; let b724im_b = self.twiddle7.im * x130n.re + self.twiddle14.im * x229n.re + -self.twiddle10.im * x328n.re + -self.twiddle3.im * x427n.re + self.twiddle4.im * x526n.re + self.twiddle11.im * x625n.re + -self.twiddle13.im * x724n.re + -self.twiddle6.im * x823n.re + self.twiddle1.im * x922n.re + self.twiddle8.im * x1021n.re + self.twiddle15.im * x1120n.re + -self.twiddle9.im * x1219n.re + -self.twiddle2.im * x1318n.re + self.twiddle5.im * x1417n.re + self.twiddle12.im * x1516n.re; let b823im_a = buffer.load(0).im + self.twiddle8.re * x130p.im + self.twiddle15.re * x229p.im + self.twiddle7.re * x328p.im + self.twiddle1.re * x427p.im + self.twiddle9.re * x526p.im + self.twiddle14.re * x625p.im + self.twiddle6.re * x724p.im + self.twiddle2.re * x823p.im + self.twiddle10.re * x922p.im + self.twiddle13.re * x1021p.im + self.twiddle5.re * x1120p.im + self.twiddle3.re * x1219p.im + self.twiddle11.re * x1318p.im + self.twiddle12.re * x1417p.im + self.twiddle4.re * x1516p.im; let b823im_b = self.twiddle8.im * x130n.re + -self.twiddle15.im * x229n.re + -self.twiddle7.im * x328n.re + self.twiddle1.im * x427n.re + self.twiddle9.im * x526n.re + -self.twiddle14.im * x625n.re + -self.twiddle6.im * x724n.re + self.twiddle2.im * x823n.re + self.twiddle10.im * x922n.re + -self.twiddle13.im * x1021n.re + -self.twiddle5.im * x1120n.re + self.twiddle3.im * x1219n.re + self.twiddle11.im * x1318n.re + -self.twiddle12.im * x1417n.re 
+ -self.twiddle4.im * x1516n.re; let b922im_a = buffer.load(0).im + self.twiddle9.re * x130p.im + self.twiddle13.re * x229p.im + self.twiddle4.re * x328p.im + self.twiddle5.re * x427p.im + self.twiddle14.re * x526p.im + self.twiddle8.re * x625p.im + self.twiddle1.re * x724p.im + self.twiddle10.re * x823p.im + self.twiddle12.re * x922p.im + self.twiddle3.re * x1021p.im + self.twiddle6.re * x1120p.im + self.twiddle15.re * x1219p.im + self.twiddle7.re * x1318p.im + self.twiddle2.re * x1417p.im + self.twiddle11.re * x1516p.im; let b922im_b = self.twiddle9.im * x130n.re + -self.twiddle13.im * x229n.re + -self.twiddle4.im * x328n.re + self.twiddle5.im * x427n.re + self.twiddle14.im * x526n.re + -self.twiddle8.im * x625n.re + self.twiddle1.im * x724n.re + self.twiddle10.im * x823n.re + -self.twiddle12.im * x922n.re + -self.twiddle3.im * x1021n.re + self.twiddle6.im * x1120n.re + self.twiddle15.im * x1219n.re + -self.twiddle7.im * x1318n.re + self.twiddle2.im * x1417n.re + self.twiddle11.im * x1516n.re; let b1021im_a = buffer.load(0).im + self.twiddle10.re * x130p.im + self.twiddle11.re * x229p.im + self.twiddle1.re * x328p.im + self.twiddle9.re * x427p.im + self.twiddle12.re * x526p.im + self.twiddle2.re * x625p.im + self.twiddle8.re * x724p.im + self.twiddle13.re * x823p.im + self.twiddle3.re * x922p.im + self.twiddle7.re * x1021p.im + self.twiddle14.re * x1120p.im + self.twiddle4.re * x1219p.im + self.twiddle6.re * x1318p.im + self.twiddle15.re * x1417p.im + self.twiddle5.re * x1516p.im; let b1021im_b = self.twiddle10.im * x130n.re + -self.twiddle11.im * x229n.re + -self.twiddle1.im * x328n.re + self.twiddle9.im * x427n.re + -self.twiddle12.im * x526n.re + -self.twiddle2.im * x625n.re + self.twiddle8.im * x724n.re + -self.twiddle13.im * x823n.re + -self.twiddle3.im * x922n.re + self.twiddle7.im * x1021n.re + -self.twiddle14.im * x1120n.re + -self.twiddle4.im * x1219n.re + self.twiddle6.im * x1318n.re + -self.twiddle15.im * x1417n.re + -self.twiddle5.im * x1516n.re; let b1120im_a = buffer.load(0).im + self.twiddle11.re * x130p.im + self.twiddle9.re * x229p.im + self.twiddle2.re * x328p.im + self.twiddle13.re * x427p.im + self.twiddle7.re * x526p.im + self.twiddle4.re * x625p.im + self.twiddle15.re * x724p.im + self.twiddle5.re * x823p.im + self.twiddle6.re * x922p.im + self.twiddle14.re * x1021p.im + self.twiddle3.re * x1120p.im + self.twiddle8.re * x1219p.im + self.twiddle12.re * x1318p.im + self.twiddle1.re * x1417p.im + self.twiddle10.re * x1516p.im; let b1120im_b = self.twiddle11.im * x130n.re + -self.twiddle9.im * x229n.re + self.twiddle2.im * x328n.re + self.twiddle13.im * x427n.re + -self.twiddle7.im * x526n.re + self.twiddle4.im * x625n.re + self.twiddle15.im * x724n.re + -self.twiddle5.im * x823n.re + self.twiddle6.im * x922n.re + -self.twiddle14.im * x1021n.re + -self.twiddle3.im * x1120n.re + self.twiddle8.im * x1219n.re + -self.twiddle12.im * x1318n.re + -self.twiddle1.im * x1417n.re + self.twiddle10.im * x1516n.re; let b1219im_a = buffer.load(0).im + self.twiddle12.re * x130p.im + self.twiddle7.re * x229p.im + self.twiddle5.re * x328p.im + self.twiddle14.re * x427p.im + self.twiddle2.re * x526p.im + self.twiddle10.re * x625p.im + self.twiddle9.re * x724p.im + self.twiddle3.re * x823p.im + self.twiddle15.re * x922p.im + self.twiddle4.re * x1021p.im + self.twiddle8.re * x1120p.im + self.twiddle11.re * x1219p.im + self.twiddle1.re * x1318p.im + self.twiddle13.re * x1417p.im + self.twiddle6.re * x1516p.im; let b1219im_b = self.twiddle12.im * x130n.re + -self.twiddle7.im * x229n.re + 
self.twiddle5.im * x328n.re + -self.twiddle14.im * x427n.re + -self.twiddle2.im * x526n.re + self.twiddle10.im * x625n.re + -self.twiddle9.im * x724n.re + self.twiddle3.im * x823n.re + self.twiddle15.im * x922n.re + -self.twiddle4.im * x1021n.re + self.twiddle8.im * x1120n.re + -self.twiddle11.im * x1219n.re + self.twiddle1.im * x1318n.re + self.twiddle13.im * x1417n.re + -self.twiddle6.im * x1516n.re; let b1318im_a = buffer.load(0).im + self.twiddle13.re * x130p.im + self.twiddle5.re * x229p.im + self.twiddle8.re * x328p.im + self.twiddle10.re * x427p.im + self.twiddle3.re * x526p.im + self.twiddle15.re * x625p.im + self.twiddle2.re * x724p.im + self.twiddle11.re * x823p.im + self.twiddle7.re * x922p.im + self.twiddle6.re * x1021p.im + self.twiddle12.re * x1120p.im + self.twiddle1.re * x1219p.im + self.twiddle14.re * x1318p.im + self.twiddle4.re * x1417p.im + self.twiddle9.re * x1516p.im; let b1318im_b = self.twiddle13.im * x130n.re + -self.twiddle5.im * x229n.re + self.twiddle8.im * x328n.re + -self.twiddle10.im * x427n.re + self.twiddle3.im * x526n.re + -self.twiddle15.im * x625n.re + -self.twiddle2.im * x724n.re + self.twiddle11.im * x823n.re + -self.twiddle7.im * x922n.re + self.twiddle6.im * x1021n.re + -self.twiddle12.im * x1120n.re + self.twiddle1.im * x1219n.re + self.twiddle14.im * x1318n.re + -self.twiddle4.im * x1417n.re + self.twiddle9.im * x1516n.re; let b1417im_a = buffer.load(0).im + self.twiddle14.re * x130p.im + self.twiddle3.re * x229p.im + self.twiddle11.re * x328p.im + self.twiddle6.re * x427p.im + self.twiddle8.re * x526p.im + self.twiddle9.re * x625p.im + self.twiddle5.re * x724p.im + self.twiddle12.re * x823p.im + self.twiddle2.re * x922p.im + self.twiddle15.re * x1021p.im + self.twiddle1.re * x1120p.im + self.twiddle13.re * x1219p.im + self.twiddle4.re * x1318p.im + self.twiddle10.re * x1417p.im + self.twiddle7.re * x1516p.im; let b1417im_b = self.twiddle14.im * x130n.re + -self.twiddle3.im * x229n.re + self.twiddle11.im * x328n.re + -self.twiddle6.im * x427n.re + self.twiddle8.im * x526n.re + -self.twiddle9.im * x625n.re + self.twiddle5.im * x724n.re + -self.twiddle12.im * x823n.re + self.twiddle2.im * x922n.re + -self.twiddle15.im * x1021n.re + -self.twiddle1.im * x1120n.re + self.twiddle13.im * x1219n.re + -self.twiddle4.im * x1318n.re + self.twiddle10.im * x1417n.re + -self.twiddle7.im * x1516n.re; let b1516im_a = buffer.load(0).im + self.twiddle15.re * x130p.im + self.twiddle1.re * x229p.im + self.twiddle14.re * x328p.im + self.twiddle2.re * x427p.im + self.twiddle13.re * x526p.im + self.twiddle3.re * x625p.im + self.twiddle12.re * x724p.im + self.twiddle4.re * x823p.im + self.twiddle11.re * x922p.im + self.twiddle5.re * x1021p.im + self.twiddle10.re * x1120p.im + self.twiddle6.re * x1219p.im + self.twiddle9.re * x1318p.im + self.twiddle7.re * x1417p.im + self.twiddle8.re * x1516p.im; let b1516im_b = self.twiddle15.im * x130n.re + -self.twiddle1.im * x229n.re + self.twiddle14.im * x328n.re + -self.twiddle2.im * x427n.re + self.twiddle13.im * x526n.re + -self.twiddle3.im * x625n.re + self.twiddle12.im * x724n.re + -self.twiddle4.im * x823n.re + self.twiddle11.im * x922n.re + -self.twiddle5.im * x1021n.re + self.twiddle10.im * x1120n.re + -self.twiddle6.im * x1219n.re + self.twiddle9.im * x1318n.re + -self.twiddle7.im * x1417n.re + self.twiddle8.im * x1516n.re; let out1re = b130re_a - b130re_b; let out1im = b130im_a + b130im_b; let out2re = b229re_a - b229re_b; let out2im = b229im_a + b229im_b; let out3re = b328re_a - b328re_b; let out3im = b328im_a + b328im_b; 
let out4re = b427re_a - b427re_b; let out4im = b427im_a + b427im_b; let out5re = b526re_a - b526re_b; let out5im = b526im_a + b526im_b; let out6re = b625re_a - b625re_b; let out6im = b625im_a + b625im_b; let out7re = b724re_a - b724re_b; let out7im = b724im_a + b724im_b; let out8re = b823re_a - b823re_b; let out8im = b823im_a + b823im_b; let out9re = b922re_a - b922re_b; let out9im = b922im_a + b922im_b; let out10re = b1021re_a - b1021re_b; let out10im = b1021im_a + b1021im_b; let out11re = b1120re_a - b1120re_b; let out11im = b1120im_a + b1120im_b; let out12re = b1219re_a - b1219re_b; let out12im = b1219im_a + b1219im_b; let out13re = b1318re_a - b1318re_b; let out13im = b1318im_a + b1318im_b; let out14re = b1417re_a - b1417re_b; let out14im = b1417im_a + b1417im_b; let out15re = b1516re_a - b1516re_b; let out15im = b1516im_a + b1516im_b; let out16re = b1516re_a + b1516re_b; let out16im = b1516im_a - b1516im_b; let out17re = b1417re_a + b1417re_b; let out17im = b1417im_a - b1417im_b; let out18re = b1318re_a + b1318re_b; let out18im = b1318im_a - b1318im_b; let out19re = b1219re_a + b1219re_b; let out19im = b1219im_a - b1219im_b; let out20re = b1120re_a + b1120re_b; let out20im = b1120im_a - b1120im_b; let out21re = b1021re_a + b1021re_b; let out21im = b1021im_a - b1021im_b; let out22re = b922re_a + b922re_b; let out22im = b922im_a - b922im_b; let out23re = b823re_a + b823re_b; let out23im = b823im_a - b823im_b; let out24re = b724re_a + b724re_b; let out24im = b724im_a - b724im_b; let out25re = b625re_a + b625re_b; let out25im = b625im_a - b625im_b; let out26re = b526re_a + b526re_b; let out26im = b526im_a - b526im_b; let out27re = b427re_a + b427re_b; let out27im = b427im_a - b427im_b; let out28re = b328re_a + b328re_b; let out28im = b328im_a - b328im_b; let out29re = b229re_a + b229re_b; let out29im = b229im_a - b229im_b; let out30re = b130re_a + b130re_b; let out30im = b130im_a - b130im_b; buffer.store(sum, 0); buffer.store( Complex { re: out1re, im: out1im, }, 1, ); buffer.store( Complex { re: out2re, im: out2im, }, 2, ); buffer.store( Complex { re: out3re, im: out3im, }, 3, ); buffer.store( Complex { re: out4re, im: out4im, }, 4, ); buffer.store( Complex { re: out5re, im: out5im, }, 5, ); buffer.store( Complex { re: out6re, im: out6im, }, 6, ); buffer.store( Complex { re: out7re, im: out7im, }, 7, ); buffer.store( Complex { re: out8re, im: out8im, }, 8, ); buffer.store( Complex { re: out9re, im: out9im, }, 9, ); buffer.store( Complex { re: out10re, im: out10im, }, 10, ); buffer.store( Complex { re: out11re, im: out11im, }, 11, ); buffer.store( Complex { re: out12re, im: out12im, }, 12, ); buffer.store( Complex { re: out13re, im: out13im, }, 13, ); buffer.store( Complex { re: out14re, im: out14im, }, 14, ); buffer.store( Complex { re: out15re, im: out15im, }, 15, ); buffer.store( Complex { re: out16re, im: out16im, }, 16, ); buffer.store( Complex { re: out17re, im: out17im, }, 17, ); buffer.store( Complex { re: out18re, im: out18im, }, 18, ); buffer.store( Complex { re: out19re, im: out19im, }, 19, ); buffer.store( Complex { re: out20re, im: out20im, }, 20, ); buffer.store( Complex { re: out21re, im: out21im, }, 21, ); buffer.store( Complex { re: out22re, im: out22im, }, 22, ); buffer.store( Complex { re: out23re, im: out23im, }, 23, ); buffer.store( Complex { re: out24re, im: out24im, }, 24, ); buffer.store( Complex { re: out25re, im: out25im, }, 25, ); buffer.store( Complex { re: out26re, im: out26im, }, 26, ); buffer.store( Complex { re: out27re, im: out27im, }, 27, ); buffer.store( 
Complex { re: out28re, im: out28im, }, 28, ); buffer.store( Complex { re: out29re, im: out29im, }, 29, ); buffer.store( Complex { re: out30re, im: out30im, }, 30, ); } } pub struct Butterfly32 { butterfly16: Butterfly16, butterfly8: Butterfly8, twiddles: [Complex; 7], } boilerplate_fft_butterfly!(Butterfly32, 32, |this: &Butterfly32<_>| this .butterfly8 .fft_direction()); impl Butterfly32 { pub fn new(direction: FftDirection) -> Self { Self { butterfly16: Butterfly16::new(direction), butterfly8: Butterfly8::new(direction), twiddles: [ twiddles::compute_twiddle(1, 32, direction), twiddles::compute_twiddle(2, 32, direction), twiddles::compute_twiddle(3, 32, direction), twiddles::compute_twiddle(4, 32, direction), twiddles::compute_twiddle(5, 32, direction), twiddles::compute_twiddle(6, 32, direction), twiddles::compute_twiddle(7, 32, direction), ], } } #[inline(never)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl LoadStore) { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch let mut scratch_evens = [ buffer.load(0), buffer.load(2), buffer.load(4), buffer.load(6), buffer.load(8), buffer.load(10), buffer.load(12), buffer.load(14), buffer.load(16), buffer.load(18), buffer.load(20), buffer.load(22), buffer.load(24), buffer.load(26), buffer.load(28), buffer.load(30), ]; let mut scratch_odds_n1 = [ buffer.load(1), buffer.load(5), buffer.load(9), buffer.load(13), buffer.load(17), buffer.load(21), buffer.load(25), buffer.load(29), ]; let mut scratch_odds_n3 = [ buffer.load(31), buffer.load(3), buffer.load(7), buffer.load(11), buffer.load(15), buffer.load(19), buffer.load(23), buffer.load(27), ]; // step 2: column FFTs self.butterfly16.perform_fft_contiguous(&mut scratch_evens); self.butterfly8.perform_fft_contiguous(&mut scratch_odds_n1); self.butterfly8.perform_fft_contiguous(&mut scratch_odds_n3); // step 3: apply twiddle factors scratch_odds_n1[1] = scratch_odds_n1[1] * self.twiddles[0]; scratch_odds_n3[1] = scratch_odds_n3[1] * self.twiddles[0].conj(); scratch_odds_n1[2] = scratch_odds_n1[2] * self.twiddles[1]; scratch_odds_n3[2] = scratch_odds_n3[2] * self.twiddles[1].conj(); scratch_odds_n1[3] = scratch_odds_n1[3] * self.twiddles[2]; scratch_odds_n3[3] = scratch_odds_n3[3] * self.twiddles[2].conj(); scratch_odds_n1[4] = scratch_odds_n1[4] * self.twiddles[3]; scratch_odds_n3[4] = scratch_odds_n3[4] * self.twiddles[3].conj(); scratch_odds_n1[5] = scratch_odds_n1[5] * self.twiddles[4]; scratch_odds_n3[5] = scratch_odds_n3[5] * self.twiddles[4].conj(); scratch_odds_n1[6] = scratch_odds_n1[6] * self.twiddles[5]; scratch_odds_n3[6] = scratch_odds_n3[6] * self.twiddles[5].conj(); scratch_odds_n1[7] = scratch_odds_n1[7] * self.twiddles[6]; scratch_odds_n3[7] = scratch_odds_n3[7] * self.twiddles[6].conj(); // step 4: cross FFTs Butterfly2::perform_fft_strided(&mut scratch_odds_n1[0], &mut scratch_odds_n3[0]); Butterfly2::perform_fft_strided(&mut scratch_odds_n1[1], &mut scratch_odds_n3[1]); Butterfly2::perform_fft_strided(&mut scratch_odds_n1[2], &mut scratch_odds_n3[2]); Butterfly2::perform_fft_strided(&mut scratch_odds_n1[3], &mut scratch_odds_n3[3]); Butterfly2::perform_fft_strided(&mut scratch_odds_n1[4], &mut scratch_odds_n3[4]); Butterfly2::perform_fft_strided(&mut scratch_odds_n1[5], &mut scratch_odds_n3[5]); Butterfly2::perform_fft_strided(&mut scratch_odds_n1[6], &mut scratch_odds_n3[6]); Butterfly2::perform_fft_strided(&mut scratch_odds_n1[7], &mut scratch_odds_n3[7]); // apply the butterfly 4 twiddle factor, which is just a 
rotation scratch_odds_n3[0] = twiddles::rotate_90(scratch_odds_n3[0], self.fft_direction()); scratch_odds_n3[1] = twiddles::rotate_90(scratch_odds_n3[1], self.fft_direction()); scratch_odds_n3[2] = twiddles::rotate_90(scratch_odds_n3[2], self.fft_direction()); scratch_odds_n3[3] = twiddles::rotate_90(scratch_odds_n3[3], self.fft_direction()); scratch_odds_n3[4] = twiddles::rotate_90(scratch_odds_n3[4], self.fft_direction()); scratch_odds_n3[5] = twiddles::rotate_90(scratch_odds_n3[5], self.fft_direction()); scratch_odds_n3[6] = twiddles::rotate_90(scratch_odds_n3[6], self.fft_direction()); scratch_odds_n3[7] = twiddles::rotate_90(scratch_odds_n3[7], self.fft_direction()); //step 5: copy/add/subtract data back to buffer buffer.store(scratch_evens[0] + scratch_odds_n1[0], 0); buffer.store(scratch_evens[1] + scratch_odds_n1[1], 1); buffer.store(scratch_evens[2] + scratch_odds_n1[2], 2); buffer.store(scratch_evens[3] + scratch_odds_n1[3], 3); buffer.store(scratch_evens[4] + scratch_odds_n1[4], 4); buffer.store(scratch_evens[5] + scratch_odds_n1[5], 5); buffer.store(scratch_evens[6] + scratch_odds_n1[6], 6); buffer.store(scratch_evens[7] + scratch_odds_n1[7], 7); buffer.store(scratch_evens[8] + scratch_odds_n3[0], 8); buffer.store(scratch_evens[9] + scratch_odds_n3[1], 9); buffer.store(scratch_evens[10] + scratch_odds_n3[2], 10); buffer.store(scratch_evens[11] + scratch_odds_n3[3], 11); buffer.store(scratch_evens[12] + scratch_odds_n3[4], 12); buffer.store(scratch_evens[13] + scratch_odds_n3[5], 13); buffer.store(scratch_evens[14] + scratch_odds_n3[6], 14); buffer.store(scratch_evens[15] + scratch_odds_n3[7], 15); buffer.store(scratch_evens[0] - scratch_odds_n1[0], 16); buffer.store(scratch_evens[1] - scratch_odds_n1[1], 17); buffer.store(scratch_evens[2] - scratch_odds_n1[2], 18); buffer.store(scratch_evens[3] - scratch_odds_n1[3], 19); buffer.store(scratch_evens[4] - scratch_odds_n1[4], 20); buffer.store(scratch_evens[5] - scratch_odds_n1[5], 21); buffer.store(scratch_evens[6] - scratch_odds_n1[6], 22); buffer.store(scratch_evens[7] - scratch_odds_n1[7], 23); buffer.store(scratch_evens[8] - scratch_odds_n3[0], 24); buffer.store(scratch_evens[9] - scratch_odds_n3[1], 25); buffer.store(scratch_evens[10] - scratch_odds_n3[2], 26); buffer.store(scratch_evens[11] - scratch_odds_n3[3], 27); buffer.store(scratch_evens[12] - scratch_odds_n3[4], 28); buffer.store(scratch_evens[13] - scratch_odds_n3[5], 29); buffer.store(scratch_evens[14] - scratch_odds_n3[6], 30); buffer.store(scratch_evens[15] - scratch_odds_n3[7], 31); } } #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; //the tests for all butterflies will be identical except for the identifiers used and size //so it's ideal for a macro macro_rules! 
test_butterfly_func { ($test_name:ident, $struct_name:ident, $size:expr) => { #[test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward); check_fft_algorithm::(&butterfly, $size, FftDirection::Forward); let butterfly_direction = $struct_name::new(FftDirection::Inverse); check_fft_algorithm::(&butterfly_direction, $size, FftDirection::Inverse); } }; } test_butterfly_func!(test_butterfly2, Butterfly2, 2); test_butterfly_func!(test_butterfly3, Butterfly3, 3); test_butterfly_func!(test_butterfly4, Butterfly4, 4); test_butterfly_func!(test_butterfly5, Butterfly5, 5); test_butterfly_func!(test_butterfly6, Butterfly6, 6); test_butterfly_func!(test_butterfly7, Butterfly7, 7); test_butterfly_func!(test_butterfly8, Butterfly8, 8); test_butterfly_func!(test_butterfly9, Butterfly9, 9); test_butterfly_func!(test_butterfly11, Butterfly11, 11); test_butterfly_func!(test_butterfly13, Butterfly13, 13); test_butterfly_func!(test_butterfly16, Butterfly16, 16); test_butterfly_func!(test_butterfly17, Butterfly17, 17); test_butterfly_func!(test_butterfly19, Butterfly19, 19); test_butterfly_func!(test_butterfly23, Butterfly23, 23); test_butterfly_func!(test_butterfly27, Butterfly27, 27); test_butterfly_func!(test_butterfly29, Butterfly29, 29); test_butterfly_func!(test_butterfly31, Butterfly31, 31); test_butterfly_func!(test_butterfly32, Butterfly32, 32); } rustfft-6.2.0/src/algorithm/dft.rs000064400000000000000000000275700072674642500152710ustar 00000000000000use num_complex::Complex; use num_traits::Zero; use crate::array_utils; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::{twiddles, FftDirection}; use crate::{Direction, Fft, FftNum, Length}; /// Naive O(n^2 ) Discrete Fourier Transform implementation /// /// This implementation is primarily used to test other FFT algorithms. 
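///
/// Conceptually, each output bin `k` is evaluated straight from the DFT definition,
/// `X[k] = sum over n of x[n] * exp(+/- 2 * pi * i * n * k / len)`, with the sign of the
/// exponent determined by the `FftDirection` passed to the constructor. That is O(n) work
/// per output bin and O(n^2) overall, which is why this type is meant for testing and
/// reference rather than for performance-sensitive code.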
/// /// ~~~ /// // Computes a naive DFT of size 123 /// use rustfft::algorithm::Dft; /// use rustfft::{Fft, FftDirection}; /// use rustfft::num_complex::Complex; /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 123]; /// /// let dft = Dft::new(123, FftDirection::Forward); /// dft.process(&mut buffer); /// ~~~ pub struct Dft { twiddles: Vec>, direction: FftDirection, } impl Dft { /// Preallocates necessary arrays and precomputes necessary data to efficiently compute Dft pub fn new(len: usize, direction: FftDirection) -> Self { let twiddles = (0..len) .map(|i| twiddles::compute_twiddle(i, len, direction)) .collect(); Self { twiddles, direction, } } fn perform_fft_out_of_place( &self, signal: &[Complex], spectrum: &mut [Complex], _scratch: &mut [Complex], ) { for k in 0..spectrum.len() { let output_cell = spectrum.get_mut(k).unwrap(); *output_cell = Zero::zero(); let mut twiddle_index = 0; for input_cell in signal { let twiddle = self.twiddles[twiddle_index]; *output_cell = *output_cell + twiddle * input_cell; twiddle_index += k; if twiddle_index >= self.twiddles.len() { twiddle_index -= self.twiddles.len(); } } } } } boilerplate_fft_oop!(Dft, |this: &Dft<_>| this.twiddles.len()); #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::{compare_vectors, random_signal}; use num_complex::Complex; use num_traits::Zero; use std::f32; fn dft(signal: &[Complex], spectrum: &mut [Complex]) { for (k, spec_bin) in spectrum.iter_mut().enumerate() { let mut sum = Zero::zero(); for (i, &x) in signal.iter().enumerate() { let angle = -1f32 * (i * k) as f32 * 2f32 * f32::consts::PI / signal.len() as f32; let twiddle = Complex::from_polar(1f32, angle); sum = sum + twiddle * x; } *spec_bin = sum; } } #[test] fn test_matches_dft() { let n = 4; for len in 1..20 { let dft_instance = Dft::new(len, FftDirection::Forward); assert_eq!( dft_instance.len(), len, "Dft instance reported incorrect length" ); let input = random_signal(len * n); let mut expected_output = input.clone(); // Compute the control data using our simplified Dft definition for (input_chunk, output_chunk) in input.chunks(len).zip(expected_output.chunks_mut(len)) { dft(input_chunk, output_chunk); } // test process() { let mut inplace_buffer = input.clone(); dft_instance.process(&mut inplace_buffer); assert!( compare_vectors(&expected_output, &inplace_buffer), "process() failed, length = {}", len ); } // test process_with_scratch() { let mut inplace_with_scratch_buffer = input.clone(); let mut inplace_scratch = vec![Zero::zero(); dft_instance.get_inplace_scratch_len()]; dft_instance .process_with_scratch(&mut inplace_with_scratch_buffer, &mut inplace_scratch); assert!( compare_vectors(&expected_output, &inplace_with_scratch_buffer), "process_inplace() failed, length = {}", len ); // one more thing: make sure that the Dft algorithm even works with dirty scratch space for item in inplace_scratch.iter_mut() { *item = Complex::new(100.0, 100.0); } inplace_with_scratch_buffer.copy_from_slice(&input); dft_instance .process_with_scratch(&mut inplace_with_scratch_buffer, &mut inplace_scratch); assert!( compare_vectors(&expected_output, &inplace_with_scratch_buffer), "process_with_scratch() failed the 'dirty scratch' test for len = {}", len ); } // test process_outofplace_with_scratch { let mut outofplace_input = input.clone(); let mut outofplace_output = expected_output.clone(); dft_instance.process_outofplace_with_scratch( &mut outofplace_input, &mut outofplace_output, &mut [], ); assert!( compare_vectors(&expected_output, 
&outofplace_output), "process_outofplace_with_scratch() failed, length = {}", len ); } } //verify that it doesn't crash or infinite loop if we have a length of 0 let zero_dft = Dft::new(0, FftDirection::Forward); let mut zero_input: Vec> = Vec::new(); let mut zero_output: Vec> = Vec::new(); let mut zero_scratch: Vec> = Vec::new(); zero_dft.process(&mut zero_input); zero_dft.process_with_scratch(&mut zero_input, &mut zero_scratch); zero_dft.process_outofplace_with_scratch( &mut zero_input, &mut zero_output, &mut zero_scratch, ); } /// Returns true if our `dft` function calculates the given output from the /// given input, and if rustfft's Dft struct does the same fn test_dft_correct(input: &[Complex], expected_output: &[Complex]) { assert_eq!(input.len(), expected_output.len()); let len = input.len(); let mut reference_output = vec![Zero::zero(); len]; dft(&input, &mut reference_output); assert!( compare_vectors(expected_output, &reference_output), "Reference implementation failed for len={}", len ); let dft_instance = Dft::new(len, FftDirection::Forward); // test process() { let mut inplace_buffer = input.to_vec(); dft_instance.process(&mut inplace_buffer); assert!( compare_vectors(&expected_output, &inplace_buffer), "process() failed, length = {}", len ); } // test process_with_scratch() { let mut inplace_with_scratch_buffer = input.to_vec(); let mut inplace_scratch = vec![Zero::zero(); dft_instance.get_inplace_scratch_len()]; dft_instance .process_with_scratch(&mut inplace_with_scratch_buffer, &mut inplace_scratch); assert!( compare_vectors(&expected_output, &inplace_with_scratch_buffer), "process_inplace() failed, length = {}", len ); // one more thing: make sure that the Dft algorithm even works with dirty scratch space for item in inplace_scratch.iter_mut() { *item = Complex::new(100.0, 100.0); } inplace_with_scratch_buffer.copy_from_slice(&input); dft_instance .process_with_scratch(&mut inplace_with_scratch_buffer, &mut inplace_scratch); assert!( compare_vectors(&expected_output, &inplace_with_scratch_buffer), "process_with_scratch() failed the 'dirty scratch' test for len = {}", len ); } // test process_outofplace_with_scratch { let mut outofplace_input = input.to_vec(); let mut outofplace_output = expected_output.to_vec(); dft_instance.process_outofplace_with_scratch( &mut outofplace_input, &mut outofplace_output, &mut [], ); assert!( compare_vectors(&expected_output, &outofplace_output), "process_outofplace_with_scratch() failed, length = {}", len ); } } #[test] fn test_dft_known_len_2() { let signal = [ Complex { re: 1f32, im: 0f32 }, Complex { re: -1f32, im: 0f32, }, ]; let spectrum = [ Complex { re: 0f32, im: 0f32 }, Complex { re: 2f32, im: 0f32 }, ]; test_dft_correct(&signal[..], &spectrum[..]); } #[test] fn test_dft_known_len_3() { let signal = [ Complex { re: 1f32, im: 1f32 }, Complex { re: 2f32, im: -3f32, }, Complex { re: -1f32, im: 4f32, }, ]; let spectrum = [ Complex { re: 2f32, im: 2f32 }, Complex { re: -5.562177f32, im: -2.098076f32, }, Complex { re: 6.562178f32, im: 3.09807f32, }, ]; test_dft_correct(&signal[..], &spectrum[..]); } #[test] fn test_dft_known_len_4() { let signal = [ Complex { re: 0f32, im: 1f32 }, Complex { re: 2.5f32, im: -3f32, }, Complex { re: -1f32, im: -1f32, }, Complex { re: 4f32, im: 0f32 }, ]; let spectrum = [ Complex { re: 5.5f32, im: -3f32, }, Complex { re: -2f32, im: 3.5f32, }, Complex { re: -7.5f32, im: 3f32, }, Complex { re: 4f32, im: 0.5f32, }, ]; test_dft_correct(&signal[..], &spectrum[..]); } #[test] fn test_dft_known_len_6() { let 
signal = [ Complex { re: 1f32, im: 1f32 }, Complex { re: 2f32, im: 2f32 }, Complex { re: 3f32, im: 3f32 }, Complex { re: 4f32, im: 4f32 }, Complex { re: 5f32, im: 5f32 }, Complex { re: 6f32, im: 6f32 }, ]; let spectrum = [ Complex { re: 21f32, im: 21f32, }, Complex { re: -8.16f32, im: 2.16f32, }, Complex { re: -4.76f32, im: -1.24f32, }, Complex { re: -3f32, im: -3f32, }, Complex { re: -1.24f32, im: -4.76f32, }, Complex { re: 2.16f32, im: -8.16f32, }, ]; test_dft_correct(&signal[..], &spectrum[..]); } } rustfft-6.2.0/src/algorithm/good_thomas_algorithm.rs000064400000000000000000000620070072674642500210570ustar 00000000000000use std::cmp::max; use std::sync::Arc; use num_complex::Complex; use num_integer::Integer; use strength_reduce::StrengthReducedUsize; use transpose; use crate::array_utils; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::{common::FftNum, FftDirection}; use crate::{Direction, Fft, Length}; /// Implementation of the [Good-Thomas Algorithm (AKA Prime Factor Algorithm)](https://en.wikipedia.org/wiki/Prime-factor_FFT_algorithm) /// /// This algorithm factors a size n FFT into n1 * n2, where GCD(n1, n2) == 1 /// /// Conceptually, this algorithm is very similar to the Mixed-Radix, except because GCD(n1, n2) == 1 we can do some /// number theory trickery to reduce the number of floating-point multiplications and additions. Additionally, It can /// be faster than Mixed-Radix at sizes below 10,000 or so. /// /// ~~~ /// // Computes a forward FFT of size 1200, using the Good-Thomas Algorithm /// use rustfft::algorithm::GoodThomasAlgorithm; /// use rustfft::{Fft, FftPlanner}; /// use rustfft::num_complex::Complex; /// use rustfft::num_traits::Zero; /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1200]; /// /// // we need to find an n1 and n2 such that n1 * n2 == 1200 and GCD(n1, n2) == 1 /// // n1 = 48 and n2 = 25 satisfies this /// let mut planner = FftPlanner::new(); /// let inner_fft_n1 = planner.plan_fft_forward(48); /// let inner_fft_n2 = planner.plan_fft_forward(25); /// /// // the good-thomas FFT length will be inner_fft_n1.len() * inner_fft_n2.len() = 1200 /// let fft = GoodThomasAlgorithm::new(inner_fft_n1, inner_fft_n2); /// fft.process(&mut buffer); /// ~~~ pub struct GoodThomasAlgorithm { width: usize, width_size_fft: Arc>, height: usize, height_size_fft: Arc>, reduced_width: StrengthReducedUsize, reduced_width_plus_one: StrengthReducedUsize, inplace_scratch_len: usize, outofplace_scratch_len: usize, len: usize, direction: FftDirection, } impl GoodThomasAlgorithm { /// Creates a FFT instance which will process inputs/outputs of size `width_fft.len() * height_fft.len()` /// /// `GCD(width_fft.len(), height_fft.len())` must be equal to 1 pub fn new(mut width_fft: Arc>, mut height_fft: Arc>) -> Self { assert_eq!( width_fft.fft_direction(), height_fft.fft_direction(), "width_fft and height_fft must have the same direction. 
got width direction={}, height direction={}", width_fft.fft_direction(), height_fft.fft_direction()); let mut width = width_fft.len(); let mut height = height_fft.len(); let direction = width_fft.fft_direction(); // This algorithm doesn't work if width and height aren't coprime let gcd = num_integer::gcd(width as i64, height as i64); assert!(gcd == 1, "Invalid width and height for Good-Thomas Algorithm (width={}, height={}): Inputs must be coprime", width, height); // The trick we're using for our index remapping will only work if width < height, so just swap them if it isn't if width > height { std::mem::swap(&mut width, &mut height); std::mem::swap(&mut width_fft, &mut height_fft); } let len = width * height; // Collect some data about what kind of scratch space our inner FFTs need let width_inplace_scratch = width_fft.get_inplace_scratch_len(); let height_inplace_scratch = height_fft.get_inplace_scratch_len(); let height_outofplace_scratch = height_fft.get_outofplace_scratch_len(); // Computing the scratch we'll require is a somewhat confusing process. // When we compute an out-of-place FFT, both of our inner FFTs are in-place // When we compute an inplace FFT, our inner width FFT will be inplace, and our height FFT will be out-of-place // For the out-of-place FFT, one of 2 things can happen regarding scratch: // - If the required scratch of both FFTs is <= self.len(), then we can use the input or output buffer as scratch, and so we need 0 extra scratch // - If either of the inner FFTs require more, then we'll have to request an entire scratch buffer for the inner FFTs, // whose size is the max of the two inner FFTs' required scratch let max_inner_inplace_scratch = max(height_inplace_scratch, width_inplace_scratch); let outofplace_scratch_len = if max_inner_inplace_scratch > len { max_inner_inplace_scratch } else { 0 }; // For the in-place FFT, again the best case is that we can just bounce data around between internal buffers, and the only inplace scratch we need is self.len() // If our height fft's OOP FFT requires any scratch, then we can tack that on the end of our own scratch, and use split_at_mut to separate our own from our internal FFT's // Likewise, if our width inplace FFT requires more inplace scracth than self.len(), we can tack that on to the end of our own inplace scratch. // Thus, the total inplace scratch is our own length plus the max of what the two inner FFTs will need let inplace_scratch_len = len + max( if width_inplace_scratch > len { width_inplace_scratch } else { 0 }, height_outofplace_scratch, ); Self { width, width_size_fft: width_fft, height, height_size_fft: height_fft, reduced_width: StrengthReducedUsize::new(width), reduced_width_plus_one: StrengthReducedUsize::new(width + 1), inplace_scratch_len, outofplace_scratch_len, len, direction, } } fn reindex_input(&self, source: &[Complex], destination: &mut [Complex]) { // A critical part of the good-thomas algorithm is re-indexing the inputs and outputs. // To remap the inputs, we will use the CRT mapping, paired with the normal transpose we'd do for mixed radix. 
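// Illustrative note (small sizes chosen only for clarity): with width = 3 and height = 4 (len = 12),
// the CRT map sends input index n to row (n % height), column (n % width) of a height-by-width grid,
// i.e. destination[(n % height) * width + (n % width)] = source[n].
// For n = 7 that is row 7 % 4 = 3 and column 7 % 3 = 1, so destination index 3 * 3 + 1 = 10.
// The loop below produces this layout while only ever adding (width + 1) and occasionally
// subtracting len or width, which is how it avoids per-element divisions.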
// // The algorithm for the CRT mapping will work like this: // 1: Keep an output index, initialized to 0 // 2: The output index will be incremented by width + 1 // 3: At the start of the row, compute if we will increment output_index past self.len() // 3a: If we will, then compute exactly how many increments it will take, // 3b: Increment however many times as we scan over the input row, copying each element to the output index // 3c: Subtract self.len() from output_index // 4: Scan over the rest of the row, incrementing output_index, and copying each element to output_index, thne incrementing output_index // 5: The first index of each row will be the final index of the previous row plus one, but because of our incrementing (width+1) inside the loop, we overshot, so at the end of the row, subtract width from output_index // // This ends up producing the same result as computing the multiplicative inverse of width mod height and etc by the CRT mapping, but with only one integer division per row, instead of one per element. let mut destination_index = 0; for mut source_row in source.chunks_exact(self.width) { let increments_until_cycle = 1 + (self.len() - destination_index) / self.reduced_width_plus_one; // If we will have to rollover output_index on this row, do it in a separate loop if increments_until_cycle < self.width { let (pre_cycle_row, post_cycle_row) = source_row.split_at(increments_until_cycle); for input_element in pre_cycle_row { destination[destination_index] = *input_element; destination_index += self.reduced_width_plus_one.get(); } // Store the split slice back to input_row, os that outside the loop, we can finish the job of iterating the row source_row = post_cycle_row; destination_index -= self.len(); } // Loop over the entire row (if we did not roll over) or what's left of the row (if we did) and keep incrementing output_row for input_element in source_row { destination[destination_index] = *input_element; destination_index += self.reduced_width_plus_one.get(); } // The first index of the next will be the final index this row, plus one. // But because of our incrementing (width+1) inside the loop above, we overshot, so subtract width, and we'll get (width + 1) - width = 1 destination_index -= self.width; } } fn reindex_output(&self, source: &[Complex], destination: &mut [Complex]) { // A critical part of the good-thomas algorithm is re-indexing the inputs and outputs. // To remap the outputs, we will use the ruritanian mapping, paired with the normal transpose we'd do for mixed radix. 
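// Illustrative note (same small case as the input map above: width = 3, height = 4, len = 12):
// the ruritanian map sends the element at row y, column x of the width-by-height grid to
// output index (x * width + y * height) % len. For (y, x) = (2, 3) that is
// (3 * 3 + 2 * 4) % 12 = 17 % 12 = 5.
// The loop below reaches the same result by walking each row with a stride of `width`,
// starting from the point where the index would wrap, so it needs only one division per row.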
// // The algorithm for the ruritanian mapping will work like this: // 1: At the start of every row, compute the output index = (y * self.height) % self.width // 2: We will increment this output index by self.width for every element // 3: Compute where in the row the output index will wrap around // 4: Instead of starting copying from the beginning of the row, start copying from after the rollover point // 5: When we hit the end of the row, continue from the beginning of the row, continuing to increment the output index by self.width // // This achieves the same result as the modular arithmetic ofthe ruritanian mapping, but with only one integer divison per row, instead of one per element for (y, source_chunk) in source.chunks_exact(self.height).enumerate() { let (quotient, remainder) = StrengthReducedUsize::div_rem(y * self.height, self.reduced_width); // Compute our base index and starting point in the row let mut destination_index = remainder; let start_x = self.height - quotient; // Process the first part of the row for x in start_x..self.height { destination[destination_index] = source_chunk[x]; destination_index += self.width; } // Wrap back around to the beginning of the row and keep incrementing for x in 0..start_x { destination[destination_index] = source_chunk[x]; destination_index += self.width; } } } fn perform_fft_inplace(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { let (scratch, inner_scratch) = scratch.split_at_mut(self.len()); // Re-index the input, copying from the buffer to the scratch in the process self.reindex_input(buffer, scratch); // run FFTs of size `width` let width_scratch = if inner_scratch.len() > buffer.len() { &mut inner_scratch[..] } else { &mut buffer[..] }; self.width_size_fft .process_with_scratch(scratch, width_scratch); // transpose transpose::transpose(scratch, buffer, self.width, self.height); // run FFTs of size 'height' self.height_size_fft .process_outofplace_with_scratch(buffer, scratch, inner_scratch); // Re-index the output, copying from the scratch to the buffer in the process self.reindex_output(scratch, buffer); } fn perform_fft_out_of_place( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ) { // Re-index the input, copying from the input to the output in the process self.reindex_input(input, output); // run FFTs of size `width` let width_scratch = if scratch.len() > input.len() { &mut scratch[..] } else { &mut input[..] }; self.width_size_fft .process_with_scratch(output, width_scratch); // transpose transpose::transpose(output, input, self.width, self.height); // run FFTs of size 'height' let height_scratch = if scratch.len() > output.len() { &mut scratch[..] } else { &mut output[..] }; self.height_size_fft .process_with_scratch(input, height_scratch); // Re-index the output, copying from the input to the output in the process self.reindex_output(input, output); } } boilerplate_fft!( GoodThomasAlgorithm, |this: &GoodThomasAlgorithm<_>| this.len, |this: &GoodThomasAlgorithm<_>| this.inplace_scratch_len, |this: &GoodThomasAlgorithm<_>| this.outofplace_scratch_len ); /// Implementation of the Good-Thomas Algorithm, specialized for smaller input sizes /// /// This algorithm factors a size n FFT into n1 * n2, where GCD(n1, n2) == 1 /// /// Conceptually, this algorithm is very similar to MixedRadix, except because GCD(n1, n2) == 1 we can do some /// number theory trickery to reduce the number of floating point operations. 
It typically performs /// better than MixedRadixSmall, especially at the smallest sizes. /// /// ~~~ /// // Computes a forward FFT of size 56 using GoodThomasAlgorithmSmall /// use std::sync::Arc; /// use rustfft::algorithm::GoodThomasAlgorithmSmall; /// use rustfft::algorithm::butterflies::{Butterfly7, Butterfly8}; /// use rustfft::{Fft, FftDirection}; /// use rustfft::num_complex::Complex; /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 56]; /// /// // we need to find an n1 and n2 such that n1 * n2 == 56 and GCD(n1, n2) == 1 /// // n1 = 7 and n2 = 8 satisfies this /// let inner_fft_n1 = Arc::new(Butterfly7::new(FftDirection::Forward)); /// let inner_fft_n2 = Arc::new(Butterfly8::new(FftDirection::Forward)); /// /// // the good-thomas FFT length will be inner_fft_n1.len() * inner_fft_n2.len() = 56 /// let fft = GoodThomasAlgorithmSmall::new(inner_fft_n1, inner_fft_n2); /// fft.process(&mut buffer); /// ~~~ pub struct GoodThomasAlgorithmSmall { width: usize, width_size_fft: Arc>, height: usize, height_size_fft: Arc>, input_output_map: Box<[usize]>, direction: FftDirection, } impl GoodThomasAlgorithmSmall { /// Creates a FFT instance which will process inputs/outputs of size `width_fft.len() * height_fft.len()` /// /// `GCD(width_fft.len(), height_fft.len())` must be equal to 1 pub fn new(width_fft: Arc>, height_fft: Arc>) -> Self { assert_eq!( width_fft.fft_direction(), height_fft.fft_direction(), "n1_fft and height_fft must have the same direction. got width direction={}, height direction={}", width_fft.fft_direction(), height_fft.fft_direction()); let width = width_fft.len(); let height = height_fft.len(); let len = width * height; assert_eq!(width_fft.get_outofplace_scratch_len(), 0, "GoodThomasAlgorithmSmall should only be used with algorithms that require 0 out-of-place scratch. Width FFT (len={}) requires {}, should require 0", width, width_fft.get_outofplace_scratch_len()); assert_eq!(height_fft.get_outofplace_scratch_len(), 0, "GoodThomasAlgorithmSmall should only be used with algorithms that require 0 out-of-place scratch. Height FFT (len={}) requires {}, should require 0", height, height_fft.get_outofplace_scratch_len()); assert!(width_fft.get_inplace_scratch_len() <= width, "GoodThomasAlgorithmSmall should only be used with algorithms that require little inplace scratch. Width FFT (len={}) requires {}, should require {} or less", width, width_fft.get_inplace_scratch_len(), width); assert!(height_fft.get_inplace_scratch_len() <= height, "GoodThomasAlgorithmSmall should only be used with algorithms that require little inplace scratch. Height FFT (len={}) requires {}, should require {} or less", height, height_fft.get_inplace_scratch_len(), height); // compute the multiplicative inverse of width mod height and vice versa. 
x will be width mod height, and y will be height mod width let gcd_data = i64::extended_gcd(&(width as i64), &(height as i64)); assert!(gcd_data.gcd == 1, "Invalid input width and height to Good-Thomas Algorithm: ({},{}): Inputs must be coprime", width, height); // width_inverse or height_inverse might be negative, make it positive by wrapping let width_inverse = if gcd_data.x >= 0 { gcd_data.x } else { gcd_data.x + height as i64 } as usize; let height_inverse = if gcd_data.y >= 0 { gcd_data.y } else { gcd_data.y + width as i64 } as usize; // NOTE: we are precomputing the input and output reordering indexes, because benchmarking shows that it's 10-20% faster // If we wanted to optimize for memory use or setup time instead of multiple-FFT speed, we could compute these on the fly in the perform_fft() method let input_iter = (0..len) .map(|i| (i % width, i / width)) .map(|(x, y)| (x * height + y * width) % len); let output_iter = (0..len).map(|i| (i % height, i / height)).map(|(y, x)| { (x * height * height_inverse as usize + y * width * width_inverse as usize) % len }); let input_output_map: Vec = input_iter.chain(output_iter).collect(); Self { direction: width_fft.fft_direction(), width, width_size_fft: width_fft, height, height_size_fft: height_fft, input_output_map: input_output_map.into_boxed_slice(), } } fn perform_fft_out_of_place( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { // These asserts are for the unsafe blocks down below. we're relying on the optimizer to get rid of this assert assert_eq!(self.len(), input.len()); assert_eq!(self.len(), output.len()); let (input_map, output_map) = self.input_output_map.split_at(self.len()); // copy the input using our reordering mapping for (output_element, &input_index) in output.iter_mut().zip(input_map.iter()) { *output_element = input[input_index]; } // run FFTs of size `width` self.width_size_fft.process_with_scratch(output, input); // transpose unsafe { array_utils::transpose_small(self.width, self.height, output, input) }; // run FFTs of size 'height' self.height_size_fft.process_with_scratch(input, output); // copy to the output, using our output redordeing mapping for (input_element, &output_index) in input.iter().zip(output_map.iter()) { output[output_index] = *input_element; } } fn perform_fft_inplace(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { // These asserts are for the unsafe blocks down below. 
we're relying on the optimizer to get rid of this assert assert_eq!(self.len(), buffer.len()); assert_eq!(self.len(), scratch.len()); let (input_map, output_map) = self.input_output_map.split_at(self.len()); // copy the input using our reordering mapping for (output_element, &input_index) in scratch.iter_mut().zip(input_map.iter()) { *output_element = buffer[input_index]; } // run FFTs of size `width` self.width_size_fft.process_with_scratch(scratch, buffer); // transpose unsafe { array_utils::transpose_small(self.width, self.height, scratch, buffer) }; // run FFTs of size 'height' self.height_size_fft .process_outofplace_with_scratch(buffer, scratch, &mut []); // copy to the output, using our output redordeing mapping for (input_element, &output_index) in scratch.iter().zip(output_map.iter()) { buffer[output_index] = *input_element; } } } boilerplate_fft!( GoodThomasAlgorithmSmall, |this: &GoodThomasAlgorithmSmall<_>| this.width * this.height, |this: &GoodThomasAlgorithmSmall<_>| this.len(), |_| 0 ); #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; use crate::{algorithm::Dft, test_utils::BigScratchAlgorithm}; use num_integer::gcd; use num_traits::Zero; use std::sync::Arc; #[test] fn test_good_thomas() { for width in 1..12 { for height in 1..12 { if gcd(width, height) == 1 { test_good_thomas_with_lengths(width, height, FftDirection::Forward); test_good_thomas_with_lengths(width, height, FftDirection::Inverse); } } } } #[test] fn test_good_thomas_small() { let butterfly_sizes = [2, 3, 4, 5, 6, 7, 8, 16]; for width in &butterfly_sizes { for height in &butterfly_sizes { if gcd(*width, *height) == 1 { test_good_thomas_small_with_lengths(*width, *height, FftDirection::Forward); test_good_thomas_small_with_lengths(*width, *height, FftDirection::Inverse); } } } } fn test_good_thomas_with_lengths(width: usize, height: usize, direction: FftDirection) { let width_fft = Arc::new(Dft::new(width, direction)) as Arc>; let height_fft = Arc::new(Dft::new(height, direction)) as Arc>; let fft = GoodThomasAlgorithm::new(width_fft, height_fft); check_fft_algorithm(&fft, width * height, direction); } fn test_good_thomas_small_with_lengths(width: usize, height: usize, direction: FftDirection) { let width_fft = Arc::new(Dft::new(width, direction)) as Arc>; let height_fft = Arc::new(Dft::new(height, direction)) as Arc>; let fft = GoodThomasAlgorithmSmall::new(width_fft, height_fft); check_fft_algorithm(&fft, width * height, direction); } #[test] fn test_output_mapping() { let width = 15; for height in 3..width { if gcd(width, height) == 1 { let width_fft = Arc::new(Dft::new(width, FftDirection::Forward)) as Arc>; let height_fft = Arc::new(Dft::new(height, FftDirection::Forward)) as Arc>; let fft = GoodThomasAlgorithm::new(width_fft, height_fft); let mut buffer = vec![Complex { re: 0.0, im: 0.0 }; fft.len()]; fft.process(&mut buffer); } } } // Verify that the Good-Thomas algorithm correctly provides scratch space to inner FFTs #[test] fn test_good_thomas_inner_scratch() { let scratch_lengths = [1, 5, 24]; let mut inner_ffts = Vec::new(); for &len in &scratch_lengths { for &inplace_scratch in &scratch_lengths { for &outofplace_scratch in &scratch_lengths { inner_ffts.push(Arc::new(BigScratchAlgorithm { len, inplace_scratch, outofplace_scratch, direction: FftDirection::Forward, }) as Arc>); } } } for width_fft in inner_ffts.iter() { for height_fft in inner_ffts.iter() { if width_fft.len() == height_fft.len() { continue; } let fft = 
GoodThomasAlgorithm::new(Arc::clone(width_fft), Arc::clone(height_fft)); let mut inplace_buffer = vec![Complex::zero(); fft.len()]; let mut inplace_scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; fft.process_with_scratch(&mut inplace_buffer, &mut inplace_scratch); let mut outofplace_input = vec![Complex::zero(); fft.len()]; let mut outofplace_output = vec![Complex::zero(); fft.len()]; let mut outofplace_scratch = vec![Complex::zero(); fft.get_outofplace_scratch_len()]; fft.process_outofplace_with_scratch( &mut outofplace_input, &mut outofplace_output, &mut outofplace_scratch, ); } } } } rustfft-6.2.0/src/algorithm/mixed_radix.rs000064400000000000000000000411700072674642500170010ustar 00000000000000use std::cmp::max; use std::sync::Arc; use num_complex::Complex; use num_traits::Zero; use transpose; use crate::array_utils; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::{common::FftNum, twiddles, FftDirection}; use crate::{Direction, Fft, Length}; /// Implementation of the Mixed-Radix FFT algorithm /// /// This algorithm factors a size n FFT into n1 * n2, computes several inner FFTs of size n1 and n2, then combines the /// results to get the final answer /// /// ~~~ /// // Computes a forward FFT of size 1200, using the Mixed-Radix Algorithm /// use rustfft::algorithm::MixedRadix; /// use rustfft::{Fft, FftPlanner}; /// use rustfft::num_complex::Complex; /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1200]; /// /// // we need to find an n1 and n2 such that n1 * n2 == 1200 /// // n1 = 30 and n2 = 40 satisfies this /// let mut planner = FftPlanner::new(); /// let inner_fft_n1 = planner.plan_fft_forward(30); /// let inner_fft_n2 = planner.plan_fft_forward(40); /// /// // the mixed radix FFT length will be inner_fft_n1.len() * inner_fft_n2.len() = 1200 /// let fft = MixedRadix::new(inner_fft_n1, inner_fft_n2); /// fft.process(&mut buffer); /// ~~~ pub struct MixedRadix { twiddles: Box<[Complex]>, width_size_fft: Arc>, width: usize, height_size_fft: Arc>, height: usize, inplace_scratch_len: usize, outofplace_scratch_len: usize, direction: FftDirection, } impl MixedRadix { /// Creates a FFT instance which will process inputs/outputs of size `width_fft.len() * height_fft.len()` pub fn new(width_fft: Arc>, height_fft: Arc>) -> Self { assert_eq!( width_fft.fft_direction(), height_fft.fft_direction(), "width_fft and height_fft must have the same direction. got width direction={}, height direction={}", width_fft.fft_direction(), height_fft.fft_direction()); let direction = width_fft.fft_direction(); let width = width_fft.len(); let height = height_fft.len(); let len = width * height; let mut twiddles = vec![Complex::zero(); len]; for (x, twiddle_chunk) in twiddles.chunks_exact_mut(height).enumerate() { for (y, twiddle_element) in twiddle_chunk.iter_mut().enumerate() { *twiddle_element = twiddles::compute_twiddle(x * y, len, direction); } } // Collect some data about what kind of scratch space our inner FFTs need let height_inplace_scratch = height_fft.get_inplace_scratch_len(); let width_inplace_scratch = width_fft.get_inplace_scratch_len(); let width_outofplace_scratch = width_fft.get_outofplace_scratch_len(); // Computing the scratch we'll require is a somewhat confusing process. 
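// The short version, under the assumption that both inner FFTs report zero out-of-place scratch
// and no more in-place scratch than their own length (true for the butterflies and for typical
// inner algorithms): out-of-place processing of this MixedRadix then needs no extra scratch at
// all, and in-place processing needs exactly self.len() elements. The general rules follow below.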
// When we compute an out-of-place FFT, both of our inner FFTs are in-place // When we compute an inplace FFT, our inner width FFT will be inplace, and our height FFT will be out-of-place // For the out-of-place FFT, one of 2 things can happen regarding scratch: // - If the required scratch of both FFTs is <= self.len(), then we can use the input or output buffer as scratch, and so we need 0 extra scratch // - If either of the inner FFTs require more, then we'll have to request an entire scratch buffer for the inner FFTs, // whose size is the max of the two inner FFTs' required scratch let max_inner_inplace_scratch = max(height_inplace_scratch, width_inplace_scratch); let outofplace_scratch_len = if max_inner_inplace_scratch > len { max_inner_inplace_scratch } else { 0 }; // For the in-place FFT, again the best case is that we can just bounce data around between internal buffers, and the only inplace scratch we need is self.len() // If our width fft's OOP FFT requires any scratch, then we can tack that on the end of our own scratch, and use split_at_mut to separate our own from our internal FFT's // Likewise, if our height inplace FFT requires more inplace scracth than self.len(), we can tack that on to the end of our own inplace scratch. // Thus, the total inplace scratch is our own length plus the max of what the two inner FFTs will need let inplace_scratch_len = len + max( if height_inplace_scratch > len { height_inplace_scratch } else { 0 }, width_outofplace_scratch, ); Self { twiddles: twiddles.into_boxed_slice(), width_size_fft: width_fft, width: width, height_size_fft: height_fft, height: height, inplace_scratch_len, outofplace_scratch_len, direction, } } fn perform_fft_inplace(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { // SIX STEP FFT: let (scratch, inner_scratch) = scratch.split_at_mut(self.len()); // STEP 1: transpose transpose::transpose(buffer, scratch, self.width, self.height); // STEP 2: perform FFTs of size `height` let height_scratch = if inner_scratch.len() > buffer.len() { &mut inner_scratch[..] } else { &mut buffer[..] }; self.height_size_fft .process_with_scratch(scratch, height_scratch); // STEP 3: Apply twiddle factors for (element, twiddle) in scratch.iter_mut().zip(self.twiddles.iter()) { *element = *element * twiddle; } // STEP 4: transpose again transpose::transpose(scratch, buffer, self.height, self.width); // STEP 5: perform FFTs of size `width` self.width_size_fft .process_outofplace_with_scratch(buffer, scratch, inner_scratch); // STEP 6: transpose again transpose::transpose(scratch, buffer, self.width, self.height); } fn perform_fft_out_of_place( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ) { // SIX STEP FFT: // STEP 1: transpose transpose::transpose(input, output, self.width, self.height); // STEP 2: perform FFTs of size `height` let height_scratch = if scratch.len() > input.len() { &mut scratch[..] } else { &mut input[..] }; self.height_size_fft .process_with_scratch(output, height_scratch); // STEP 3: Apply twiddle factors for (element, twiddle) in output.iter_mut().zip(self.twiddles.iter()) { *element = *element * twiddle; } // STEP 4: transpose again transpose::transpose(output, input, self.height, self.width); // STEP 5: perform FFTs of size `width` let width_scratch = if scratch.len() > output.len() { &mut scratch[..] } else { &mut output[..] 
}; self.width_size_fft .process_with_scratch(input, width_scratch); // STEP 6: transpose again transpose::transpose(input, output, self.width, self.height); } } boilerplate_fft!( MixedRadix, |this: &MixedRadix<_>| this.twiddles.len(), |this: &MixedRadix<_>| this.inplace_scratch_len, |this: &MixedRadix<_>| this.outofplace_scratch_len ); /// Implementation of the Mixed-Radix FFT algorithm, specialized for smaller input sizes /// /// This algorithm factors a size n FFT into n1 * n2, computes several inner FFTs of size n1 and n2, then combines the /// results to get the final answer /// /// ~~~ /// // Computes a forward FFT of size 40 using MixedRadixSmall /// use std::sync::Arc; /// use rustfft::algorithm::MixedRadixSmall; /// use rustfft::algorithm::butterflies::{Butterfly5, Butterfly8}; /// use rustfft::{Fft, FftDirection}; /// use rustfft::num_complex::Complex; /// /// let len = 40; /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; len]; /// /// // we need to find an n1 and n2 such that n1 * n2 == 40 /// // n1 = 5 and n2 = 8 satisfies this /// let inner_fft_n1 = Arc::new(Butterfly5::new(FftDirection::Forward)); /// let inner_fft_n2 = Arc::new(Butterfly8::new(FftDirection::Forward)); /// /// // the mixed radix FFT length will be inner_fft_n1.len() * inner_fft_n2.len() = 40 /// let fft = MixedRadixSmall::new(inner_fft_n1, inner_fft_n2); /// fft.process(&mut buffer); /// ~~~ pub struct MixedRadixSmall { twiddles: Box<[Complex]>, width_size_fft: Arc>, width: usize, height_size_fft: Arc>, height: usize, direction: FftDirection, } impl MixedRadixSmall { /// Creates a FFT instance which will process inputs/outputs of size `width_fft.len() * height_fft.len()` pub fn new(width_fft: Arc>, height_fft: Arc>) -> Self { assert_eq!( width_fft.fft_direction(), height_fft.fft_direction(), "width_fft and height_fft must have the same direction. got width direction={}, height direction={}", width_fft.fft_direction(), height_fft.fft_direction()); // Verify that the inner FFTs don't require out-of-place scratch, and only arequire a small amount of inplace scratch let width = width_fft.len(); let height = height_fft.len(); let len = width * height; assert_eq!(width_fft.get_outofplace_scratch_len(), 0, "MixedRadixSmall should only be used with algorithms that require 0 out-of-place scratch. Width FFT (len={}) requires {}, should require 0", width, width_fft.get_outofplace_scratch_len()); assert_eq!(height_fft.get_outofplace_scratch_len(), 0, "MixedRadixSmall should only be used with algorithms that require 0 out-of-place scratch. Height FFT (len={}) requires {}, should require 0", height, height_fft.get_outofplace_scratch_len()); assert!(width_fft.get_inplace_scratch_len() <= width, "MixedRadixSmall should only be used with algorithms that require little inplace scratch. Width FFT (len={}) requires {}, should require {} or less", width, width_fft.get_inplace_scratch_len(), width); assert!(height_fft.get_inplace_scratch_len() <= height, "MixedRadixSmall should only be used with algorithms that require little inplace scratch. 
Height FFT (len={}) requires {}, should require {} or less", height, height_fft.get_inplace_scratch_len(), height); let direction = width_fft.fft_direction(); let mut twiddles = vec![Complex::zero(); len]; for (x, twiddle_chunk) in twiddles.chunks_exact_mut(height).enumerate() { for (y, twiddle_element) in twiddle_chunk.iter_mut().enumerate() { *twiddle_element = twiddles::compute_twiddle(x * y, len, direction); } } Self { twiddles: twiddles.into_boxed_slice(), width_size_fft: width_fft, width: width, height_size_fft: height_fft, height: height, direction, } } fn perform_fft_inplace(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { // SIX STEP FFT: // STEP 1: transpose unsafe { array_utils::transpose_small(self.width, self.height, buffer, scratch) }; // STEP 2: perform FFTs of size `height` self.height_size_fft.process_with_scratch(scratch, buffer); // STEP 3: Apply twiddle factors for (element, twiddle) in scratch.iter_mut().zip(self.twiddles.iter()) { *element = *element * twiddle; } // STEP 4: transpose again unsafe { array_utils::transpose_small(self.height, self.width, scratch, buffer) }; // STEP 5: perform FFTs of size `width` self.width_size_fft .process_outofplace_with_scratch(buffer, scratch, &mut []); // STEP 6: transpose again unsafe { array_utils::transpose_small(self.width, self.height, scratch, buffer) }; } fn perform_fft_out_of_place( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { // SIX STEP FFT: // STEP 1: transpose unsafe { array_utils::transpose_small(self.width, self.height, input, output) }; // STEP 2: perform FFTs of size `height` self.height_size_fft.process_with_scratch(output, input); // STEP 3: Apply twiddle factors for (element, twiddle) in output.iter_mut().zip(self.twiddles.iter()) { *element = *element * twiddle; } // STEP 4: transpose again unsafe { array_utils::transpose_small(self.height, self.width, output, input) }; // STEP 5: perform FFTs of size `width` self.width_size_fft.process_with_scratch(input, output); // STEP 6: transpose again unsafe { array_utils::transpose_small(self.width, self.height, input, output) }; } } boilerplate_fft!( MixedRadixSmall, |this: &MixedRadixSmall<_>| this.twiddles.len(), |this: &MixedRadixSmall<_>| this.len(), |_| 0 ); #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; use crate::{algorithm::Dft, test_utils::BigScratchAlgorithm}; use num_traits::Zero; use std::sync::Arc; #[test] fn test_mixed_radix() { for width in 1..7 { for height in 1..7 { test_mixed_radix_with_lengths(width, height, FftDirection::Forward); test_mixed_radix_with_lengths(width, height, FftDirection::Inverse); } } } #[test] fn test_mixed_radix_small() { for width in 2..7 { for height in 2..7 { test_mixed_radix_small_with_lengths(width, height, FftDirection::Forward); test_mixed_radix_small_with_lengths(width, height, FftDirection::Inverse); } } } fn test_mixed_radix_with_lengths(width: usize, height: usize, direction: FftDirection) { let width_fft = Arc::new(Dft::new(width, direction)) as Arc>; let height_fft = Arc::new(Dft::new(height, direction)) as Arc>; let fft = MixedRadix::new(width_fft, height_fft); check_fft_algorithm(&fft, width * height, direction); } fn test_mixed_radix_small_with_lengths(width: usize, height: usize, direction: FftDirection) { let width_fft = Arc::new(Dft::new(width, direction)) as Arc>; let height_fft = Arc::new(Dft::new(height, direction)) as Arc>; let fft = MixedRadixSmall::new(width_fft, height_fft); check_fft_algorithm(&fft, width * height, 
direction); } // Verify that the mixed radix algorithm correctly provides scratch space to inner FFTs #[test] fn test_mixed_radix_inner_scratch() { let scratch_lengths = [1, 5, 25]; let mut inner_ffts = Vec::new(); for &len in &scratch_lengths { for &inplace_scratch in &scratch_lengths { for &outofplace_scratch in &scratch_lengths { inner_ffts.push(Arc::new(BigScratchAlgorithm { len, inplace_scratch, outofplace_scratch, direction: FftDirection::Forward, }) as Arc>); } } } for width_fft in inner_ffts.iter() { for height_fft in inner_ffts.iter() { let fft = MixedRadix::new(Arc::clone(width_fft), Arc::clone(height_fft)); let mut inplace_buffer = vec![Complex::zero(); fft.len()]; let mut inplace_scratch = vec![Complex::zero(); fft.get_inplace_scratch_len()]; fft.process_with_scratch(&mut inplace_buffer, &mut inplace_scratch); let mut outofplace_input = vec![Complex::zero(); fft.len()]; let mut outofplace_output = vec![Complex::zero(); fft.len()]; let mut outofplace_scratch = vec![Complex::zero(); fft.get_outofplace_scratch_len()]; fft.process_outofplace_with_scratch( &mut outofplace_input, &mut outofplace_output, &mut outofplace_scratch, ); } } } } rustfft-6.2.0/src/algorithm/mod.rs000064400000000000000000000010670072674642500152640ustar 00000000000000mod bluesteins_algorithm; mod dft; mod good_thomas_algorithm; mod mixed_radix; mod raders_algorithm; mod radix3; mod radix4; /// Hardcoded size-specfic FFT algorithms pub mod butterflies; pub use self::bluesteins_algorithm::BluesteinsAlgorithm; pub use self::dft::Dft; pub use self::good_thomas_algorithm::{GoodThomasAlgorithm, GoodThomasAlgorithmSmall}; pub use self::mixed_radix::{MixedRadix, MixedRadixSmall}; pub use self::raders_algorithm::RadersAlgorithm; pub use self::radix3::Radix3; pub use self::radix4::{bitreversed_transpose, Radix4}; rustfft-6.2.0/src/algorithm/raders_algorithm.rs000064400000000000000000000252610072674642500200350ustar 00000000000000use std::sync::Arc; use num_complex::Complex; use num_integer::Integer; use num_traits::Zero; use primal_check::miller_rabin; use strength_reduce::StrengthReducedUsize; use crate::array_utils; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::math_utils; use crate::{common::FftNum, twiddles, FftDirection}; use crate::{Direction, Fft, Length}; /// Implementation of Rader's Algorithm /// /// This algorithm computes a prime-sized FFT in O(nlogn) time. It does this by converting this size-N FFT into a /// size-(N - 1) FFT, which is guaranteed to be composite. /// /// The worst case for this algorithm is when (N - 1) is 2 * prime, resulting in a /// [Cunningham Chain](https://en.wikipedia.org/wiki/Cunningham_chain) /// /// ~~~ /// // Computes a forward FFT of size 1201 (prime number), using Rader's Algorithm /// use rustfft::algorithm::RadersAlgorithm; /// use rustfft::{Fft, FftPlanner}; /// use rustfft::num_complex::Complex; /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1201]; /// /// // plan a FFT of size n - 1 = 1200 /// let mut planner = FftPlanner::new(); /// let inner_fft = planner.plan_fft_forward(1200); /// /// let fft = RadersAlgorithm::new(inner_fft); /// fft.process(&mut buffer); /// ~~~ /// /// Rader's Algorithm is relatively expensive compared to other FFT algorithms. Benchmarking shows that it is up to /// an order of magnitude slower than similar composite sizes. In the example size above of 1201, benchmarking shows /// that it takes 2.5x more time to compute than a FFT of size 1200. 
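// Worked example of the reordering Rader's Algorithm relies on (numbers chosen for illustration):
// for len = 7, 3 is a primitive root, and its successive powers mod 7 are 3, 2, 6, 4, 5, 1,
// visiting every nonzero index exactly once. Permuting the inputs and outputs along that sequence
// lets the N - 1 non-DC outputs be computed as a length-(N - 1) cyclic convolution (plus the x[0]
// term), and that convolution is what the inner size-(N - 1) FFT is used to evaluate.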
pub struct RadersAlgorithm { inner_fft: Arc>, inner_fft_data: Box<[Complex]>, primitive_root: usize, primitive_root_inverse: usize, len: StrengthReducedUsize, inplace_scratch_len: usize, outofplace_scratch_len: usize, direction: FftDirection, } impl RadersAlgorithm { /// Creates a FFT instance which will process inputs/outputs of size `inner_fft.len() + 1`. /// /// Note that this constructor is quite expensive to run; This algorithm must compute a FFT using `inner_fft` within the /// constructor. This further underlines the fact that Rader's Algorithm is more expensive to run than other /// FFT algorithms /// /// # Panics /// Panics if `inner_fft.len() + 1` is not a prime number. pub fn new(inner_fft: Arc>) -> Self { let inner_fft_len = inner_fft.len(); let len = inner_fft_len + 1; assert!(miller_rabin(len as u64), "For raders algorithm, inner_fft.len() + 1 must be prime. Expected prime number, got {} + 1 = {}", inner_fft_len, len); let direction = inner_fft.fft_direction(); let reduced_len = StrengthReducedUsize::new(len); // compute the primitive root and its inverse for this size let primitive_root = math_utils::primitive_root(len as u64).unwrap() as usize; // compute the multiplicative inverse of primative_root mod len and vice versa. // i64::extended_gcd will compute both the inverse of left mod right, and the inverse of right mod left, but we're only goingto use one of them // the primtive root inverse might be negative, if o make it positive by wrapping let gcd_data = i64::extended_gcd(&(primitive_root as i64), &(len as i64)); let primitive_root_inverse = if gcd_data.x >= 0 { gcd_data.x } else { gcd_data.x + len as i64 } as usize; // precompute the coefficients to use inside the process method let inner_fft_scale = T::one() / T::from_usize(inner_fft_len).unwrap(); let mut inner_fft_input = vec![Complex::zero(); inner_fft_len]; let mut twiddle_input = 1; for input_cell in &mut inner_fft_input { let twiddle = twiddles::compute_twiddle(twiddle_input, len, direction); *input_cell = twiddle * inner_fft_scale; twiddle_input = (twiddle_input * primitive_root_inverse) % reduced_len; } let required_inner_scratch = inner_fft.get_inplace_scratch_len(); let extra_inner_scratch = if required_inner_scratch <= inner_fft_len { 0 } else { required_inner_scratch }; //precompute a FFT of our reordered twiddle factors let mut inner_fft_scratch = vec![Zero::zero(); required_inner_scratch]; inner_fft.process_with_scratch(&mut inner_fft_input, &mut inner_fft_scratch); Self { inner_fft, inner_fft_data: inner_fft_input.into_boxed_slice(), primitive_root, primitive_root_inverse, len: reduced_len, inplace_scratch_len: inner_fft_len + extra_inner_scratch, outofplace_scratch_len: extra_inner_scratch, direction, } } fn perform_fft_out_of_place( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ) { // The first output element is just the sum of all the input elements, and we need to store off the first input value let (output_first, output) = output.split_first_mut().unwrap(); let (input_first, input) = input.split_first_mut().unwrap(); // copy the input into the output, reordering as we go. also compute a sum of all elements let mut input_index = 1; for output_element in output.iter_mut() { input_index = (input_index * self.primitive_root) % self.len; let input_element = input[input_index - 1]; *output_element = input_element; } // perform the first of two inner FFTs let inner_scratch = if scratch.len() > 0 { &mut scratch[..] } else { &mut input[..] 
}; self.inner_fft.process_with_scratch(output, inner_scratch); // output[0] now contains the sum of elements 1..len. We need the sum of all elements, so all we have to do is add the first input *output_first = *input_first + output[0]; // multiply the inner result with our cached setup data // also conjugate every entry. this sets us up to do an inverse FFT // (because an inverse FFT is equivalent to a normal FFT where you conjugate both the inputs and outputs) for ((output_cell, input_cell), &multiple) in output .iter() .zip(input.iter_mut()) .zip(self.inner_fft_data.iter()) { *input_cell = (*output_cell * multiple).conj(); } // We need to add the first input value to all output values. We can accomplish this by adding it to the DC input of our inner ifft. // Of course, we have to conjugate it, just like we conjugated the complex multiplied above input[0] = input[0] + input_first.conj(); // execute the second FFT let inner_scratch = if scratch.len() > 0 { scratch } else { &mut output[..] }; self.inner_fft.process_with_scratch(input, inner_scratch); // copy the final values into the output, reordering as we go let mut output_index = 1; for input_element in input { output_index = (output_index * self.primitive_root_inverse) % self.len; output[output_index - 1] = input_element.conj(); } } fn perform_fft_inplace(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { // The first output element is just the sum of all the input elements, and we need to store off the first input value let (buffer_first, buffer) = buffer.split_first_mut().unwrap(); let buffer_first_val = *buffer_first; let (scratch, extra_scratch) = scratch.split_at_mut(self.len() - 1); // copy the buffer into the scratch, reordering as we go. also compute a sum of all elements let mut input_index = 1; for scratch_element in scratch.iter_mut() { input_index = (input_index * self.primitive_root) % self.len; let buffer_element = buffer[input_index - 1]; *scratch_element = buffer_element; } // perform the first of two inner FFTs let inner_scratch = if extra_scratch.len() > 0 { extra_scratch } else { &mut buffer[..] }; self.inner_fft.process_with_scratch(scratch, inner_scratch); // scratch[0] now contains the sum of elements 1..len. We need the sum of all elements, so all we have to do is add the first input *buffer_first = *buffer_first + scratch[0]; // multiply the inner result with our cached setup data // also conjugate every entry. this sets us up to do an inverse FFT // (because an inverse FFT is equivalent to a normal FFT where you conjugate both the inputs and outputs) for (scratch_cell, &twiddle) in scratch.iter_mut().zip(self.inner_fft_data.iter()) { *scratch_cell = (*scratch_cell * twiddle).conj(); } // We need to add the first input value to all output values. We can accomplish this by adding it to the DC input of our inner ifft. 
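// (Why this works: in an unnormalized (i)FFT, input[0] contributes exactly input[0] to every
// output bin, because its twiddle factor is always 1. So adding a constant to the DC input is
// the same as adding that constant to every output.)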
// Of course, we have to conjugate it, just like we conjugated the complex multiplied above scratch[0] = scratch[0] + buffer_first_val.conj(); // execute the second FFT self.inner_fft.process_with_scratch(scratch, inner_scratch); // copy the final values into the output, reordering as we go let mut output_index = 1; for scratch_element in scratch { output_index = (output_index * self.primitive_root_inverse) % self.len; buffer[output_index - 1] = scratch_element.conj(); } } } boilerplate_fft!( RadersAlgorithm, |this: &RadersAlgorithm<_>| this.len.get(), |this: &RadersAlgorithm<_>| this.inplace_scratch_len, |this: &RadersAlgorithm<_>| this.outofplace_scratch_len ); #[cfg(test)] mod unit_tests { use super::*; use crate::algorithm::Dft; use crate::test_utils::check_fft_algorithm; use std::sync::Arc; #[test] fn test_raders() { for len in 3..100 { if miller_rabin(len as u64) { test_raders_with_length(len, FftDirection::Forward); test_raders_with_length(len, FftDirection::Inverse); } } } fn test_raders_with_length(len: usize, direction: FftDirection) { let inner_fft = Arc::new(Dft::new(len - 1, direction)); let fft = RadersAlgorithm::new(inner_fft); check_fft_algorithm::(&fft, len, direction); } } rustfft-6.2.0/src/algorithm/radix3.rs000064400000000000000000000223470072674642500157030ustar 00000000000000use std::sync::Arc; use num_complex::Complex; use num_traits::Zero; use crate::algorithm::butterflies::{Butterfly1, Butterfly27, Butterfly3, Butterfly9}; use crate::array_utils; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::{common::FftNum, twiddles, FftDirection}; use crate::{Direction, Fft, Length}; /// FFT algorithm optimized for power-of-three sizes /// /// ~~~ /// // Computes a forward FFT of size 2187 /// use rustfft::algorithm::Radix3; /// use rustfft::{Fft, FftDirection}; /// use rustfft::num_complex::Complex; /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 2187]; /// /// let fft = Radix3::new(2187, FftDirection::Forward); /// fft.process(&mut buffer); /// ~~~ pub struct Radix3 { twiddles: Box<[Complex]>, butterfly3: Butterfly3, base_fft: Arc>, base_len: usize, len: usize, direction: FftDirection, } impl Radix3 { /// Preallocates necessary arrays and precomputes necessary data to efficiently compute the power-of-three FFT pub fn new(len: usize, direction: FftDirection) -> Self { // Compute the total power of 3 for this length. IE, len = 3^exponent let exponent = compute_logarithm(len, 3).unwrap_or_else(|| { panic!( "Radix3 algorithm requires a power-of-three input size. Got {}", len ) }); // figure out which base length we're going to use let (base_len, base_fft) = match exponent { 0 => (len, Arc::new(Butterfly1::new(direction)) as Arc>), 1 => (len, Arc::new(Butterfly3::new(direction)) as Arc>), 2 => (len, Arc::new(Butterfly9::new(direction)) as Arc>), _ => (27, Arc::new(Butterfly27::new(direction)) as Arc>), }; // precompute the twiddle factors this algorithm will use. 
// we're doing the same precomputation of twiddle factors as the mixed radix algorithm where width=3 and height=len/3 // but mixed radix only does one step and then calls itself recusrively, and this algorithm does every layer all the way down // so we're going to pack all the "layers" of twiddle factors into a single array, starting with the bottom layer and going up let mut twiddle_stride = len / (base_len * 3); let mut twiddle_factors = Vec::with_capacity(len * 2); while twiddle_stride > 0 { let num_rows = len / (twiddle_stride * 3); for i in 0..num_rows { for k in 1..3 { let twiddle = twiddles::compute_twiddle(i * k * twiddle_stride, len, direction); twiddle_factors.push(twiddle); } } twiddle_stride /= 3; } Self { twiddles: twiddle_factors.into_boxed_slice(), butterfly3: Butterfly3::new(direction), base_fft, base_len, len, direction, } } fn perform_fft_out_of_place( &self, signal: &[Complex], spectrum: &mut [Complex], _scratch: &mut [Complex], ) { // copy the data into the spectrum vector if self.len() == self.base_len { spectrum.copy_from_slice(signal); } else { bitreversed_transpose(self.base_len, signal, spectrum); } // Base-level FFTs self.base_fft.process_with_scratch(spectrum, &mut []); // cross-FFTs let mut current_size = self.base_len * 3; let mut layer_twiddles: &[Complex] = &self.twiddles; while current_size <= signal.len() { let num_rows = signal.len() / current_size; for i in 0..num_rows { unsafe { butterfly_3( &mut spectrum[i * current_size..], layer_twiddles, current_size / 3, &self.butterfly3, ) } } //skip past all the twiddle factors used in this layer let twiddle_offset = (current_size * 2) / 3; layer_twiddles = &layer_twiddles[twiddle_offset..]; current_size *= 3; } } } boilerplate_fft_oop!(Radix3, |this: &Radix3<_>| this.len); // Preparing for radix 3 is similar to a transpose, where the column index is bit reversed. // Use a lookup table to avoid repeating the slow bit reverse operations. // Unrolling the outer loop by a factor 4 helps speed things up. pub fn bitreversed_transpose(height: usize, input: &[T], output: &mut [T]) { let width = input.len() / height; let third_width = width / 3; let rev_digits = compute_logarithm(width, 3).unwrap(); // Let's make sure the arguments are ok assert!(input.len() == output.len()); for x in 0..third_width { let x0 = 3 * x; let x1 = 3 * x + 1; let x2 = 3 * x + 2; let x_rev = [ reverse_bits(x0, rev_digits), reverse_bits(x1, rev_digits), reverse_bits(x2, rev_digits), ]; // Assert that the the bit reversed indices will not exceed the length of the output. // The highest index the loop reaches is: (x_rev[n] + 1)*height - 1 // The last element of the data is at index: width*height - 1 // Thus it is sufficient to assert that x_rev[n] Option { if value == 0 || base == 0 { return None; } let mut current_exponent = 0; let mut current_value = value; while current_value % base == 0 { current_exponent += 1; current_value /= base; } if current_value == 1 { Some(current_exponent) } else { None } } // Sort of like reversing bits in radix4. We're not actually reversing bits, but the algorithm is exactly the same. // Radix4's bit reversal does divisions by 4, multiplications by 4, and modulo 4 - all of which are easily represented by bit manipulation. // As a result, it can be thought of as a bit reversal. But really, the "bit reversal"-ness of it is a special case of a more general "remainder reversal" // IE, it's repeatedly taking the remainder of dividing by N, and building a new number where those remainders are reversed. 
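// For example, with 2 reversal iterations in this base-3 version: 5 has base-3 digits 1,2
// (5 = 1 * 3 + 2), and reversing them gives 2,1 = 2 * 3 + 1 = 7, so reverse_bits(5, 2) == 7.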
// So this algorithm does all the things that bit reversal does, but replaces the multiplications by 4 with multiplications by 3, etc, and ends up with the same conceptual result as a bit reversal. pub fn reverse_bits(value: usize, reversal_iters: usize) -> usize { let mut result: usize = 0; let mut value = value; for _ in 0..reversal_iters { result = (result * 3) + (value % 3); value /= 3; } result } unsafe fn butterfly_3( data: &mut [Complex], twiddles: &[Complex], num_ffts: usize, butterfly3: &Butterfly3, ) { let mut idx = 0usize; let mut tw_idx = 0usize; let mut scratch = [Zero::zero(); 3]; for _ in 0..num_ffts { scratch[0] = *data.get_unchecked(idx); scratch[1] = *data.get_unchecked(idx + 1 * num_ffts) * twiddles[tw_idx]; scratch[2] = *data.get_unchecked(idx + 2 * num_ffts) * twiddles[tw_idx + 1]; butterfly3.perform_fft_butterfly(&mut scratch); *data.get_unchecked_mut(idx) = scratch[0]; *data.get_unchecked_mut(idx + 1 * num_ffts) = scratch[1]; *data.get_unchecked_mut(idx + 2 * num_ffts) = scratch[2]; tw_idx += 2; idx += 1; } } #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; #[test] fn test_radix3() { for pow in 0..8 { let len = 3usize.pow(pow); test_3adix3_with_length(len, FftDirection::Forward); test_3adix3_with_length(len, FftDirection::Inverse); } } fn test_3adix3_with_length(len: usize, direction: FftDirection) { let fft = Radix3::new(len, direction); check_fft_algorithm::(&fft, len, direction); } } rustfft-6.2.0/src/algorithm/radix4.rs000064400000000000000000000211230072674642500156730ustar 00000000000000use std::sync::Arc; use num_complex::Complex; use num_traits::Zero; use crate::algorithm::butterflies::{Butterfly1, Butterfly16, Butterfly2, Butterfly4, Butterfly8}; use crate::array_utils; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::{common::FftNum, twiddles, FftDirection}; use crate::{Direction, Fft, Length}; /// FFT algorithm optimized for power-of-two sizes /// /// ~~~ /// // Computes a forward FFT of size 4096 /// use rustfft::algorithm::Radix4; /// use rustfft::{Fft, FftDirection}; /// use rustfft::num_complex::Complex; /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 4096]; /// /// let fft = Radix4::new(4096, FftDirection::Forward); /// fft.process(&mut buffer); /// ~~~ pub struct Radix4 { twiddles: Box<[Complex]>, base_fft: Arc>, base_len: usize, len: usize, direction: FftDirection, } impl Radix4 { /// Preallocates necessary arrays and precomputes necessary data to efficiently compute the power-of-two FFT pub fn new(len: usize, direction: FftDirection) -> Self { assert!( len.is_power_of_two(), "Radix4 algorithm requires a power-of-two input size. Got {}", len ); // figure out which base length we're going to use let num_bits = len.trailing_zeros(); let (base_len, base_fft) = match num_bits { 0 => (len, Arc::new(Butterfly1::new(direction)) as Arc>), 1 => (len, Arc::new(Butterfly2::new(direction)) as Arc>), 2 => (len, Arc::new(Butterfly4::new(direction)) as Arc>), _ => { if num_bits % 2 == 1 { (8, Arc::new(Butterfly8::new(direction)) as Arc>) } else { (16, Arc::new(Butterfly16::new(direction)) as Arc>) } } }; // precompute the twiddle factors this algorithm will use. 
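// As a concrete example of the layout this builds (sizes picked only for illustration):
// for len = 256 with a size-16 base, the array holds 16 rows * 3 factors for the size-64
// cross-FFT step, followed by 64 rows * 3 factors for the size-256 step, 240 factors in total.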
// we're doing the same precomputation of twiddle factors as the mixed radix algorithm where width=4 and height=len/4 // but mixed radix only does one step and then calls itself recusrively, and this algorithm does every layer all the way down // so we're going to pack all the "layers" of twiddle factors into a single array, starting with the bottom layer and going up let mut twiddle_stride = len / (base_len * 4); let mut twiddle_factors = Vec::with_capacity(len * 2); while twiddle_stride > 0 { let num_rows = len / (twiddle_stride * 4); for i in 0..num_rows { for k in 1..4 { let twiddle = twiddles::compute_twiddle(i * k * twiddle_stride, len, direction); twiddle_factors.push(twiddle); } } twiddle_stride /= 4; } Self { twiddles: twiddle_factors.into_boxed_slice(), base_fft, base_len, len, direction, } } fn perform_fft_out_of_place( &self, signal: &[Complex], spectrum: &mut [Complex], _scratch: &mut [Complex], ) { // copy the data into the spectrum vector if self.len() == self.base_len { spectrum.copy_from_slice(signal); } else { bitreversed_transpose(self.base_len, signal, spectrum); } // Base-level FFTs self.base_fft.process_with_scratch(spectrum, &mut []); // cross-FFTs let mut current_size = self.base_len * 4; let mut layer_twiddles: &[Complex] = &self.twiddles; while current_size <= signal.len() { let num_rows = signal.len() / current_size; for i in 0..num_rows { unsafe { butterfly_4( &mut spectrum[i * current_size..], layer_twiddles, current_size / 4, self.direction, ) } } //skip past all the twiddle factors used in this layer let twiddle_offset = (current_size * 3) / 4; layer_twiddles = &layer_twiddles[twiddle_offset..]; current_size *= 4; } } } boilerplate_fft_oop!(Radix4, |this: &Radix4<_>| this.len); // Preparing for radix 4 is similar to a transpose, where the column index is bit reversed. // Use a lookup table to avoid repeating the slow bit reverse operations. // Unrolling the outer loop by a factor 4 helps speed things up. pub fn bitreversed_transpose(height: usize, input: &[T], output: &mut [T]) { let width = input.len() / height; let quarter_width = width / 4; let rev_digits = (width.trailing_zeros() / 2) as usize; // Let's make sure the arguments are ok assert!(input.len() == output.len()); for x in 0..quarter_width { let x0 = 4 * x; let x1 = 4 * x + 1; let x2 = 4 * x + 2; let x3 = 4 * x + 3; let x_rev = [ reverse_bits(x0, rev_digits), reverse_bits(x1, rev_digits), reverse_bits(x2, rev_digits), reverse_bits(x3, rev_digits), ]; // Assert that the the bit reversed indices will not exceed the length of the output. 
// The highest index the loop reaches is: (x_rev[n] + 1)*height - 1 // The last element of the data is at index: width*height - 1 // Thus it is sufficient to assert that x_rev[n] ghefcdab pub fn reverse_bits(value: usize, bitpairs: usize) -> usize { let mut result: usize = 0; let mut value = value; for _ in 0..bitpairs { result = (result << 2) + (value & 0x03); value = value >> 2; } result } unsafe fn butterfly_4( data: &mut [Complex], twiddles: &[Complex], num_ffts: usize, direction: FftDirection, ) { let butterfly4 = Butterfly4::new(direction); let mut idx = 0usize; let mut tw_idx = 0usize; let mut scratch = [Zero::zero(); 4]; for _ in 0..num_ffts { scratch[0] = *data.get_unchecked(idx); scratch[1] = *data.get_unchecked(idx + 1 * num_ffts) * twiddles[tw_idx]; scratch[2] = *data.get_unchecked(idx + 2 * num_ffts) * twiddles[tw_idx + 1]; scratch[3] = *data.get_unchecked(idx + 3 * num_ffts) * twiddles[tw_idx + 2]; butterfly4.perform_fft_butterfly(&mut scratch); *data.get_unchecked_mut(idx) = scratch[0]; *data.get_unchecked_mut(idx + 1 * num_ffts) = scratch[1]; *data.get_unchecked_mut(idx + 2 * num_ffts) = scratch[2]; *data.get_unchecked_mut(idx + 3 * num_ffts) = scratch[3]; tw_idx += 3; idx += 1; } } #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; #[test] fn test_radix4() { for pow in 1..12 { let len = 1 << pow; test_radix4_with_length(len, FftDirection::Forward); //test_radix4_with_length(len, FftDirection::Inverse); } } fn test_radix4_with_length(len: usize, direction: FftDirection) { let fft = Radix4::new(len, direction); check_fft_algorithm::(&fft, len, direction); } } rustfft-6.2.0/src/array_utils.rs000064400000000000000000000144170072674642500150600ustar 00000000000000use crate::Complex; use crate::FftNum; use std::ops::{Deref, DerefMut}; /// Given an array of size width * height, representing a flattened 2D array, /// transpose the rows and columns of that 2D array into the output /// benchmarking shows that loop tiling isn't effective for small arrays (in the range of 50x50 or smaller) pub unsafe fn transpose_small(width: usize, height: usize, input: &[T], output: &mut [T]) { for x in 0..width { for y in 0..height { let input_index = x + y * width; let output_index = y + x * height; *output.get_unchecked_mut(output_index) = *input.get_unchecked(input_index); } } } #[allow(unused)] pub unsafe fn workaround_transmute(slice: &[T]) -> &[U] { let ptr = slice.as_ptr() as *const U; let len = slice.len(); std::slice::from_raw_parts(ptr, len) } #[allow(unused)] pub unsafe fn workaround_transmute_mut(slice: &mut [T]) -> &mut [U] { let ptr = slice.as_mut_ptr() as *mut U; let len = slice.len(); std::slice::from_raw_parts_mut(ptr, len) } pub(crate) trait LoadStore: DerefMut { unsafe fn load(&self, idx: usize) -> Complex; unsafe fn store(&mut self, val: Complex, idx: usize); } impl LoadStore for &mut [Complex] { #[inline(always)] unsafe fn load(&self, idx: usize) -> Complex { debug_assert!(idx < self.len()); *self.get_unchecked(idx) } #[inline(always)] unsafe fn store(&mut self, val: Complex, idx: usize) { debug_assert!(idx < self.len()); *self.get_unchecked_mut(idx) = val; } } impl LoadStore for &mut [Complex; N] { #[inline(always)] unsafe fn load(&self, idx: usize) -> Complex { debug_assert!(idx < self.len()); *self.get_unchecked(idx) } #[inline(always)] unsafe fn store(&mut self, val: Complex, idx: usize) { debug_assert!(idx < self.len()); *self.get_unchecked_mut(idx) = val; } } pub(crate) struct DoubleBuf<'a, T> { pub input: &'a [Complex], pub output: 
&'a mut [Complex], } impl<'a, T> Deref for DoubleBuf<'a, T> { type Target = [Complex]; fn deref(&self) -> &Self::Target { self.input } } impl<'a, T> DerefMut for DoubleBuf<'a, T> { fn deref_mut(&mut self) -> &mut Self::Target { self.output } } impl<'a, T: FftNum> LoadStore for DoubleBuf<'a, T> { #[inline(always)] unsafe fn load(&self, idx: usize) -> Complex { debug_assert!(idx < self.input.len()); *self.input.get_unchecked(idx) } #[inline(always)] unsafe fn store(&mut self, val: Complex, idx: usize) { debug_assert!(idx < self.output.len()); *self.output.get_unchecked_mut(idx) = val; } } #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::random_signal; use num_complex::Complex; use num_traits::Zero; #[test] fn test_transpose() { let sizes: Vec = (1..16).collect(); for &width in &sizes { for &height in &sizes { let len = width * height; let input: Vec> = random_signal(len); let mut output = vec![Zero::zero(); len]; unsafe { transpose_small(width, height, &input, &mut output) }; for x in 0..width { for y in 0..height { assert_eq!( input[x + y * width], output[y + x * height], "x = {}, y = {}", x, y ); } } } } } } // Loop over exact chunks of the provided buffer. Very similar in semantics to ChunksExactMut, but generates smaller code and requires no modulo operations // Returns Ok() if every element ended up in a chunk, Err() if there was a remainder pub fn iter_chunks( mut buffer: &mut [T], chunk_size: usize, mut chunk_fn: impl FnMut(&mut [T]), ) -> Result<(), ()> { // Loop over the buffer, splicing off chunk_size at a time, and calling chunk_fn on each while buffer.len() >= chunk_size { let (head, tail) = buffer.split_at_mut(chunk_size); buffer = tail; chunk_fn(head); } // We have a remainder if there's data still in the buffer -- in which case we want to indicate to the caller that there was an unwanted remainder if buffer.len() == 0 { Ok(()) } else { Err(()) } } // Loop over exact zipped chunks of the 2 provided buffers. 
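// A quick illustration of iter_chunks' Ok/Err contract, using a made-up 6-element buffer
// (the zipped variant below behaves the same way, just over two buffers at once):
//
//     let mut data = [0.0f32; 6];
//     assert!(iter_chunks(&mut data, 3, |chunk| assert_eq!(chunk.len(), 3)).is_ok());
//     assert!(iter_chunks(&mut data, 4, |_| {}).is_err()); // 6 = 4 + 2, so 2 elements are left over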
Very similar in semantics to ChunksExactMut.zip(ChunksExactMut), but generates smaller code and requires no modulo operations // Returns Ok() if every element of both buffers ended up in a chunk, Err() if there was a remainder pub fn iter_chunks_zipped( mut buffer1: &mut [T], mut buffer2: &mut [T], chunk_size: usize, mut chunk_fn: impl FnMut(&mut [T], &mut [T]), ) -> Result<(), ()> { // If the two buffers aren't the same size, record the fact that they're different, then snip them to be the same size let uneven = if buffer1.len() > buffer2.len() { buffer1 = &mut buffer1[..buffer2.len()]; true } else if buffer2.len() < buffer1.len() { buffer2 = &mut buffer2[..buffer1.len()]; true } else { false }; // Now that we know the two slices are the same length, loop over each one, splicing off chunk_size at a time, and calling chunk_fn on each while buffer1.len() >= chunk_size && buffer2.len() >= chunk_size { let (head1, tail1) = buffer1.split_at_mut(chunk_size); buffer1 = tail1; let (head2, tail2) = buffer2.split_at_mut(chunk_size); buffer2 = tail2; chunk_fn(head1, head2); } // We have a remainder if the 2 chunks were uneven to start with, or if there's still data in the buffers -- in which case we want to indicate to the caller that there was an unwanted remainder if !uneven && buffer1.len() == 0 { Ok(()) } else { Err(()) } } rustfft-6.2.0/src/avx/avx32_butterflies.rs000064400000000000000000002447270072674642500167040ustar 00000000000000use std::arch::x86_64::*; use std::marker::PhantomData; use std::mem::MaybeUninit; use num_complex::Complex; use crate::array_utils; use crate::array_utils::workaround_transmute_mut; use crate::array_utils::DoubleBuf; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::{common::FftNum, twiddles}; use crate::{Direction, Fft, FftDirection, Length}; use super::avx32_utils; use super::avx_vector::{self, AvxArray}; use super::avx_vector::{AvxArrayMut, AvxVector, AvxVector128, AvxVector256, Rotation90}; // Safety: This macro will call `self::perform_fft_f32()` which probably has a #[target_feature(enable = "...")] annotation on it. // Calling functions with that annotation is unsafe, because it doesn't actually check if the CPU has the required features. // Callers of this macro must guarantee that users can't even obtain an instance of $struct_name if their CPU doesn't have the required CPU features. macro_rules! boilerplate_fft_simd_butterfly { ($struct_name:ident, $len:expr) => { impl $struct_name { #[inline] pub fn is_supported_by_cpu() -> bool { is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") } #[inline] pub fn new(direction: FftDirection) -> Result { if Self::is_supported_by_cpu() { // Safety: new_internal requires the "avx" feature set. 
Since we know it's present, we're safe Ok(unsafe { Self::new_with_avx(direction) }) } else { Err(()) } } } impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let result = array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why we have to transmute these slices let input_slice = workaround_transmute_mut(in_chunk); let output_slice = workaround_transmute_mut(out_chunk); self.perform_fft_f32(DoubleBuf { input: input_slice, output: output_slice, }); } }, ); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], _scratch: &mut [Complex]) { if buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let result = array_utils::iter_chunks(buffer, self.len(), |chunk| { unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why we have to transmute these slices self.perform_fft_f32(workaround_transmute_mut::<_, Complex>(chunk)); } }); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { 0 } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } // Safety: This macro will call `self::column_butterflies_and_transpose and self::row_butterflies()` which probably has a #[target_feature(enable = "...")] annotation on it. // Calling functions with that annotation is unsafe, because it doesn't actually check if the CPU has the required features. // Callers of this macro must guarantee that users can't even obtain an instance of $struct_name if their CPU doesn't have the required CPU features. macro_rules! 
boilerplate_fft_simd_butterfly_with_scratch { ($struct_name:ident, $len:expr) => { impl $struct_name { #[inline] pub fn new(direction: FftDirection) -> Result { let has_avx = is_x86_feature_detected!("avx"); let has_fma = is_x86_feature_detected!("fma"); if has_avx && has_fma { // Safety: new_internal requires the "avx" feature set. Since we know it's present, we're safe Ok(unsafe { Self::new_with_avx(direction) }) } else { Err(()) } } } impl $struct_name { #[inline] fn perform_fft_inplace( &self, buffer: &mut [Complex], scratch: &mut [Complex], ) { // Perform the column FFTs // Safety: self.perform_column_butterflies() requres the "avx" and "fma" instruction sets, and we return Err() in our constructor if the instructions aren't available unsafe { self.column_butterflies_and_transpose(buffer, scratch) }; // process the row FFTs, and copy from the scratch back to the buffer as we go // Safety: self.transpose() requres the "avx" instruction set, and we return Err() in our constructor if the instructions aren't available unsafe { self.row_butterflies(DoubleBuf { input: scratch, output: buffer, }) }; } #[inline] fn perform_fft_out_of_place( &self, input: &mut [Complex], output: &mut [Complex], ) { // Perform the column FFTs // Safety: self.perform_column_butterflies() requres the "avx" and "fma" instruction sets, and we return Err() in our constructor if the instructions aren't available unsafe { self.column_butterflies_and_transpose(input, output) }; // process the row FFTs in-place in the output buffer // Safety: self.transpose() requres the "avx" instruction set, and we return Err() in our constructor if the instructions aren't available unsafe { self.row_butterflies(output) }; } } impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_input: &mut [Complex] = unsafe { array_utils::workaround_transmute_mut(input) }; let transmuted_output: &mut [Complex] = unsafe { array_utils::workaround_transmute_mut(output) }; let result = array_utils::iter_chunks_zipped( transmuted_input, transmuted_output, self.len(), |in_chunk, out_chunk| self.perform_fft_out_of_place(in_chunk, out_chunk), ); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { let required_scratch = self.len(); if scratch.len() < required_scratch || buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), self.len(), scratch.len()); return; // Unreachable, 
because fft_error_inplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_buffer: &mut [Complex] = unsafe { array_utils::workaround_transmute_mut(buffer) }; let transmuted_scratch: &mut [Complex] = unsafe { array_utils::workaround_transmute_mut(scratch) }; let result = array_utils::iter_chunks(transmuted_buffer, self.len(), |chunk| { self.perform_fft_inplace(chunk, transmuted_scratch) }); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), self.len(), scratch.len()); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { $len } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } macro_rules! gen_butterfly_twiddles_interleaved_columns { ($num_rows:expr, $num_cols:expr, $skip_cols:expr, $direction: expr) => {{ const FFT_LEN: usize = $num_rows * $num_cols; const TWIDDLE_ROWS: usize = $num_rows - 1; const TWIDDLE_COLS: usize = $num_cols - $skip_cols; const TWIDDLE_VECTOR_COLS: usize = TWIDDLE_COLS / 4; const TWIDDLE_VECTOR_COUNT: usize = TWIDDLE_VECTOR_COLS * TWIDDLE_ROWS; let mut twiddles = [AvxVector::zero(); TWIDDLE_VECTOR_COUNT]; for index in 0..TWIDDLE_VECTOR_COUNT { let y = (index / TWIDDLE_VECTOR_COLS) + 1; let x = (index % TWIDDLE_VECTOR_COLS) * 4 + $skip_cols; twiddles[index] = AvxVector::make_mixedradix_twiddle_chunk(x, y, FFT_LEN, $direction); } twiddles }}; } macro_rules! 
gen_butterfly_twiddles_separated_columns { ($num_rows:expr, $num_cols:expr, $skip_cols:expr, $direction: expr) => {{ const FFT_LEN: usize = $num_rows * $num_cols; const TWIDDLE_ROWS: usize = $num_rows - 1; const TWIDDLE_COLS: usize = $num_cols - $skip_cols; const TWIDDLE_VECTOR_COLS: usize = TWIDDLE_COLS / 4; const TWIDDLE_VECTOR_COUNT: usize = TWIDDLE_VECTOR_COLS * TWIDDLE_ROWS; let mut twiddles = [AvxVector::zero(); TWIDDLE_VECTOR_COUNT]; for index in 0..TWIDDLE_VECTOR_COUNT { let y = (index % TWIDDLE_ROWS) + 1; let x = (index / TWIDDLE_ROWS) * 4 + $skip_cols; twiddles[index] = AvxVector::make_mixedradix_twiddle_chunk(x, y, FFT_LEN, $direction); } twiddles }}; } pub struct Butterfly5Avx { twiddles: [__m128; 3], direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly5Avx, 5); impl Butterfly5Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { let twiddle1 = twiddles::compute_twiddle(1, 5, direction); let twiddle2 = twiddles::compute_twiddle(2, 5, direction); Self { twiddles: [ _mm_set_ps(twiddle1.im, twiddle1.im, twiddle1.re, twiddle1.re), _mm_set_ps(twiddle2.im, twiddle2.im, twiddle2.re, twiddle2.re), _mm_set_ps(-twiddle1.im, -twiddle1.im, twiddle1.re, twiddle1.re), ], direction, _phantom_t: PhantomData, } } } impl Butterfly5Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { let input0 = _mm_castpd_ps(_mm_load1_pd(buffer.input_ptr() as *const f64)); // load the first element of the input, and duplicate it into both complex number slots of input0 let input12 = buffer.load_partial2_complex(1); let input34 = buffer.load_partial2_complex(3); // swap elements for inputs 3 and 4 let input43 = AvxVector::reverse_complex_elements(input34); // do some prep work before we can start applying twiddle factors let [sum12, diff43] = AvxVector::column_butterfly2([input12, input43]); let rotation = AvxVector::make_rotation90(FftDirection::Inverse); let rotated43 = AvxVector::rotate90(diff43, rotation); let [mid14, mid23] = AvxVector::unpack_complex([sum12, rotated43]); // to compute the first output, compute the sum of all elements. 
mid14[0] and mid23[0] already have the sum of 1+4 and 2+3 respectively, so if we add them, we'll get the sum of all 4 let sum1234 = AvxVector::add(mid14, mid23); let output0 = AvxVector::add(input0, sum1234); // apply twiddle factors let twiddled14_mid = AvxVector::mul(mid14, self.twiddles[0]); let twiddled23_mid = AvxVector::mul(mid14, self.twiddles[1]); let twiddled14 = AvxVector::fmadd(mid23, self.twiddles[1], twiddled14_mid); let twiddled23 = AvxVector::fmadd(mid23, self.twiddles[2], twiddled23_mid); // unpack the data for the last butterfly 2 let [twiddled12, twiddled43] = AvxVector::unpack_complex([twiddled14, twiddled23]); let [output12, output43] = AvxVector::column_butterfly2([twiddled12, twiddled43]); // swap the elements in output43 before writing them out, and add the first input to everything let final12 = AvxVector::add(input0, output12); let output34 = AvxVector::reverse_complex_elements(output43); let final34 = AvxVector::add(input0, output34); buffer.store_partial1_complex(output0, 0); buffer.store_partial2_complex(final12, 1); buffer.store_partial2_complex(final34, 3); } } pub struct Butterfly7Avx { twiddles: [__m128; 5], direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly7Avx, 7); impl Butterfly7Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { let twiddle1 = twiddles::compute_twiddle(1, 7, direction); let twiddle2 = twiddles::compute_twiddle(2, 7, direction); let twiddle3 = twiddles::compute_twiddle(3, 7, direction); Self { twiddles: [ _mm_set_ps(twiddle1.im, twiddle1.im, twiddle1.re, twiddle1.re), _mm_set_ps(twiddle2.im, twiddle2.im, twiddle2.re, twiddle2.re), _mm_set_ps(twiddle3.im, twiddle3.im, twiddle3.re, twiddle3.re), _mm_set_ps(-twiddle3.im, -twiddle3.im, twiddle3.re, twiddle3.re), _mm_set_ps(-twiddle1.im, -twiddle1.im, twiddle1.re, twiddle1.re), ], direction, _phantom_t: PhantomData, } } } impl Butterfly7Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { // load the first element of the input, and duplicate it into both complex number slots of input0 let input0 = _mm_castpd_ps(_mm_load1_pd(buffer.input_ptr() as *const f64)); // we want to load 3 elements into 123 and 3 elements into 456, but we can only load 4, so we're going to do slightly overlapping reads here // we have to reverse 456 immediately after loading, and that'll be easiest if we load the 456 into the latter 3 slots of the register, rather than the front 3 slots // as a bonus, that also means we don't need masked reads or anything let input123 = buffer.load_complex(1); let input456 = buffer.load_complex(3); // reverse the order of input456 let input654 = AvxVector::reverse_complex_elements(input456); // do some prep work before we can start applying twiddle factors let [sum123, diff654] = AvxVector::column_butterfly2([input123, input654]); let rotation = AvxVector::make_rotation90(FftDirection::Inverse); let rotated654 = AvxVector::rotate90(diff654, rotation); let [mid1634, mid25] = AvxVector::unpack_complex([sum123, rotated654]); let mid16 = mid1634.lo(); let mid25 = mid25.lo(); let mid34 = mid1634.hi(); // to compute the first output, compute the sum of all elements. 
mid16[0], mid25[0], and mid34[0] already have the sum of 1+6, 2+5 and 3+4 respectively, so if we add them, we'll get 1+2+3+4+5+6 let output0_left = AvxVector::add(mid16, mid25); let output0_right = AvxVector::add(input0, mid34); let output0 = AvxVector::add(output0_left, output0_right); buffer.store_partial1_complex(output0, 0); _mm256_zeroupper(); // apply twiddle factors let twiddled16_intermediate1 = AvxVector::mul(mid16, self.twiddles[0]); let twiddled25_intermediate1 = AvxVector::mul(mid16, self.twiddles[1]); let twiddled34_intermediate1 = AvxVector::mul(mid16, self.twiddles[2]); let twiddled16_intermediate2 = AvxVector::fmadd(mid25, self.twiddles[1], twiddled16_intermediate1); let twiddled25_intermediate2 = AvxVector::fmadd(mid25, self.twiddles[3], twiddled25_intermediate1); let twiddled34_intermediate2 = AvxVector::fmadd(mid25, self.twiddles[4], twiddled34_intermediate1); let twiddled16 = AvxVector::fmadd(mid34, self.twiddles[2], twiddled16_intermediate2); let twiddled25 = AvxVector::fmadd(mid34, self.twiddles[4], twiddled25_intermediate2); let twiddled34 = AvxVector::fmadd(mid34, self.twiddles[1], twiddled34_intermediate2); // unpack the data for the last butterfly 2 let [twiddled12, twiddled65] = AvxVector::unpack_complex([twiddled16, twiddled25]); let [twiddled33, twiddled44] = AvxVector::unpack_complex([twiddled34, twiddled34]); // we can save one add if we add input0 to twiddled33 now. normally we'd add input0 to the final output, but the arrangement of data makes that a little awkward let twiddled033 = AvxVector::add(twiddled33, input0); let [output12, output65] = AvxVector::column_butterfly2([twiddled12, twiddled65]); let [output033, output044] = AvxVector::column_butterfly2([twiddled033, twiddled44]); let output56 = AvxVector::reverse_complex_elements(output65); buffer.store_partial2_complex(AvxVector::add(output12, input0), 1); buffer.store_partial1_complex(output033, 3); buffer.store_partial1_complex(output044, 4); buffer.store_partial2_complex(AvxVector::add(output56, input0), 5); } } pub struct Butterfly11Avx { twiddles: [__m256; 10], twiddle_lo_4: __m128, twiddle_lo_9: __m128, twiddle_lo_3: __m128, twiddle_lo_8: __m128, twiddle_lo_2: __m128, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly11Avx, 11); impl Butterfly11Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { let twiddle1 = twiddles::compute_twiddle(1, 11, direction); let twiddle2 = twiddles::compute_twiddle(2, 11, direction); let twiddle3 = twiddles::compute_twiddle(3, 11, direction); let twiddle4 = twiddles::compute_twiddle(4, 11, direction); let twiddle5 = twiddles::compute_twiddle(5, 11, direction); let twiddles_lo = [ _mm_set_ps(twiddle1.im, twiddle1.im, twiddle1.re, twiddle1.re), _mm_set_ps(twiddle2.im, twiddle2.im, twiddle2.re, twiddle2.re), _mm_set_ps(twiddle3.im, twiddle3.im, twiddle3.re, twiddle3.re), _mm_set_ps(twiddle4.im, twiddle4.im, twiddle4.re, twiddle4.re), _mm_set_ps(twiddle5.im, twiddle5.im, twiddle5.re, twiddle5.re), _mm_set_ps(-twiddle5.im, -twiddle5.im, twiddle5.re, twiddle5.re), _mm_set_ps(-twiddle4.im, -twiddle4.im, twiddle4.re, twiddle4.re), _mm_set_ps(-twiddle3.im, -twiddle3.im, twiddle3.re, twiddle3.re), _mm_set_ps(-twiddle2.im, -twiddle2.im, twiddle2.re, twiddle2.re), _mm_set_ps(-twiddle1.im, -twiddle1.im, twiddle1.re, twiddle1.re), ]; Self { twiddles: [ AvxVector256::merge(twiddles_lo[0], twiddles_lo[2]), AvxVector256::merge(twiddles_lo[1], twiddles_lo[3]), 
AvxVector256::merge(twiddles_lo[1], twiddles_lo[5]), AvxVector256::merge(twiddles_lo[3], twiddles_lo[7]), AvxVector256::merge(twiddles_lo[2], twiddles_lo[8]), AvxVector256::merge(twiddles_lo[5], twiddles_lo[0]), AvxVector256::merge(twiddles_lo[3], twiddles_lo[0]), AvxVector256::merge(twiddles_lo[7], twiddles_lo[4]), AvxVector256::merge(twiddles_lo[4], twiddles_lo[3]), AvxVector256::merge(twiddles_lo[9], twiddles_lo[8]), ], twiddle_lo_4: twiddles_lo[4], twiddle_lo_9: twiddles_lo[9], twiddle_lo_3: twiddles_lo[3], twiddle_lo_8: twiddles_lo[8], twiddle_lo_2: twiddles_lo[2], direction, _phantom_t: PhantomData, } } } impl Butterfly11Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { let input0 = _mm_castpd_ps(_mm_load1_pd(buffer.input_ptr() as *const f64)); // load the first element of the input, and duplicate it into both complex number slots of input0 let input1234 = buffer.load_complex(1); let input56 = buffer.load_partial2_complex(5); let input78910 = buffer.load_complex(7); // reverse the order of input78910, and separate let [input55, input66] = AvxVector::unpack_complex([input56, input56]); let input10987 = AvxVector::reverse_complex_elements(input78910); // do some initial butterflies and rotations let [sum1234, diff10987] = AvxVector::column_butterfly2([input1234, input10987]); let [sum55, diff66] = AvxVector::column_butterfly2([input55, input66]); let rotation = AvxVector::make_rotation90(FftDirection::Inverse); let rotated10987 = AvxVector::rotate90(diff10987, rotation); let rotated66 = AvxVector::rotate90(diff66, rotation.lo()); // arrange the data into the format to apply twiddles let [mid11038, mid2947] = AvxVector::unpack_complex([sum1234, rotated10987]); let mid110: __m256 = AvxVector256::merge(mid11038.lo(), mid11038.lo()); let mid29: __m256 = AvxVector256::merge(mid2947.lo(), mid2947.lo()); let mid38: __m256 = AvxVector256::merge(mid11038.hi(), mid11038.hi()); let mid47: __m256 = AvxVector256::merge(mid2947.hi(), mid2947.hi()); let mid56 = AvxVector::unpacklo_complex([sum55, rotated66]); let mid56: __m256 = AvxVector256::merge(mid56, mid56); // to compute the first output, compute the sum of all elements. 
mid16[0], mid25[0], and mid34[0] already have the sum of 1+6, 2+5 and 3+4 respectively, so if we add them, we'll get 1+2+3+4+5+6 let mid12910 = AvxVector::add(mid110.lo(), mid29.lo()); let mid3478 = AvxVector::add(mid38.lo(), mid47.lo()); let output0_left = AvxVector::add(input0, mid56.lo()); let output0_right = AvxVector::add(mid12910, mid3478); let output0 = AvxVector::add(output0_left, output0_right); buffer.store_partial1_complex(output0, 0); // we need to add the first input to each of our 5 twiddles values -- but right now, input0 is duplicated into both slots // but we only want to add it once, so zero the second element let zero = _mm_setzero_pd(); let input0 = _mm_castpd_ps(_mm_move_sd(zero, _mm_castps_pd(input0))); let input0 = AvxVector256::merge(input0, input0); // apply twiddle factors let twiddled11038 = AvxVector::fmadd(mid110, self.twiddles[0], input0); let twiddled2947 = AvxVector::fmadd(mid110, self.twiddles[1], input0); let twiddled56 = AvxVector::fmadd(mid110.lo(), self.twiddle_lo_4, input0.lo()); let twiddled11038 = AvxVector::fmadd(mid29, self.twiddles[2], twiddled11038); let twiddled2947 = AvxVector::fmadd(mid29, self.twiddles[3], twiddled2947); let twiddled56 = AvxVector::fmadd(mid29.lo(), self.twiddle_lo_9, twiddled56); let twiddled11038 = AvxVector::fmadd(mid38, self.twiddles[4], twiddled11038); let twiddled2947 = AvxVector::fmadd(mid38, self.twiddles[5], twiddled2947); let twiddled56 = AvxVector::fmadd(mid38.lo(), self.twiddle_lo_3, twiddled56); let twiddled11038 = AvxVector::fmadd(mid47, self.twiddles[6], twiddled11038); let twiddled2947 = AvxVector::fmadd(mid47, self.twiddles[7], twiddled2947); let twiddled56 = AvxVector::fmadd(mid47.lo(), self.twiddle_lo_8, twiddled56); let twiddled11038 = AvxVector::fmadd(mid56, self.twiddles[8], twiddled11038); let twiddled2947 = AvxVector::fmadd(mid56, self.twiddles[9], twiddled2947); let twiddled56 = AvxVector::fmadd(mid56.lo(), self.twiddle_lo_2, twiddled56); // unpack the data for the last butterfly 2 let [twiddled1234, twiddled10987] = AvxVector::unpack_complex([twiddled11038, twiddled2947]); let [twiddled55, twiddled66] = AvxVector::unpack_complex([twiddled56, twiddled56]); let [output1234, output10987] = AvxVector::column_butterfly2([twiddled1234, twiddled10987]); let [output55, output66] = AvxVector::column_butterfly2([twiddled55, twiddled66]); let output78910 = AvxVector::reverse_complex_elements(output10987); buffer.store_complex(output1234, 1); buffer.store_partial1_complex(output55, 5); buffer.store_partial1_complex(output66, 6); buffer.store_complex(output78910, 7); } } pub struct Butterfly8Avx { twiddles: __m256, twiddles_butterfly4: __m256, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly8Avx, 8); impl Butterfly8Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: AvxVector::make_mixedradix_twiddle_chunk(0, 1, 8, direction), twiddles_butterfly4: match direction { FftDirection::Forward => [ Complex::new(0.0f32, 0.0), Complex::new(0.0, -0.0), Complex::new(0.0, 0.0), Complex::new(0.0, -0.0), ] .as_slice() .load_complex(0), FftDirection::Inverse => [ Complex::new(0.0f32, 0.0), Complex::new(-0.0, 0.0), Complex::new(0.0, 0.0), Complex::new(-0.0, 0.0), ] .as_slice() .load_complex(0), }, direction, _phantom_t: PhantomData, } } } impl Butterfly8Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { let row0 = buffer.load_complex(0); let 
row1 = buffer.load_complex(4); // Do our butterfly 2's down the columns let [intermediate0, intermediate1_pretwiddle] = AvxVector::column_butterfly2([row0, row1]); // Apply the size-8 twiddle factors let intermediate1 = AvxVector::mul_complex(intermediate1_pretwiddle, self.twiddles); // Rearrange the data before we do our butterfly 4s. This swaps the last 2 elements of butterfly0 with the first two elements of butterfly1 // The result is that we can then do a 4x butterfly 2, apply twiddles, use unpack instructions to transpose to the final output, then do another 4x butterfly 2 let permuted0 = _mm256_permute2f128_ps(intermediate0, intermediate1, 0x20); let permuted1 = _mm256_permute2f128_ps(intermediate0, intermediate1, 0x31); // Do the first set of butterfly 2's let [postbutterfly0, postbutterfly1_pretwiddle] = AvxVector::column_butterfly2([permuted0, permuted1]); // Which negative we blend in depends on whether we're forward or direction // Our goal is to swap the reals with the imaginaries, then negate either the reals or the imaginaries, based on whether we're an direction or not // but we can't use the AvxVector swap_complex_components function, because we only want to swap the odd reals with the odd imaginaries let elements_swapped = _mm256_permute_ps(postbutterfly1_pretwiddle, 0xB4); // We can negate the elements we want by xoring the row with a pre-set vector let postbutterfly1 = AvxVector::xor(elements_swapped, self.twiddles_butterfly4); // use unpack instructions to transpose, and to prepare for the final butterfly 2's let unpermuted0 = _mm256_permute2f128_ps(postbutterfly0, postbutterfly1, 0x20); let unpermuted1 = _mm256_permute2f128_ps(postbutterfly0, postbutterfly1, 0x31); let unpacked = AvxVector::unpack_complex([unpermuted0, unpermuted1]); let [output0, output1] = AvxVector::column_butterfly2(unpacked); buffer.store_complex(output0, 0); buffer.store_complex(output1, 4); } } pub struct Butterfly9Avx { twiddles: __m256, twiddles_butterfly3: __m256, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly9Avx, 9); impl Butterfly9Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { let twiddles: [Complex; 4] = [ twiddles::compute_twiddle(1, 9, direction), twiddles::compute_twiddle(2, 9, direction), twiddles::compute_twiddle(2, 9, direction), twiddles::compute_twiddle(4, 9, direction), ]; Self { twiddles: twiddles.as_slice().load_complex(0), twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), direction, _phantom_t: PhantomData, } } } impl Butterfly9Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { // we're going to load these elements in a peculiar way. instead of loading a row into the first 3 element of each register and leaving the last element empty // we're leaving the first element empty and putting the data in the last 3 elements. this will let us do 3 total complex multiplies instead of 4. let input0_lo = _mm_castpd_ps(_mm_load1_pd(buffer.input_ptr() as *const f64)); let input0_hi = buffer.load_partial2_complex(1); let input0 = AvxVector256::merge(input0_lo, input0_hi); let input1 = buffer.load_complex(2); let input2 = buffer.load_complex(5); // We're going to treat our input as a 3x3 2d array. First, do 3 butterfly 3's down the columns of that array. 
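        // As a point of reference, a scalar sketch of the same 3x3 mixed-radix plan this routine
        // vectorizes (`m` is the row-major 3x3 view of the 9 inputs; `fft3_column` and
        // `transpose_3x3` are hypothetical helpers, and the output reordering that the AVX code
        // folds into its stores is ignored here):
        //
        //     for col in 0..3 {
        //         fft3_column(&mut m, col);                 // 3-point FFT down column `col`
        //     }
        //     for row in 1..3 {
        //         for col in 1..3 {
        //             m[row][col] = m[row][col] * twiddles::compute_twiddle(row * col, 9, direction);
        //         }
        //     }
        //     transpose_3x3(&mut m);                        // rows become columns
        //     for col in 0..3 {
        //         fft3_column(&mut m, col);                 // 3-point FFT down each new column
        //     }
        //
        // This is also why only the twiddles for (row, col) in {1, 2} x {1, 2} -- i.e. exponents
        // 1, 2, 2, 4 out of 9 -- are precomputed in new_with_avx() above.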
let [mid0, mid1, mid2] = AvxVector::column_butterfly3([input0, input1, input2], self.twiddles_butterfly3); // merge the twiddle-able data into a single avx vector let twiddle_data = _mm256_permute2f128_ps(mid1, mid2, 0x31); let twiddled = AvxVector::mul_complex(twiddle_data, self.twiddles); // Transpose our 3x3 array. We could use the 4x4 transpose with an empty bottom row, which would result in an empty last column // but it turns out that it'll make our packing process later simpler if we duplicate the second row into the last row // which will result in duplicating the second column into the last column after the transpose let permute0 = _mm256_permute2f128_ps(mid0, mid2, 0x20); let permute1 = _mm256_permute2f128_ps(mid1, mid1, 0x20); let permute2 = _mm256_permute2f128_ps(mid0, twiddled, 0x31); let permute3 = _mm256_permute2f128_ps(twiddled, twiddled, 0x20); let transposed0 = AvxVector::unpackhi_complex([permute0, permute1]); let [transposed1, transposed2] = AvxVector::unpack_complex([permute2, permute3]); // more size 3 buterflies let output_rows = AvxVector::column_butterfly3( [transposed0, transposed1, transposed2], self.twiddles_butterfly3, ); // the elements of row 1 are in pretty much the worst possible order, thankfully we can fix that with just a couple instructions let swapped1 = _mm256_permute_ps(output_rows[1], 0x4E); // swap even and odd complex numbers let packed1 = _mm256_permute2f128_ps(swapped1, output_rows[2], 0x21); buffer.store_complex(packed1, 4); // merge just the high element of swapped_lo into the high element of row 0 let zero_swapped1_lo = AvxVector256::merge(AvxVector::zero(), swapped1.lo()); let packed0 = _mm256_blend_ps(output_rows[0], zero_swapped1_lo, 0xC0); buffer.store_complex(packed0, 0); // The last element can just be written on its own buffer.store_partial1_complex(output_rows[2].hi(), 8); } } pub struct Butterfly12Avx { twiddles: [__m256; 2], twiddles_butterfly3: __m256, twiddles_butterfly4: Rotation90<__m256>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly12Avx, 12); impl Butterfly12Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { let twiddles = [ Complex { re: 1.0f32, im: 0.0, }, Complex { re: 1.0, im: 0.0 }, twiddles::compute_twiddle(2, 12, direction), twiddles::compute_twiddle(4, 12, direction), // note that these twiddles are deliberately in a weird order, see perform_fft_f32 for why twiddles::compute_twiddle(1, 12, direction), twiddles::compute_twiddle(2, 12, direction), twiddles::compute_twiddle(3, 12, direction), twiddles::compute_twiddle(6, 12, direction), ]; Self { twiddles: [ twiddles.as_slice().load_complex(0), twiddles.as_slice().load_complex(4), ], twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly12Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { // we're going to load these elements in a peculiar way. instead of loading a row into the first 3 element of each register and leaving the last element empty // we're leaving the first element empty and putting the data in the last 3 elements. this will save us a complex multiply. // for everything but the first element, we can do overlapping reads. 
for the first element, an "overlapping read" would have us reading from index -1, so instead we have to shuffle some data around let input0_lo = _mm_castpd_ps(_mm_load1_pd(buffer.input_ptr() as *const f64)); let input0_hi = buffer.load_partial2_complex(1); let input_rows = [ AvxVector256::merge(input0_lo, input0_hi), buffer.load_complex(2), buffer.load_complex(5), buffer.load_complex(8), ]; // 3 butterfly 4's down the columns let mut mid = AvxVector::column_butterfly4(input_rows, self.twiddles_butterfly4); // Multiply in our twiddle factors. mid2 will be normal, but for mid1 and mid3, we're going to merge the twiddle-able parts into a single vector, // and do a single complex multiply on it. this transformation saves a complex multiply and costs nothing, // because we needthe second halves of mid1 and mid3 in a single vector for the transpose afterward anyways, so we would have done this permute2f128 operation either way mid[2] = AvxVector::mul_complex(mid[2], self.twiddles[0]); let merged_mid13 = _mm256_permute2f128_ps(mid[1], mid[3], 0x31); let twiddled13 = AvxVector::mul_complex(self.twiddles[1], merged_mid13); // Transpose our 3x4 array into a 4x3. we're doing a custom transpose here because we have to re-distribute the merged twiddled23 back out, and we can roll that into the transpose to make it free let transposed = { let permute0 = _mm256_permute2f128_ps(mid[0], mid[2], 0x20); let permute1 = _mm256_permute2f128_ps(mid[1], mid[3], 0x20); let permute2 = _mm256_permute2f128_ps(mid[0], mid[2], 0x31); let permute3 = twiddled13; // normally we'd need to do a permute here, but we can skip it because we already did it for twiddle factors let unpacked1 = AvxVector::unpackhi_complex([permute0, permute1]); let [unpacked2, unpacked3] = AvxVector::unpack_complex([permute2, permute3]); [unpacked1, unpacked2, unpacked3] }; // Do 4 butterfly 3's down the columns of our transposed array let output_rows = AvxVector::column_butterfly3(transposed, self.twiddles_butterfly3); buffer.store_complex(output_rows[0], 0); buffer.store_complex(output_rows[1], 4); buffer.store_complex(output_rows[2], 8); } } pub struct Butterfly16Avx { twiddles: [__m256; 3], twiddles_butterfly4: Rotation90<__m256>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly16Avx, 16); impl Butterfly16Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(4, 4, 0, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly16Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { // Manually unrolling this loop because writing a "for r in 0..4" loop results in slow codegen that makes the whole thing take 1.5x longer :( let rows = [ buffer.load_complex(0), buffer.load_complex(4), buffer.load_complex(8), buffer.load_complex(12), ]; // We're going to treat our input as a 4x4 2d array. First, do 4 butterfly 4's down the columns of that array. 
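        // To make the twiddle step below concrete: assuming make_mixedradix_twiddle_chunk(x, y, len, ..)
        // packs the four consecutive column twiddles w_len^(y*x) .. w_len^(y*(x+3)) into one vector,
        // the three vectors generated by gen_butterfly_twiddles_interleaved_columns!(4, 4, 0, ..) hold
        //
        //     twiddles[0] (row 1): exponents 0, 1, 2, 3
        //     twiddles[1] (row 2): exponents 0, 2, 4, 6
        //     twiddles[2] (row 3): exponents 0, 3, 6, 9
        //
        // so element (row r, column c) of the 4x4 array is scaled by w_16^(r*c), exactly as in the
        // scalar mixed-radix algorithm; row 0 needs no twiddles because w_16^0 == 1.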
let mut mid = AvxVector::column_butterfly4(rows, self.twiddles_butterfly4); // apply twiddle factors for r in 1..4 { mid[r] = AvxVector::mul_complex(mid[r], self.twiddles[r - 1]); } // Transpose our 4x4 array let transposed = avx32_utils::transpose_4x4_f32(mid); // Do 4 butterfly 4's down the columns of our transposed array let output_rows = AvxVector::column_butterfly4(transposed, self.twiddles_butterfly4); // Manually unrolling this loop because writing a "for r in 0..4" loop results in slow codegen that makes the whole thing take 1.5x longer :( buffer.store_complex(output_rows[0], 0); buffer.store_complex(output_rows[1], 4); buffer.store_complex(output_rows[2], 8); buffer.store_complex(output_rows[3], 12); } } pub struct Butterfly24Avx { twiddles: [__m256; 5], twiddles_butterfly3: __m256, twiddles_butterfly4: Rotation90<__m256>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly24Avx, 24); impl Butterfly24Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(6, 4, 0, direction), twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly24Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { // Manually unrolling this loop because writing a "for r in 0..6" loop results in slow codegen that makes the whole thing take 1.5x longer :( let rows = [ buffer.load_complex(0), buffer.load_complex(4), buffer.load_complex(8), buffer.load_complex(12), buffer.load_complex(16), buffer.load_complex(20), ]; // We're going to treat our input as a 4x6 2d array. First, do 4 butterfly 6's down the columns of that array. let mut mid = AvxVector256::column_butterfly6(rows, self.twiddles_butterfly3); // apply twiddle factors for r in 1..6 { mid[r] = AvxVector::mul_complex(mid[r], self.twiddles[r - 1]); } // Transpose our 6x4 array into a 4x6. let (transposed0, transposed1) = avx32_utils::transpose_4x6_to_6x4_f32(mid); // Do 6 butterfly 4's down the columns of our transposed array let output0 = AvxVector::column_butterfly4(transposed0, self.twiddles_butterfly4); let output1 = AvxVector::column_butterfly4(transposed1, self.twiddles_butterfly4); // the upper two elements of output1 are empty, so only store half the data for it for r in 0..4 { buffer.store_complex(output0[r], 6 * r); buffer.store_partial2_complex(output1[r].lo(), r * 6 + 4); } } } pub struct Butterfly27Avx { twiddles: [__m256; 4], twiddles_butterfly9: [__m256; 3], twiddles_butterfly3: __m256, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly27Avx, 27); impl Butterfly27Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(3, 9, 1, direction), twiddles_butterfly9: [ AvxVector::broadcast_twiddle(1, 9, direction), AvxVector::broadcast_twiddle(2, 9, direction), AvxVector::broadcast_twiddle(4, 9, direction), ], twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), direction, _phantom_t: PhantomData, } } } impl Butterfly27Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { // we're going to load our data in a peculiar way. 
we're going to load the first column on its own as a column of __m128. // it's faster to just load the first 2 columns into these m128s than trying to worry about masks, etc, so the second column will piggyback along and we just won't use it let mut rows0 = [AvxVector::zero(); 3]; let mut rows1 = [AvxVector::zero(); 3]; let mut rows2 = [AvxVector::zero(); 3]; for r in 0..3 { rows0[r] = buffer.load_partial2_complex(r * 9); rows1[r] = buffer.load_complex(r * 9 + 1); rows2[r] = buffer.load_complex(r * 9 + 5); } // butterfly 3s down the columns let mid0 = AvxVector::column_butterfly3(rows0, self.twiddles_butterfly3.lo()); let mut mid1 = AvxVector::column_butterfly3(rows1, self.twiddles_butterfly3); let mut mid2 = AvxVector::column_butterfly3(rows2, self.twiddles_butterfly3); // apply twiddle factors mid1[1] = AvxVector::mul_complex(mid1[1], self.twiddles[0]); mid2[1] = AvxVector::mul_complex(mid2[1], self.twiddles[1]); mid1[2] = AvxVector::mul_complex(mid1[2], self.twiddles[2]); mid2[2] = AvxVector::mul_complex(mid2[2], self.twiddles[3]); // transpose 9x3 to 3x9. this will be a little awkward because of rows0 containing garbage data, so use a transpose function that knows to ignore it let transposed = avx32_utils::transpose_9x3_to_3x9_emptycolumn1_f32(mid0, mid1, mid2); // butterfly 9s down the rows let output_rows = AvxVector256::column_butterfly9( transposed, self.twiddles_butterfly9, self.twiddles_butterfly3, ); // Our last column is empty, so it's a bit awkward to write out to memory. We could pack it in fewer vectors, but benchmarking shows it's simpler and just as fast to just brute-force it with partial writes buffer.store_partial3_complex(output_rows[0], 0); buffer.store_partial3_complex(output_rows[1], 3); buffer.store_partial3_complex(output_rows[2], 6); buffer.store_partial3_complex(output_rows[3], 9); buffer.store_partial3_complex(output_rows[4], 12); buffer.store_partial3_complex(output_rows[5], 15); buffer.store_partial3_complex(output_rows[6], 18); buffer.store_partial3_complex(output_rows[7], 21); buffer.store_partial3_complex(output_rows[8], 24); } } pub struct Butterfly32Avx { twiddles: [__m256; 6], twiddles_butterfly4: Rotation90<__m256>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly32Avx, 32); impl Butterfly32Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(4, 8, 0, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly32Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { let mut rows0 = [AvxVector::zero(); 4]; let mut rows1 = [AvxVector::zero(); 4]; for r in 0..4 { rows0[r] = buffer.load_complex(8 * r); rows1[r] = buffer.load_complex(8 * r + 4); } // We're going to treat our input as a 8x4 2d array. First, do 8 butterfly 4's down the columns of that array. 
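        // The twiddle array is laid out "interleaved" (see gen_butterfly_twiddles_interleaved_columns!
        // with $num_rows = 4, $num_cols = 8 above), so the two 4-column halves of each row sit next to
        // each other:
        //
        //     row 1: mid0 (cols 0..3) *= twiddles[0],  mid1 (cols 4..7) *= twiddles[1]
        //     row 2: mid0 (cols 0..3) *= twiddles[2],  mid1 (cols 4..7) *= twiddles[3]
        //     row 3: mid0 (cols 0..3) *= twiddles[4],  mid1 (cols 4..7) *= twiddles[5]
        //
        // which is what the `2 * r - 2` / `2 * r - 1` indexing below selects. Row 0 needs no twiddles.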
let mut mid0 = AvxVector::column_butterfly4(rows0, self.twiddles_butterfly4); let mut mid1 = AvxVector::column_butterfly4(rows1, self.twiddles_butterfly4); // apply twiddle factors for r in 1..4 { mid0[r] = AvxVector::mul_complex(mid0[r], self.twiddles[2 * r - 2]); mid1[r] = AvxVector::mul_complex(mid1[r], self.twiddles[2 * r - 1]); } // Transpose our 8x4 array to an 4x8 array let transposed = avx32_utils::transpose_8x4_to_4x8_f32(mid0, mid1); // Do 4 butterfly 8's down the columns of our transpsed array let output_rows = AvxVector::column_butterfly8(transposed, self.twiddles_butterfly4); // Manually unrolling this loop because writing a "for r in 0..8" loop results in slow codegen that makes the whole thing take 1.5x longer :( buffer.store_complex(output_rows[0], 0); buffer.store_complex(output_rows[1], 1 * 4); buffer.store_complex(output_rows[2], 2 * 4); buffer.store_complex(output_rows[3], 3 * 4); buffer.store_complex(output_rows[4], 4 * 4); buffer.store_complex(output_rows[5], 5 * 4); buffer.store_complex(output_rows[6], 6 * 4); buffer.store_complex(output_rows[7], 7 * 4); } } pub struct Butterfly36Avx { twiddles: [__m256; 6], twiddles_butterfly9: [__m256; 3], twiddles_butterfly3: __m256, twiddles_butterfly4: Rotation90<__m256>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly36Avx, 36); impl Butterfly36Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(4, 9, 1, direction), twiddles_butterfly9: [ AvxVector::broadcast_twiddle(1, 9, direction), AvxVector::broadcast_twiddle(2, 9, direction), AvxVector::broadcast_twiddle(4, 9, direction), ], twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly36Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { // we're going to load our data in a peculiar way. we're going to load the first column on its own as a column of __m128. // it's faster to just load the first 2 columns into these m128s than trying to worry about masks, etc, so the second column will piggyback along and we just won't use it let mut rows0 = [AvxVector::zero(); 4]; let mut rows1 = [AvxVector::zero(); 4]; let mut rows2 = [AvxVector::zero(); 4]; for r in 0..4 { rows0[r] = buffer.load_partial2_complex(r * 9); rows1[r] = buffer.load_complex(r * 9 + 1); rows2[r] = buffer.load_complex(r * 9 + 5); } // butterfly 4s down the columns let mid0 = AvxVector::column_butterfly4(rows0, self.twiddles_butterfly4.lo()); let mut mid1 = AvxVector::column_butterfly4(rows1, self.twiddles_butterfly4); let mut mid2 = AvxVector::column_butterfly4(rows2, self.twiddles_butterfly4); // apply twiddle factors for r in 1..4 { mid1[r] = AvxVector::mul_complex(mid1[r], self.twiddles[2 * r - 2]); mid2[r] = AvxVector::mul_complex(mid2[r], self.twiddles[2 * r - 1]); } // transpose 9x4 to 4x9. 
this will be a little awkward because of rows0 containing garbage data, so use a transpose function that knows to ignore it let transposed = avx32_utils::transpose_9x4_to_4x9_emptycolumn1_f32(mid0, mid1, mid2); // butterfly 9s down the rows let output_rows = AvxVector256::column_butterfly9( transposed, self.twiddles_butterfly9, self.twiddles_butterfly3, ); for r in 0..3 { buffer.store_complex(output_rows[r * 3], r * 12); buffer.store_complex(output_rows[r * 3 + 1], r * 12 + 4); buffer.store_complex(output_rows[r * 3 + 2], r * 12 + 8); } } } pub struct Butterfly48Avx { twiddles: [__m256; 9], twiddles_butterfly3: __m256, twiddles_butterfly4: Rotation90<__m256>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly48Avx, 48); impl Butterfly48Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(4, 12, 0, direction), twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly48Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { let mut rows0 = [AvxVector::zero(); 4]; let mut rows1 = [AvxVector::zero(); 4]; let mut rows2 = [AvxVector::zero(); 4]; for r in 0..4 { rows0[r] = buffer.load_complex(12 * r); rows1[r] = buffer.load_complex(12 * r + 4); rows2[r] = buffer.load_complex(12 * r + 8); } // We're going to treat our input as a 12x4 2d array. First, do 12 butterfly 4's down the columns of that array. let mut mid0 = AvxVector::column_butterfly4(rows0, self.twiddles_butterfly4); let mut mid1 = AvxVector::column_butterfly4(rows1, self.twiddles_butterfly4); let mut mid2 = AvxVector::column_butterfly4(rows2, self.twiddles_butterfly4); // apply twiddle factors for r in 1..4 { mid0[r] = AvxVector::mul_complex(mid0[r], self.twiddles[3 * r - 3]); mid1[r] = AvxVector::mul_complex(mid1[r], self.twiddles[3 * r - 2]); mid2[r] = AvxVector::mul_complex(mid2[r], self.twiddles[3 * r - 1]); } // Transpose our 12x4 array into a 4x12. 
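        // mid0/mid1/mid2 together hold a 4-row x 12-column block (mid0 = columns 0..3, mid1 = columns
        // 4..7, mid2 = columns 8..11 of each row). As its name suggests, the transpose below hands back
        // 12 vectors of 4 complex values, where transposed[i] is original column i gathered across the
        // 4 rows, so the following column_butterfly12 computes 4 independent 12-point FFTs, one per
        // complex lane, and the results can be stored contiguously at offsets 0, 4, 8, ..., 44.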
let transposed = avx32_utils::transpose_12x4_to_4x12_f32(mid0, mid1, mid2); // Do 4 butterfly 12's down the columns of our transposed array let output_rows = AvxVector256::column_butterfly12( transposed, self.twiddles_butterfly3, self.twiddles_butterfly4, ); // Manually unrolling this loop because writing a "for r in 0..12" loop results in slow codegen that makes the whole thing take 1.5x longer :( buffer.store_complex(output_rows[0], 0); buffer.store_complex(output_rows[1], 4); buffer.store_complex(output_rows[2], 8); buffer.store_complex(output_rows[3], 12); buffer.store_complex(output_rows[4], 16); buffer.store_complex(output_rows[5], 20); buffer.store_complex(output_rows[6], 24); buffer.store_complex(output_rows[7], 28); buffer.store_complex(output_rows[8], 32); buffer.store_complex(output_rows[9], 36); buffer.store_complex(output_rows[10], 40); buffer.store_complex(output_rows[11], 44); } } pub struct Butterfly54Avx { twiddles: [__m256; 10], twiddles_butterfly9: [__m256; 3], twiddles_butterfly9_lo: [__m256; 2], twiddles_butterfly3: __m256, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly54Avx, 54); impl Butterfly54Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { let twiddle1 = __m128::broadcast_twiddle(1, 9, direction); let twiddle2 = __m128::broadcast_twiddle(2, 9, direction); let twiddle4 = __m128::broadcast_twiddle(4, 9, direction); Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(6, 9, 1, direction), twiddles_butterfly9: [ AvxVector::broadcast_twiddle(1, 9, direction), AvxVector::broadcast_twiddle(2, 9, direction), AvxVector::broadcast_twiddle(4, 9, direction), ], twiddles_butterfly9_lo: [ AvxVector256::merge(twiddle1, twiddle2), AvxVector256::merge(twiddle2, twiddle4), ], twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), direction, _phantom_t: PhantomData, } } } impl Butterfly54Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { // we're going to load our data in a peculiar way. we're going to load the first column on its own as a column of __m128. // it's faster to just load the first 2 columns into these m128s than trying to worry about masks, etc, so the second column will piggyback along and we just won't use it // // we have too much data to fit into registers all at once, so split up our data processing so that we entirely finish with one "rows_" array before moving to the next let mut rows0 = [AvxVector::zero(); 6]; for r in 0..3 { rows0[r * 2] = buffer.load_partial2_complex(r * 18); rows0[r * 2 + 1] = buffer.load_partial2_complex(r * 18 + 9); } let mid0 = AvxVector128::column_butterfly6(rows0, self.twiddles_butterfly3); // next set of butterfly 6's let mut rows1 = [AvxVector::zero(); 6]; for r in 0..3 { rows1[r * 2] = buffer.load_complex(r * 18 + 1); rows1[r * 2 + 1] = buffer.load_complex(r * 18 + 10); } let mut mid1 = AvxVector256::column_butterfly6(rows1, self.twiddles_butterfly3); for r in 1..6 { mid1[r] = AvxVector::mul_complex(mid1[r], self.twiddles[2 * r - 2]); } // final set of butterfly 6's let mut rows2 = [AvxVector::zero(); 6]; for r in 0..3 { rows2[r * 2] = buffer.load_complex(r * 18 + 5); rows2[r * 2 + 1] = buffer.load_complex(r * 18 + 14); } let mut mid2 = AvxVector256::column_butterfly6(rows2, self.twiddles_butterfly3); for r in 1..6 { mid2[r] = AvxVector::mul_complex(mid2[r], self.twiddles[2 * r - 1]); } // transpose 9x6 to 6x9. 
this will be a little awkward because of rows0 containing garbage data, so use a transpose function that knows to ignore it let (transposed0, transposed1) = avx32_utils::transpose_9x6_to_6x9_emptycolumn1_f32(mid0, mid1, mid2); // butterfly 9s down the rows // process the other half let output_rows1 = AvxVector128::column_butterfly9( transposed1, self.twiddles_butterfly9_lo, self.twiddles_butterfly3, ); for r in 0..9 { buffer.store_partial2_complex(output_rows1[r], r * 6 + 4); } // we have too much data to fit into registers all at once, do one set of butterfly 9's and output them before even starting on the others, to make it easier for the compiler to figure out what to spill let output_rows0 = AvxVector256::column_butterfly9( transposed0, self.twiddles_butterfly9, self.twiddles_butterfly3, ); for r in 0..9 { buffer.store_complex(output_rows0[r], r * 6); } } } pub struct Butterfly64Avx { twiddles: [__m256; 14], twiddles_butterfly4: Rotation90<__m256>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly64Avx, 64); impl Butterfly64Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_separated_columns!(8, 8, 0, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly64Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { // We're going to treat our input as a 8x8 2d array. First, do 8 butterfly 8's down the columns of that array. // We can't fit the whole problem into AVX registers at once, so we'll have to spill some things. // By computing a sizeable chunk and not referencing any of it for a while, we're making it easy for the compiler to decide what to spill let mut rows0 = [AvxVector::zero(); 8]; for r in 0..8 { rows0[r] = buffer.load_complex(8 * r); } let mut mid0 = AvxVector::column_butterfly8(rows0, self.twiddles_butterfly4); for r in 1..8 { mid0[r] = AvxVector::mul_complex(mid0[r], self.twiddles[r - 1]); } // One half is done, so the compiler can spill everything above this. 
Now do the other set of columns let mut rows1 = [AvxVector::zero(); 8]; for r in 0..8 { rows1[r] = buffer.load_complex(8 * r + 4); } let mut mid1 = AvxVector::column_butterfly8(rows1, self.twiddles_butterfly4); for r in 1..8 { mid1[r] = AvxVector::mul_complex(mid1[r], self.twiddles[r - 1 + 7]); } // Transpose our 8x8 array let (transposed0, transposed1) = avx32_utils::transpose_8x8_f32(mid0, mid1); // Do 8 butterfly 8's down the columns of our transposed array, and store the results // Same thing as above - Do the half of the butterfly 8's separately to give the compiler a better hint about what to spill let output0 = AvxVector::column_butterfly8(transposed0, self.twiddles_butterfly4); for r in 0..8 { buffer.store_complex(output0[r], 8 * r); } let output1 = AvxVector::column_butterfly8(transposed1, self.twiddles_butterfly4); for r in 0..8 { buffer.store_complex(output1[r], 8 * r + 4); } } } pub struct Butterfly72Avx { twiddles: [__m256; 15], twiddles_butterfly4: Rotation90<__m256>, twiddles_butterfly3: __m256, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly72Avx, 72); impl Butterfly72Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_separated_columns!(6, 12, 0, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), direction, _phantom_t: PhantomData, } } } impl Butterfly72Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f32(&self, mut buffer: impl AvxArrayMut) { // We're going to treat our input as a 12x6 2d array. First, do butterfly 6's down the columns of that array. // We can't fit the whole problem into AVX registers at once, so we'll have to spill some things. // By computing a sizeable chunk and not referencing any of it for a while, we're making it easy for the compiler to decide what to spill let mut rows0 = [AvxVector::zero(); 6]; for r in 0..6 { rows0[r] = buffer.load_complex(12 * r); } let mut mid0 = AvxVector256::column_butterfly6(rows0, self.twiddles_butterfly3); for r in 1..6 { mid0[r] = AvxVector::mul_complex(mid0[r], self.twiddles[r - 1]); } // One third is done, so the compiler can spill everything above this. Now do the middle set of columns let mut rows1 = [AvxVector::zero(); 6]; for r in 0..6 { rows1[r] = buffer.load_complex(12 * r + 4); } let mut mid1 = AvxVector256::column_butterfly6(rows1, self.twiddles_butterfly3); for r in 1..6 { mid1[r] = AvxVector::mul_complex(mid1[r], self.twiddles[r - 1 + 5]); } // two thirds are done, so the compiler can spill everything above this. 
Now do the final set of columns let mut rows2 = [AvxVector::zero(); 6]; for r in 0..6 { rows2[r] = buffer.load_complex(12 * r + 8); } let mut mid2 = AvxVector256::column_butterfly6(rows2, self.twiddles_butterfly3); for r in 1..6 { mid2[r] = AvxVector::mul_complex(mid2[r], self.twiddles[r - 1 + 10]); } // Transpose our 12x6 array to 6x12 array let (transposed0, transposed1) = avx32_utils::transpose_12x6_to_6x12_f32(mid0, mid1, mid2); // Do butterfly 12's down the columns of our transposed array, and store the results // Same thing as above - Do the half of the butterfly 12's separately to give the compiler a better hint about what to spill let output0 = AvxVector128::column_butterfly12( transposed0, self.twiddles_butterfly3, self.twiddles_butterfly4, ); for r in 0..12 { buffer.store_partial2_complex(output0[r], 6 * r); } let output1 = AvxVector256::column_butterfly12( transposed1, self.twiddles_butterfly3, self.twiddles_butterfly4, ); for r in 0..12 { buffer.store_complex(output1[r], 6 * r + 2); } } } pub struct Butterfly128Avx { twiddles: [__m256; 28], twiddles_butterfly16: [__m256; 2], twiddles_butterfly4: Rotation90<__m256>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly_with_scratch!(Butterfly128Avx, 128); impl Butterfly128Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_separated_columns!(8, 16, 0, direction), twiddles_butterfly16: [ AvxVector::broadcast_twiddle(1, 16, direction), AvxVector::broadcast_twiddle(3, 16, direction), ], twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly128Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn column_butterflies_and_transpose( &self, input: &[Complex], mut output: &mut [Complex], ) { // A size-128 FFT is way too big to fit in registers, so instead we're going to compute it in two phases, storing in scratch in between. // First phase is to treat this size-128 array like a 16x8 2D array, and do butterfly 8's down the columns // Then, apply twiddle factors, and finally transpose into the scratch space // But again, we don't have enough registers to load it all at once, so only load one column of AVX vectors at a time for columnset in 0..4 { let mut rows = [AvxVector::zero(); 8]; for r in 0..8 { rows[r] = input.load_complex(columnset * 4 + 16 * r); } // apply butterfly 8 let mut mid = AvxVector::column_butterfly8(rows, self.twiddles_butterfly4); // apply twiddle factors for r in 1..8 { mid[r] = AvxVector::mul_complex(mid[r], self.twiddles[r - 1 + 7 * columnset]); } // transpose let transposed = AvxVector::transpose8_packed(mid); // write out for i in 0..4 { output.store_complex(transposed[i * 2], columnset * 32 + i * 8); output.store_complex(transposed[i * 2 + 1], columnset * 32 + i * 8 + 4); } } } #[target_feature(enable = "avx", enable = "fma")] unsafe fn row_butterflies(&self, mut buffer: impl AvxArrayMut) { // Second phase: Butterfly 16's down the columns of our transposed array. 
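// (Added explanatory sketch; the layout is inferred from the stores in column_butterflies_and_transpose above, not stated in the original.)
// After the first phase, the scratch buffer holds a row-major 16x8 array of Complex<f32>: element (row, col) lives at index `row * 8 + col`.
// Each __m256 covers 4 interleaved Complex<f32>, so the 8 columns split into two "columnsets" of 4 columns each, which is why the
// closures below index with `columnset * 4 + index * 8` -- `index` walks the 16 rows of one size-16 column FFT.
// For example, columnset = 1, index = 3 touches elements 28..=31, i.e. row 3, columns 4..=7.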
// Thankfully, during the first phase, we set everything up so that all we have to do here is compute the size-16 FFT columns and write them back out where we got them // We're also using a customized butterfly16 function that is smarter about when it loads/stores data, to reduce register spilling for columnset in 0usize..2 { column_butterfly16_loadfn!( |index: usize| buffer.load_complex(columnset * 4 + index * 8), |data, index| buffer.store_complex(data, columnset * 4 + index * 8), self.twiddles_butterfly16, self.twiddles_butterfly4 ); } } } #[allow(non_camel_case_types)] pub struct Butterfly256Avx { twiddles: [__m256; 56], twiddles_butterfly32: [__m256; 6], twiddles_butterfly4: Rotation90<__m256>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly_with_scratch!(Butterfly256Avx, 256); impl Butterfly256Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_separated_columns!(8, 32, 0, direction), twiddles_butterfly32: [ AvxVector::broadcast_twiddle(1, 32, direction), AvxVector::broadcast_twiddle(2, 32, direction), AvxVector::broadcast_twiddle(3, 32, direction), AvxVector::broadcast_twiddle(5, 32, direction), AvxVector::broadcast_twiddle(6, 32, direction), AvxVector::broadcast_twiddle(7, 32, direction), ], twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly256Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn column_butterflies_and_transpose( &self, input: &[Complex], mut output: &mut [Complex], ) { // A size-256 FFT is way too big to fit in registers, so instead we're going to compute it in two phases, storing in scratch in between. // First phase is to treat this size-256 array like a 32x8 2D array, and do butterfly 8's down the columns // Then, apply twiddle factors, and finally transpose into the scratch space // But again, we don't have enough registers to load it all at once, so only load one column of AVX vectors at a time for columnset in 0..8 { let mut rows = [AvxVector::zero(); 8]; for r in 0..8 { rows[r] = input.load_complex(columnset * 4 + 32 * r); } let mut mid = AvxVector::column_butterfly8(rows, self.twiddles_butterfly4); for r in 1..8 { mid[r] = AvxVector::mul_complex(mid[r], self.twiddles[r - 1 + 7 * columnset]); } // Before writing to the scratch, transpose this chunk of the array let transposed = AvxVector::transpose8_packed(mid); for i in 0..4 { output.store_complex(transposed[i * 2], columnset * 32 + i * 8); output.store_complex(transposed[i * 2 + 1], columnset * 32 + i * 8 + 4); } } } #[target_feature(enable = "avx", enable = "fma")] unsafe fn row_butterflies(&self, mut buffer: impl AvxArrayMut) { // Second phase: Butterfly 32's down the columns of our transposed array. 
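// (Added for reference: a sketch of the standard mixed-radix Cooley-Tukey identity that this two-phase scheme follows; the notation is the editor's, not the original author's.)
// With N = 256 = 8 * 32, write the input index as n = 32*n1 + n2 and the output index as k = 8*k2 + k1, where n1, k1 are in 0..8 and n2, k2 are in 0..32.
// Then, with W_M = exp(-2*pi*i/M) for a forward FFT:
//
//     X[8*k2 + k1] = sum_{n2} ( W_256^(n2*k1) * sum_{n1} x[32*n1 + n2] * W_8^(n1*k1) ) * W_32^(n2*k2)
//
// The inner sum is the size-8 column FFT from the first phase, W_256^(n2*k1) is the twiddle applied right after it,
// and the outer sum is the size-32 FFT computed down each column of the transposed data in this second phase.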
// Thankfully, during the first phase, we set everything up so that all we have to do here is compute the size-32 FFT columns and write them back out where we got them // We're also using a customized butterfly32 function that is smarter about when it loads/stores data, to reduce register spilling for columnset in 0..2 { column_butterfly32_loadfn!( |index: usize| buffer.load_complex(columnset * 4 + index * 8), |data, index| buffer.store_complex(data, columnset * 4 + index * 8), self.twiddles_butterfly32, self.twiddles_butterfly4 ); } } } pub struct Butterfly512Avx { twiddles: [__m256; 120], twiddles_butterfly32: [__m256; 6], twiddles_butterfly16: [__m256; 2], twiddles_butterfly4: Rotation90<__m256>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly_with_scratch!(Butterfly512Avx, 512); impl Butterfly512Avx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_separated_columns!(16, 32, 0, direction), twiddles_butterfly32: [ AvxVector::broadcast_twiddle(1, 32, direction), AvxVector::broadcast_twiddle(2, 32, direction), AvxVector::broadcast_twiddle(3, 32, direction), AvxVector::broadcast_twiddle(5, 32, direction), AvxVector::broadcast_twiddle(6, 32, direction), AvxVector::broadcast_twiddle(7, 32, direction), ], twiddles_butterfly16: [ AvxVector::broadcast_twiddle(1, 16, direction), AvxVector::broadcast_twiddle(3, 16, direction), ], twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly512Avx { #[target_feature(enable = "avx", enable = "fma")] unsafe fn column_butterflies_and_transpose( &self, input: &[Complex], mut output: &mut [Complex], ) { // A size-512 FFT is way too big to fit in registers, so instead we're going to compute it in two phases, storing in scratch in between. // First phase is to treat this size-512 array like a 32x16 2D array, and do butterfly 16's down the columns // Then, apply twiddle factors, and finally transpose into the scratch space // But again, we don't have enough registers to load it all at once, so only load one column of AVX vectors at a time // We're also using a customized butterfly16 function that is smarter about when it loads/stores data, to reduce register spilling const TWIDDLES_PER_COLUMN: usize = 15; for (columnset, twiddle_chunk) in self.twiddles.chunks_exact(TWIDDLES_PER_COLUMN).enumerate() { // Sadly we have to use MaybeUninit here. If we init an array like normal with AvxVector::Zero(), the compiler can't seem to figure out that it can // eliminate the dead stores of zeroes to the stack. By using uninit here, we avoid those unnecessary writes let mut mid_uninit: [MaybeUninit<__m256>; 16] = [MaybeUninit::<__m256>::uninit(); 16]; column_butterfly16_loadfn!( |index: usize| input.load_complex(columnset * 4 + 32 * index), |data, index: usize| { mid_uninit[index].as_mut_ptr().write(data); }, self.twiddles_butterfly16, self.twiddles_butterfly4 ); // Apply twiddle factors, transpose, and store. 
Traditionally we apply all the twiddle factors at once and then do all the transposes at once, // But our data is pushing the limit of what we can store in registers, so the idea here is to get the data out the door with as few spills to the stack as possible for chunk in 0..4 { let twiddled = [ if chunk > 0 { AvxVector::mul_complex( mid_uninit[4 * chunk].assume_init(), twiddle_chunk[4 * chunk - 1], ) } else { mid_uninit[4 * chunk].assume_init() }, AvxVector::mul_complex( mid_uninit[4 * chunk + 1].assume_init(), twiddle_chunk[4 * chunk], ), AvxVector::mul_complex( mid_uninit[4 * chunk + 2].assume_init(), twiddle_chunk[4 * chunk + 1], ), AvxVector::mul_complex( mid_uninit[4 * chunk + 3].assume_init(), twiddle_chunk[4 * chunk + 2], ), ]; let transposed = AvxVector::transpose4_packed(twiddled); output.store_complex(transposed[0], columnset * 64 + 0 * 16 + 4 * chunk); output.store_complex(transposed[1], columnset * 64 + 1 * 16 + 4 * chunk); output.store_complex(transposed[2], columnset * 64 + 2 * 16 + 4 * chunk); output.store_complex(transposed[3], columnset * 64 + 3 * 16 + 4 * chunk); } } } #[target_feature(enable = "avx", enable = "fma")] unsafe fn row_butterflies(&self, mut buffer: impl AvxArrayMut) { // Second phase: Butterfly 32's down the columns of our transposed array. // Thankfully, during the first phase, we set everything up so that all we have to do here is compute the size-32 FFT columns and write them back out where we got them // We're also using a customized butterfly32 function that is smarter about when it loads/stores data, to reduce register spilling for columnset in 0..4 { column_butterfly32_loadfn!( |index: usize| buffer.load_complex(columnset * 4 + index * 16), |data, index| buffer.store_complex(data, columnset * 4 + index * 16), self.twiddles_butterfly32, self.twiddles_butterfly4 ); } } } #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; macro_rules! 
test_avx_butterfly { ($test_name:ident, $struct_name:ident, $size:expr) => ( #[test] fn $test_name() { let butterfly = $struct_name::::new(FftDirection::Forward).expect("Can't run test because this machine doesn't have the required instruction sets"); check_fft_algorithm(&butterfly as &dyn Fft, $size, FftDirection::Forward); let butterfly_inverse = $struct_name::::new(FftDirection::Inverse).expect("Can't run test because this machine doesn't have the required instruction sets"); check_fft_algorithm(&butterfly_inverse as &dyn Fft, $size, FftDirection::Inverse); } ) } test_avx_butterfly!(test_avx_butterfly5, Butterfly5Avx, 5); test_avx_butterfly!(test_avx_butterfly7, Butterfly7Avx, 7); test_avx_butterfly!(test_avx_butterfly8, Butterfly8Avx, 8); test_avx_butterfly!(test_avx_butterfly9, Butterfly9Avx, 9); test_avx_butterfly!(test_avx_butterfly11, Butterfly11Avx, 11); test_avx_butterfly!(test_avx_butterfly12, Butterfly12Avx, 12); test_avx_butterfly!(test_avx_butterfly16, Butterfly16Avx, 16); test_avx_butterfly!(test_avx_butterfly24, Butterfly24Avx, 24); test_avx_butterfly!(test_avx_butterfly27, Butterfly27Avx, 27); test_avx_butterfly!(test_avx_butterfly32, Butterfly32Avx, 32); test_avx_butterfly!(test_avx_butterfly36, Butterfly36Avx, 36); test_avx_butterfly!(test_avx_butterfly48, Butterfly48Avx, 48); test_avx_butterfly!(test_avx_butterfly54, Butterfly54Avx, 54); test_avx_butterfly!(test_avx_butterfly64, Butterfly64Avx, 64); test_avx_butterfly!(test_avx_butterfly72, Butterfly72Avx, 72); test_avx_butterfly!(test_avx_butterfly128, Butterfly128Avx, 128); test_avx_butterfly!(test_avx_butterfly256, Butterfly256Avx, 256); test_avx_butterfly!(test_avx_butterfly512, Butterfly512Avx, 512); } rustfft-6.2.0/src/avx/avx32_utils.rs000064400000000000000000000231200072674642500154720ustar 00000000000000use std::arch::x86_64::*; use super::avx_vector::{AvxVector, AvxVector256}; // Treat the input like the rows of a 4x4 array, and transpose said rows to the columns #[inline(always)] pub unsafe fn transpose_4x4_f32(rows: [__m256; 4]) -> [__m256; 4] { let permute0 = _mm256_permute2f128_ps(rows[0], rows[2], 0x20); let permute1 = _mm256_permute2f128_ps(rows[1], rows[3], 0x20); let permute2 = _mm256_permute2f128_ps(rows[0], rows[2], 0x31); let permute3 = _mm256_permute2f128_ps(rows[1], rows[3], 0x31); let [unpacked0, unpacked1] = AvxVector::unpack_complex([permute0, permute1]); let [unpacked2, unpacked3] = AvxVector::unpack_complex([permute2, permute3]); [unpacked0, unpacked1, unpacked2, unpacked3] } // Treat the input like the rows of a 4x8 array, and transpose it to a 8x4 array, where each array of 4 is one set of 4 columns // The assumption here is that it's very likely that the caller wants to do some more AVX operations on the columns of the transposed array, so the output is arranged to make that more convenient // The second array only has two columns of valid data. 
TODO: make them __m128 instead #[inline(always)] pub unsafe fn transpose_4x6_to_6x4_f32(rows: [__m256; 6]) -> ([__m256; 4], [__m256; 4]) { let chunk0 = [rows[0], rows[1], rows[2], rows[3]]; let chunk1 = [rows[4], rows[5], _mm256_setzero_ps(), _mm256_setzero_ps()]; let output0 = transpose_4x4_f32(chunk0); let output1 = transpose_4x4_f32(chunk1); (output0, output1) } // Treat the input like the rows of a 8x4 array, and transpose it to a 4x8 array #[inline(always)] pub unsafe fn transpose_8x4_to_4x8_f32(rows0: [__m256; 4], rows1: [__m256; 4]) -> [__m256; 8] { let transposed0 = transpose_4x4_f32(rows0); let transposed1 = transpose_4x4_f32(rows1); [ transposed0[0], transposed0[1], transposed0[2], transposed0[3], transposed1[0], transposed1[1], transposed1[2], transposed1[3], ] } // Treat the input like the rows of a 9x3 array, and transpose it to a 3x9 array. // our parameters are technically 10 columns, not 9 -- we're going to discard the second element of row0 #[inline(always)] pub unsafe fn transpose_9x3_to_3x9_emptycolumn1_f32( rows0: [__m128; 3], rows1: [__m256; 3], rows2: [__m256; 3], ) -> [__m256; 9] { // the first row of the output will be the first column of the input let unpacked0 = AvxVector::unpacklo_complex([rows0[0], rows0[1]]); let unpacked1 = AvxVector::unpacklo_complex([rows0[2], _mm_setzero_ps()]); let output0 = AvxVector256::merge(unpacked0, unpacked1); let transposed0 = transpose_4x4_f32([rows1[0], rows1[1], rows1[2], _mm256_setzero_ps()]); let transposed1 = transpose_4x4_f32([rows2[0], rows2[1], rows2[2], _mm256_setzero_ps()]); [ output0, transposed0[0], transposed0[1], transposed0[2], transposed0[3], transposed1[0], transposed1[1], transposed1[2], transposed1[3], ] } // Treat the input like the rows of a 9x4 array, and transpose it to a 4x9 array. // our parameters are technically 10 columns, not 9 -- we're going to discard the second element of row0 #[inline(always)] pub unsafe fn transpose_9x4_to_4x9_emptycolumn1_f32( rows0: [__m128; 4], rows1: [__m256; 4], rows2: [__m256; 4], ) -> [__m256; 9] { // the first row of the output will be the first column of the input let unpacked0 = AvxVector::unpacklo_complex([rows0[0], rows0[1]]); let unpacked1 = AvxVector::unpacklo_complex([rows0[2], rows0[3]]); let output0 = AvxVector256::merge(unpacked0, unpacked1); let transposed0 = transpose_4x4_f32([rows1[0], rows1[1], rows1[2], rows1[3]]); let transposed1 = transpose_4x4_f32([rows2[0], rows2[1], rows2[2], rows2[3]]); [ output0, transposed0[0], transposed0[1], transposed0[2], transposed0[3], transposed1[0], transposed1[1], transposed1[2], transposed1[3], ] } // Treat the input like the rows of a 9x6 array, and transpose it to a 6x9 array. 
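// (Illustrative sketch, added for clarity and not part of the original code.) Ignoring the SIMD packing, the helpers in this file all
// perform the plain index swap out[c][r] = in[r][c]. A hypothetical scalar reference for this 9x6 case, treating the input as
// 6 rows of 9 Complex<f32> values (useful only for sanity-checking the vectorized version):
//
//     fn transpose_9x6_reference(input: &[num_complex::Complex<f32>; 54]) -> [num_complex::Complex<f32>; 54] {
//         let mut output = [num_complex::Complex::new(0.0, 0.0); 54];
//         for r in 0..6 {
//             // input is stored as 6 rows of 9 values; output as 9 rows of 6 values
//             for c in 0..9 {
//                 output[c * 6 + r] = input[r * 9 + c];
//             }
//         }
//         output
//     }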
// our parameters are technically 10 columns, not 9 -- we're going to discard the second element of row0 #[inline(always)] pub unsafe fn transpose_9x6_to_6x9_emptycolumn1_f32( rows0: [__m128; 6], rows1: [__m256; 6], rows2: [__m256; 6], ) -> ([__m256; 9], [__m128; 9]) { // the first row of the output will be the first column of the input let unpacked0 = AvxVector::unpacklo_complex([rows0[0], rows0[1]]); let unpacked1 = AvxVector::unpacklo_complex([rows0[2], rows0[3]]); let unpacked2 = AvxVector::unpacklo_complex([rows0[4], rows0[5]]); let output0 = AvxVector256::merge(unpacked0, unpacked1); let transposed_hi0 = transpose_4x4_f32([rows1[0], rows1[1], rows1[2], rows1[3]]); let transposed_hi1 = transpose_4x4_f32([rows2[0], rows2[1], rows2[2], rows2[3]]); let [unpacked_bottom0, unpacked_bottom1] = AvxVector::unpack_complex([rows1[4], rows1[5]]); let [unpacked_bottom2, unpacked_bottom3] = AvxVector::unpack_complex([rows2[4], rows2[5]]); let transposed_lo = [ unpacked2, unpacked_bottom0.lo(), unpacked_bottom1.lo(), unpacked_bottom0.hi(), unpacked_bottom1.hi(), unpacked_bottom2.lo(), unpacked_bottom3.lo(), unpacked_bottom2.hi(), unpacked_bottom3.hi(), ]; ( [ output0, transposed_hi0[0], transposed_hi0[1], transposed_hi0[2], transposed_hi0[3], transposed_hi1[0], transposed_hi1[1], transposed_hi1[2], transposed_hi1[3], ], transposed_lo, ) } // Treat the input like the rows of a 12x4 array, and transpose it to a 4x12 array // The assumption here is that the caller wants to do some more AVX operations on the columns of the transposed array, so the output is arranged to make that more convenient #[inline(always)] pub unsafe fn transpose_12x4_to_4x12_f32( rows0: [__m256; 4], rows1: [__m256; 4], rows2: [__m256; 4], ) -> [__m256; 12] { let transposed0 = transpose_4x4_f32(rows0); let transposed1 = transpose_4x4_f32(rows1); let transposed2 = transpose_4x4_f32(rows2); [ transposed0[0], transposed0[1], transposed0[2], transposed0[3], transposed1[0], transposed1[1], transposed1[2], transposed1[3], transposed2[0], transposed2[1], transposed2[2], transposed2[3], ] } // Treat the input like the rows of a 12x6 array, and transpose it to a 6x12 array // The assumption here is that the caller wants to do some more AVX operations on the columns of the transposed array, so the output is arranged to make that more convenient #[inline(always)] pub unsafe fn transpose_12x6_to_6x12_f32( rows0: [__m256; 6], rows1: [__m256; 6], rows2: [__m256; 6], ) -> ([__m128; 12], [__m256; 12]) { let [unpacked0, unpacked1] = AvxVector::unpack_complex([rows0[0], rows0[1]]); let [unpacked2, unpacked3] = AvxVector::unpack_complex([rows1[0], rows1[1]]); let [unpacked4, unpacked5] = AvxVector::unpack_complex([rows2[0], rows2[1]]); let output0 = [ unpacked0.lo(), unpacked1.lo(), unpacked0.hi(), unpacked1.hi(), unpacked2.lo(), unpacked3.lo(), unpacked2.hi(), unpacked3.hi(), unpacked4.lo(), unpacked5.lo(), unpacked4.hi(), unpacked5.hi(), ]; let transposed0 = transpose_4x4_f32([rows0[2], rows0[3], rows0[4], rows0[5]]); let transposed1 = transpose_4x4_f32([rows1[2], rows1[3], rows1[4], rows1[5]]); let transposed2 = transpose_4x4_f32([rows2[2], rows2[3], rows2[4], rows2[5]]); let output1 = [ transposed0[0], transposed0[1], transposed0[2], transposed0[3], transposed1[0], transposed1[1], transposed1[2], transposed1[3], transposed2[0], transposed2[1], transposed2[2], transposed2[3], ]; (output0, output1) } // Treat the input like the rows of a 8x8 array, and transpose said rows to the columns // The assumption here is that the caller wants to do some 
more AVX operations on the columns of the transposed array, so the output is arranged to make that more convenient #[inline(always)] pub unsafe fn transpose_8x8_f32( rows0: [__m256; 8], rows1: [__m256; 8], ) -> ([__m256; 8], [__m256; 8]) { let chunk00 = [rows0[0], rows0[1], rows0[2], rows0[3]]; let chunk01 = [rows0[4], rows0[5], rows0[6], rows0[7]]; let chunk10 = [rows1[0], rows1[1], rows1[2], rows1[3]]; let chunk11 = [rows1[4], rows1[5], rows1[6], rows1[7]]; let transposed00 = transpose_4x4_f32(chunk00); let transposed01 = transpose_4x4_f32(chunk10); let transposed10 = transpose_4x4_f32(chunk01); let transposed11 = transpose_4x4_f32(chunk11); let output0 = [ transposed00[0], transposed00[1], transposed00[2], transposed00[3], transposed01[0], transposed01[1], transposed01[2], transposed01[3], ]; let output1 = [ transposed10[0], transposed10[1], transposed10[2], transposed10[3], transposed11[0], transposed11[1], transposed11[2], transposed11[3], ]; (output0, output1) } rustfft-6.2.0/src/avx/avx64_butterflies.rs000064400000000000000000002153700072674642500167010ustar 00000000000000use std::arch::x86_64::*; use std::marker::PhantomData; use std::mem::MaybeUninit; use num_complex::Complex; use crate::array_utils; use crate::array_utils::workaround_transmute_mut; use crate::array_utils::DoubleBuf; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::{common::FftNum, twiddles}; use crate::{Direction, Fft, FftDirection, Length}; use super::avx64_utils; use super::avx_vector; use super::avx_vector::{AvxArray, AvxArrayMut, AvxVector, AvxVector128, AvxVector256, Rotation90}; // Safety: This macro will call `self::perform_fft_f32()` which probably has a #[target_feature(enable = "...")] annotation on it. // Calling functions with that annotation is unsafe, because it doesn't actually check if the CPU has the required features. // Callers of this macro must guarantee that users can't even obtain an instance of $struct_name if their CPU doesn't have the required CPU features. #[allow(unused)] macro_rules! boilerplate_fft_simd_butterfly { ($struct_name:ident, $len:expr) => { impl $struct_name { #[inline] pub fn new(direction: FftDirection) -> Result { let has_avx = is_x86_feature_detected!("avx"); let has_fma = is_x86_feature_detected!("fma"); if has_avx && has_fma { // Safety: new_internal requires the "avx" feature set. 
Since we know it's present, we're safe Ok(unsafe { Self::new_with_avx(direction) }) } else { Err(()) } } } impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let result = array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why we have to transmute these slices let input_slice = workaround_transmute_mut(in_chunk); let output_slice = workaround_transmute_mut(out_chunk); self.perform_fft_f64(DoubleBuf { input: input_slice, output: output_slice, }); } }, ); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], _scratch: &mut [Complex]) { if buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let result = array_utils::iter_chunks(buffer, self.len(), |chunk| { unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why we have to transmute these slices self.perform_fft_f64(workaround_transmute_mut::<_, Complex>(chunk)); } }); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { 0 } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } // Safety: This macro will call `self::column_butterflies_and_transpose()` and `self::row_butterflies()` which probably has a #[target_feature(enable = "...")] annotation on it. // Calling functions with that annotation is unsafe, because it doesn't actually check if the CPU has the required features. // Callers of this macro must guarantee that users can't even obtain an instance of $struct_name if their CPU doesn't have the required CPU features. macro_rules! 
boilerplate_fft_simd_butterfly_with_scratch { ($struct_name:ident, $len:expr) => { impl $struct_name { #[inline] pub fn is_supported_by_cpu() -> bool { is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") } #[inline] pub fn new(direction: FftDirection) -> Result { if Self::is_supported_by_cpu() { // Safety: new_internal requires the "avx" feature set. Since we know it's present, we're safe Ok(unsafe { Self::new_with_avx(direction) }) } else { Err(()) } } } impl $struct_name { #[inline] fn perform_fft_inplace( &self, buffer: &mut [Complex], scratch: &mut [Complex], ) { // Perform the column FFTs // Safety: self.perform_column_butterflies() requres the "avx" and "fma" instruction sets, and we return Err() in our constructor if the instructions aren't available unsafe { self.column_butterflies_and_transpose(buffer, scratch) }; // process the row FFTs, and copy from the scratch back to the buffer as we go // Safety: self.transpose() requres the "avx" instruction set, and we return Err() in our constructor if the instructions aren't available unsafe { self.row_butterflies(DoubleBuf { input: scratch, output: buffer, }) }; } #[inline] fn perform_fft_out_of_place( &self, input: &mut [Complex], output: &mut [Complex], ) { // Perform the column FFTs // Safety: self.perform_column_butterflies() requres the "avx" and "fma" instruction sets, and we return Err() in our constructor if the instructions aren't available unsafe { self.column_butterflies_and_transpose(input, output) }; // process the row FFTs in-place in the output buffer // Safety: self.transpose() requres the "avx" instruction set, and we return Err() in our constructor if the instructions aren't available unsafe { self.row_butterflies(output) }; } } impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_input: &mut [Complex] = unsafe { array_utils::workaround_transmute_mut(input) }; let transmuted_output: &mut [Complex] = unsafe { array_utils::workaround_transmute_mut(output) }; let result = array_utils::iter_chunks_zipped( transmuted_input, transmuted_output, self.len(), |in_chunk, out_chunk| self.perform_fft_out_of_place(in_chunk, out_chunk), ); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { let required_scratch = self.len(); if scratch.len() < required_scratch || buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), self.len(), 
scratch.len()); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_buffer: &mut [Complex] = unsafe { array_utils::workaround_transmute_mut(buffer) }; let transmuted_scratch: &mut [Complex] = unsafe { array_utils::workaround_transmute_mut(scratch) }; let result = array_utils::iter_chunks(transmuted_buffer, self.len(), |chunk| { self.perform_fft_inplace(chunk, transmuted_scratch) }); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), self.len(), scratch.len()); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { $len } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } macro_rules! gen_butterfly_twiddles_interleaved_columns { ($num_rows:expr, $num_cols:expr, $skip_cols:expr, $direction: expr) => {{ const FFT_LEN: usize = $num_rows * $num_cols; const TWIDDLE_ROWS: usize = $num_rows - 1; const TWIDDLE_COLS: usize = $num_cols - $skip_cols; const TWIDDLE_VECTOR_COLS: usize = TWIDDLE_COLS / 2; const TWIDDLE_VECTOR_COUNT: usize = TWIDDLE_VECTOR_COLS * TWIDDLE_ROWS; let mut twiddles = [AvxVector::zero(); TWIDDLE_VECTOR_COUNT]; for index in 0..TWIDDLE_VECTOR_COUNT { let y = (index / TWIDDLE_VECTOR_COLS) + 1; let x = (index % TWIDDLE_VECTOR_COLS) * 2 + $skip_cols; twiddles[index] = AvxVector::make_mixedradix_twiddle_chunk(x, y, FFT_LEN, $direction); } twiddles }}; } macro_rules! 
gen_butterfly_twiddles_separated_columns { ($num_rows:expr, $num_cols:expr, $skip_cols:expr, $direction: expr) => {{ const FFT_LEN: usize = $num_rows * $num_cols; const TWIDDLE_ROWS: usize = $num_rows - 1; const TWIDDLE_COLS: usize = $num_cols - $skip_cols; const TWIDDLE_VECTOR_COLS: usize = TWIDDLE_COLS / 2; const TWIDDLE_VECTOR_COUNT: usize = TWIDDLE_VECTOR_COLS * TWIDDLE_ROWS; let mut twiddles = [AvxVector::zero(); TWIDDLE_VECTOR_COUNT]; for index in 0..TWIDDLE_VECTOR_COUNT { let y = (index % TWIDDLE_ROWS) + 1; let x = (index / TWIDDLE_ROWS) * 2 + $skip_cols; twiddles[index] = AvxVector::make_mixedradix_twiddle_chunk(x, y, FFT_LEN, $direction); } twiddles }}; } pub struct Butterfly5Avx64 { twiddles: [__m256d; 3], direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly5Avx64, 5); impl Butterfly5Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { let twiddle1 = twiddles::compute_twiddle(1, 5, direction); let twiddle2 = twiddles::compute_twiddle(2, 5, direction); Self { twiddles: [ _mm256_set_pd(twiddle1.im, twiddle1.im, twiddle1.re, twiddle1.re), _mm256_set_pd(twiddle2.im, twiddle2.im, twiddle2.re, twiddle2.re), _mm256_set_pd(-twiddle1.im, -twiddle1.im, twiddle1.re, twiddle1.re), ], direction, _phantom_t: PhantomData, } } } impl Butterfly5Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { let input0 = _mm256_loadu2_m128d( buffer.input_ptr() as *const f64, buffer.input_ptr() as *const f64, ); let input12 = buffer.load_complex(1); let input34 = buffer.load_complex(3); // swap elements for inputs 3 and 4 let input43 = AvxVector::reverse_complex_elements(input34); // do some prep work before we can start applying twiddle factors let [sum12, diff43] = AvxVector::column_butterfly2([input12, input43]); let rotation = AvxVector::make_rotation90(FftDirection::Inverse); let rotated43 = AvxVector::rotate90(diff43, rotation); let [mid14, mid23] = avx64_utils::transpose_2x2_f64([sum12, rotated43]); // to compute the first output, compute the sum of all elements. 
mid14[0] and mid23[0] already have the sum of 1+4 and 2+3 respectively, so if we add them, we'll get the sum of all 4 let sum1234 = AvxVector::add(mid14.lo(), mid23.lo()); let output0 = AvxVector::add(input0.lo(), sum1234); // apply twiddle factors let twiddled_outer14 = AvxVector::mul(mid14, self.twiddles[0]); let twiddled_inner14 = AvxVector::mul(mid14, self.twiddles[1]); let twiddled14 = AvxVector::fmadd(mid23, self.twiddles[1], twiddled_outer14); let twiddled23 = AvxVector::fmadd(mid23, self.twiddles[2], twiddled_inner14); // unpack the data for the last butterfly 2 let [twiddled12, twiddled43] = avx64_utils::transpose_2x2_f64([twiddled14, twiddled23]); let [output12, output43] = AvxVector::column_butterfly2([twiddled12, twiddled43]); // swap the elements in output43 before writing them out, and add the first input to everything let final12 = AvxVector::add(input0, output12); let output34 = AvxVector::reverse_complex_elements(output43); let final34 = AvxVector::add(input0, output34); buffer.store_partial1_complex(output0, 0); buffer.store_complex(final12, 1); buffer.store_complex(final34, 3); } } pub struct Butterfly7Avx64 { twiddles: [__m256d; 5], direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly7Avx64, 7); impl Butterfly7Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { let twiddle1 = twiddles::compute_twiddle(1, 7, direction); let twiddle2 = twiddles::compute_twiddle(2, 7, direction); let twiddle3 = twiddles::compute_twiddle(3, 7, direction); Self { twiddles: [ _mm256_set_pd(twiddle1.im, twiddle1.im, twiddle1.re, twiddle1.re), _mm256_set_pd(twiddle2.im, twiddle2.im, twiddle2.re, twiddle2.re), _mm256_set_pd(twiddle3.im, twiddle3.im, twiddle3.re, twiddle3.re), _mm256_set_pd(-twiddle3.im, -twiddle3.im, twiddle3.re, twiddle3.re), _mm256_set_pd(-twiddle1.im, -twiddle1.im, twiddle1.re, twiddle1.re), ], direction, _phantom_t: PhantomData, } } } impl Butterfly7Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { let input0 = _mm256_loadu2_m128d( buffer.input_ptr() as *const f64, buffer.input_ptr() as *const f64, ); let input12 = buffer.load_complex(1); let input3 = buffer.load_partial1_complex(3); let input4 = buffer.load_partial1_complex(4); let input56 = buffer.load_complex(5); // reverse the order of input56 let input65 = AvxVector::reverse_complex_elements(input56); // do some prep work before we can start applying twiddle factors let [sum12, diff65] = AvxVector::column_butterfly2([input12, input65]); let [sum3, diff4] = AvxVector::column_butterfly2([input3, input4]); let rotation = AvxVector::make_rotation90(FftDirection::Inverse); let rotated65 = AvxVector::rotate90(diff65, rotation); let rotated4 = AvxVector::rotate90(diff4, rotation.lo()); let [mid16, mid25] = avx64_utils::transpose_2x2_f64([sum12, rotated65]); let mid34 = AvxVector128::merge(sum3, rotated4); // to compute the first output, compute the sum of all elements. 
mid16[0], mid25[0], and mid34[0] already have the sum of 1+6, 2+5 and 3+4 respectively, so if we add them, we'll get 1+2+3+4+5+6 let output0_left = AvxVector::add(mid16.lo(), mid25.lo()); let output0_right = AvxVector::add(input0.lo(), mid34.lo()); let output0 = AvxVector::add(output0_left, output0_right); buffer.store_partial1_complex(output0, 0); // apply twiddle factors let twiddled16_intermediate1 = AvxVector::mul(mid16, self.twiddles[0]); let twiddled25_intermediate1 = AvxVector::mul(mid16, self.twiddles[1]); let twiddled34_intermediate1 = AvxVector::mul(mid16, self.twiddles[2]); let twiddled16_intermediate2 = AvxVector::fmadd(mid25, self.twiddles[1], twiddled16_intermediate1); let twiddled25_intermediate2 = AvxVector::fmadd(mid25, self.twiddles[3], twiddled25_intermediate1); let twiddled34_intermediate2 = AvxVector::fmadd(mid25, self.twiddles[4], twiddled34_intermediate1); let twiddled16 = AvxVector::fmadd(mid34, self.twiddles[2], twiddled16_intermediate2); let twiddled25 = AvxVector::fmadd(mid34, self.twiddles[4], twiddled25_intermediate2); let twiddled34 = AvxVector::fmadd(mid34, self.twiddles[1], twiddled34_intermediate2); // unpack the data for the last butterfly 2 let [twiddled12, twiddled65] = avx64_utils::transpose_2x2_f64([twiddled16, twiddled25]); // we can save one add if we add input0 to twiddled3 now. normally we'd add input0 to the final output, but the arrangement of data makes that a little awkward let twiddled03 = AvxVector::add(twiddled34.lo(), input0.lo()); let [output12, output65] = AvxVector::column_butterfly2([twiddled12, twiddled65]); let final12 = AvxVector::add(output12, input0); let output56 = AvxVector::reverse_complex_elements(output65); let final56 = AvxVector::add(output56, input0); let [final3, final4] = AvxVector::column_butterfly2([twiddled03, twiddled34.hi()]); buffer.store_complex(final12, 1); buffer.store_partial1_complex(final3, 3); buffer.store_partial1_complex(final4, 4); buffer.store_complex(final56, 5); } } pub struct Butterfly11Avx64 { twiddles: [__m256d; 10], direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly11Avx64, 11); impl Butterfly11Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { let twiddle1 = twiddles::compute_twiddle(1, 11, direction); let twiddle2 = twiddles::compute_twiddle(2, 11, direction); let twiddle3 = twiddles::compute_twiddle(3, 11, direction); let twiddle4 = twiddles::compute_twiddle(4, 11, direction); let twiddle5 = twiddles::compute_twiddle(5, 11, direction); let twiddles = [ _mm256_set_pd(twiddle1.im, twiddle1.im, twiddle1.re, twiddle1.re), _mm256_set_pd(twiddle2.im, twiddle2.im, twiddle2.re, twiddle2.re), _mm256_set_pd(twiddle3.im, twiddle3.im, twiddle3.re, twiddle3.re), _mm256_set_pd(twiddle4.im, twiddle4.im, twiddle4.re, twiddle4.re), _mm256_set_pd(twiddle5.im, twiddle5.im, twiddle5.re, twiddle5.re), _mm256_set_pd(-twiddle5.im, -twiddle5.im, twiddle5.re, twiddle5.re), _mm256_set_pd(-twiddle4.im, -twiddle4.im, twiddle4.re, twiddle4.re), _mm256_set_pd(-twiddle3.im, -twiddle3.im, twiddle3.re, twiddle3.re), _mm256_set_pd(-twiddle2.im, -twiddle2.im, twiddle2.re, twiddle2.re), _mm256_set_pd(-twiddle1.im, -twiddle1.im, twiddle1.re, twiddle1.re), ]; Self { twiddles, direction, _phantom_t: PhantomData, } } } impl Butterfly11Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { let input0 = buffer.load_partial1_complex(0); let input12 = 
buffer.load_complex(1); let input34 = buffer.load_complex(3); let input56 = buffer.load_complex(5); let input78 = buffer.load_complex(7); let input910 = buffer.load_complex(9); // reverse the order of input78910, and separate let [input55, input66] = AvxVector::unpack_complex([input56, input56]); let input87 = AvxVector::reverse_complex_elements(input78); let input109 = AvxVector::reverse_complex_elements(input910); // do some initial butterflies and rotations let [sum12, diff109] = AvxVector::column_butterfly2([input12, input109]); let [sum34, diff87] = AvxVector::column_butterfly2([input34, input87]); let [sum55, diff66] = AvxVector::column_butterfly2([input55, input66]); let rotation = AvxVector::make_rotation90(FftDirection::Inverse); let rotated109 = AvxVector::rotate90(diff109, rotation); let rotated87 = AvxVector::rotate90(diff87, rotation); let rotated66 = AvxVector::rotate90(diff66, rotation); // arrange the data into the format to apply twiddles let [mid110, mid29] = AvxVector::unpack_complex([sum12, rotated109]); let [mid38, mid47] = AvxVector::unpack_complex([sum34, rotated87]); let mid56 = AvxVector::unpacklo_complex([sum55, rotated66]); // to compute the first output, compute the sum of all elements. mid110[0], mid29[0], mid38[0], mid47 already have the sum of 1+10, 2+9 and so on, so if we add them, we'll get the sum of everything let mid12910 = AvxVector::add(mid110.lo(), mid29.lo()); let mid3478 = AvxVector::add(mid38.lo(), mid47.lo()); let output0_left = AvxVector::add(input0, mid56.lo()); let output0_right = AvxVector::add(mid12910, mid3478); let output0 = AvxVector::add(output0_left, output0_right); buffer.store_partial1_complex(output0, 0); // we need to add the first input to each of our 5 twiddles values -- but only the first complex element of each vector. 
so just use zero for the other element let zero = _mm_setzero_pd(); let input0 = AvxVector256::merge(input0, zero); // apply twiddle factors let twiddled110 = AvxVector::fmadd(mid110, self.twiddles[0], input0); let twiddled38 = AvxVector::fmadd(mid110, self.twiddles[2], input0); let twiddled29 = AvxVector::fmadd(mid110, self.twiddles[1], input0); let twiddled47 = AvxVector::fmadd(mid110, self.twiddles[3], input0); let twiddled56 = AvxVector::fmadd(mid110, self.twiddles[4], input0); let twiddled110 = AvxVector::fmadd(mid29, self.twiddles[1], twiddled110); let twiddled38 = AvxVector::fmadd(mid29, self.twiddles[5], twiddled38); let twiddled29 = AvxVector::fmadd(mid29, self.twiddles[3], twiddled29); let twiddled47 = AvxVector::fmadd(mid29, self.twiddles[7], twiddled47); let twiddled56 = AvxVector::fmadd(mid29, self.twiddles[9], twiddled56); let twiddled110 = AvxVector::fmadd(mid38, self.twiddles[2], twiddled110); let twiddled38 = AvxVector::fmadd(mid38, self.twiddles[8], twiddled38); let twiddled29 = AvxVector::fmadd(mid38, self.twiddles[5], twiddled29); let twiddled47 = AvxVector::fmadd(mid38, self.twiddles[0], twiddled47); let twiddled56 = AvxVector::fmadd(mid38, self.twiddles[3], twiddled56); let twiddled110 = AvxVector::fmadd(mid47, self.twiddles[3], twiddled110); let twiddled38 = AvxVector::fmadd(mid47, self.twiddles[0], twiddled38); let twiddled29 = AvxVector::fmadd(mid47, self.twiddles[7], twiddled29); let twiddled47 = AvxVector::fmadd(mid47, self.twiddles[4], twiddled47); let twiddled56 = AvxVector::fmadd(mid47, self.twiddles[8], twiddled56); let twiddled110 = AvxVector::fmadd(mid56, self.twiddles[4], twiddled110); let twiddled38 = AvxVector::fmadd(mid56, self.twiddles[3], twiddled38); let twiddled29 = AvxVector::fmadd(mid56, self.twiddles[9], twiddled29); let twiddled47 = AvxVector::fmadd(mid56, self.twiddles[8], twiddled47); let twiddled56 = AvxVector::fmadd(mid56, self.twiddles[2], twiddled56); // unpack the data for the last butterfly 2 let [twiddled12, twiddled109] = AvxVector::unpack_complex([twiddled110, twiddled29]); let [twiddled34, twiddled87] = AvxVector::unpack_complex([twiddled38, twiddled47]); let [twiddled55, twiddled66] = AvxVector::unpack_complex([twiddled56, twiddled56]); let [output12, output109] = AvxVector::column_butterfly2([twiddled12, twiddled109]); let [output34, output87] = AvxVector::column_butterfly2([twiddled34, twiddled87]); let [output55, output66] = AvxVector::column_butterfly2([twiddled55, twiddled66]); let output78 = AvxVector::reverse_complex_elements(output87); let output910 = AvxVector::reverse_complex_elements(output109); buffer.store_complex(output12, 1); buffer.store_complex(output34, 3); buffer.store_partial1_complex(output55.lo(), 5); buffer.store_partial1_complex(output66.lo(), 6); buffer.store_complex(output78, 7); buffer.store_complex(output910, 9); } } pub struct Butterfly8Avx64 { twiddles: [__m256d; 2], twiddles_butterfly4: Rotation90<__m256d>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly8Avx64, 8); impl Butterfly8Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(2, 4, 0, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly8Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { let row0 = buffer.load_complex(0); let row1 = 
buffer.load_complex(2); let row2 = buffer.load_complex(4); let row3 = buffer.load_complex(6); // Do our butterfly 2's down the columns of a 4x2 array let [mid0, mid2] = AvxVector::column_butterfly2([row0, row2]); let [mid1, mid3] = AvxVector::column_butterfly2([row1, row3]); let mid2_twiddled = AvxVector::mul_complex(mid2, self.twiddles[0]); let mid3_twiddled = AvxVector::mul_complex(mid3, self.twiddles[1]); // transpose to a 2x4 array let transposed = avx64_utils::transpose_4x2_to_2x4_f64([mid0, mid2_twiddled], [mid1, mid3_twiddled]); // butterfly 4's down the transposed array let output_rows = AvxVector::column_butterfly4(transposed, self.twiddles_butterfly4); buffer.store_complex(output_rows[0], 0); buffer.store_complex(output_rows[1], 2); buffer.store_complex(output_rows[2], 4); buffer.store_complex(output_rows[3], 6); } } pub struct Butterfly9Avx64 { twiddles: [__m256d; 2], twiddles_butterfly3: __m256d, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly9Avx64, 9); impl Butterfly9Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(3, 3, 1, direction), twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), direction, _phantom_t: PhantomData, } } } impl Butterfly9Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { // we're going to load our input as a 3x3 array. We have to load 3 columns, which is a little awkward // We can reduce the number of multiplies we do if we load the first column as half-width and the second column as full. let mut rows0 = [AvxVector::zero(); 3]; let mut rows1 = [AvxVector::zero(); 3]; for r in 0..3 { rows0[r] = buffer.load_partial1_complex(3 * r); rows1[r] = buffer.load_complex(3 * r + 1); } // do butterfly 3's down the columns let mid0 = AvxVector::column_butterfly3(rows0, self.twiddles_butterfly3.lo()); let mut mid1 = AvxVector::column_butterfly3(rows1, self.twiddles_butterfly3); // apply twiddle factors for n in 1..3 { mid1[n] = AvxVector::mul_complex(mid1[n], self.twiddles[n - 1]); } // transpose our 3x3 array let (transposed0, transposed1) = avx64_utils::transpose_3x3_f64(mid0, mid1); // apply butterfly 3's down the columns let output0 = AvxVector::column_butterfly3(transposed0, self.twiddles_butterfly3.lo()); let output1 = AvxVector::column_butterfly3(transposed1, self.twiddles_butterfly3); for r in 0..3 { buffer.store_partial1_complex(output0[r], 3 * r); buffer.store_complex(output1[r], 3 * r + 1); } } } pub struct Butterfly12Avx64 { twiddles: [__m256d; 3], twiddles_butterfly3: __m256d, twiddles_butterfly4: Rotation90<__m256d>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly12Avx64, 12); impl Butterfly12Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(4, 3, 1, direction), twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly12Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { // we're going to load our input as a 3x4 array. 
We have to load 3 columns, which is a little awkward // We can reduce the number of multiplies we do if we load the first column as half-width and the second column as full. let mut rows0 = [AvxVector::zero(); 4]; let mut rows1 = [AvxVector::zero(); 4]; for n in 0..4 { rows0[n] = buffer.load_partial1_complex(n * 3); rows1[n] = buffer.load_complex(n * 3 + 1); } // do butterfly 4's down the columns let mid0 = AvxVector::column_butterfly4(rows0, self.twiddles_butterfly4.lo()); let mut mid1 = AvxVector::column_butterfly4(rows1, self.twiddles_butterfly4); // apply twiddle factors for n in 1..4 { mid1[n] = AvxVector::mul_complex(mid1[n], self.twiddles[n - 1]); } // transpose our 3x4 array to a 4x3 array let (transposed0, transposed1) = avx64_utils::transpose_3x4_to_4x3_f64(mid0, mid1); // apply butterfly 3's down the columns let output0 = AvxVector::column_butterfly3(transposed0, self.twiddles_butterfly3); let output1 = AvxVector::column_butterfly3(transposed1, self.twiddles_butterfly3); for r in 0..3 { buffer.store_complex(output0[r], 4 * r); buffer.store_complex(output1[r], 4 * r + 2); } } } pub struct Butterfly16Avx64 { twiddles: [__m256d; 6], twiddles_butterfly4: Rotation90<__m256d>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly16Avx64, 16); impl Butterfly16Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(4, 4, 0, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly16Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { let mut rows0 = [AvxVector::zero(); 4]; let mut rows1 = [AvxVector::zero(); 4]; for r in 0..4 { rows0[r] = buffer.load_complex(4 * r); rows1[r] = buffer.load_complex(4 * r + 2); } // We're going to treat our input as a 4x4 2d array. First, do 4 butterfly 4's down the columns of that array. let mut mid0 = AvxVector::column_butterfly4(rows0, self.twiddles_butterfly4); let mut mid1 = AvxVector::column_butterfly4(rows1, self.twiddles_butterfly4); // apply twiddle factors for r in 1..4 { mid0[r] = AvxVector::mul_complex(mid0[r], self.twiddles[2 * r - 2]); mid1[r] = AvxVector::mul_complex(mid1[r], self.twiddles[2 * r - 1]); } // Transpose our 4x4 array let (transposed0, transposed1) = avx64_utils::transpose_4x4_f64(mid0, mid1); // Butterfly 4's down columns of the transposed array let output0 = AvxVector::column_butterfly4(transposed0, self.twiddles_butterfly4); let output1 = AvxVector::column_butterfly4(transposed1, self.twiddles_butterfly4); for r in 0..4 { buffer.store_complex(output0[r], 4 * r); buffer.store_complex(output1[r], 4 * r + 2); } } } pub struct Butterfly18Avx64 { twiddles: [__m256d; 5], twiddles_butterfly3: __m256d, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly18Avx64, 18); impl Butterfly18Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(6, 3, 1, direction), twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), direction, _phantom_t: PhantomData, } } } impl Butterfly18Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { // we're going to load our input as a 3x6 array. 
We have to load 3 columns, which is a little awkward // We can reduce the number of multiplies we do if we load the first column as half-width and the second column as full. let mut rows0 = [AvxVector::zero(); 6]; let mut rows1 = [AvxVector::zero(); 6]; for n in 0..6 { rows0[n] = buffer.load_partial1_complex(n * 3); rows1[n] = buffer.load_complex(n * 3 + 1); } // do butterfly 6's down the columns let mid0 = AvxVector128::column_butterfly6(rows0, self.twiddles_butterfly3); let mut mid1 = AvxVector256::column_butterfly6(rows1, self.twiddles_butterfly3); // apply twiddle factors for n in 1..6 { mid1[n] = AvxVector::mul_complex(mid1[n], self.twiddles[n - 1]); } // transpose our 3x6 array to a 6x3 array let (transposed0, transposed1, transposed2) = avx64_utils::transpose_3x6_to_6x3_f64(mid0, mid1); // apply butterfly 3's down the columns let output0 = AvxVector::column_butterfly3(transposed0, self.twiddles_butterfly3); let output1 = AvxVector::column_butterfly3(transposed1, self.twiddles_butterfly3); let output2 = AvxVector::column_butterfly3(transposed2, self.twiddles_butterfly3); for r in 0..3 { buffer.store_complex(output0[r], 6 * r); buffer.store_complex(output1[r], 6 * r + 2); buffer.store_complex(output2[r], 6 * r + 4); } } } pub struct Butterfly24Avx64 { twiddles: [__m256d; 9], twiddles_butterfly3: __m256d, twiddles_butterfly4: Rotation90<__m256d>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly24Avx64, 24); impl Butterfly24Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(4, 6, 0, direction), twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly24Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { let mut rows0 = [AvxVector::zero(); 4]; let mut rows1 = [AvxVector::zero(); 4]; let mut rows2 = [AvxVector::zero(); 4]; for r in 0..4 { rows0[r] = buffer.load_complex(6 * r); rows1[r] = buffer.load_complex(6 * r + 2); rows2[r] = buffer.load_complex(6 * r + 4); } // We're going to treat our input as a 6x4 2d array. First, do 6 butterfly 4's down the columns of that array. 
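// (Added note: a sketch of the standard mixed-radix twiddle bookkeeping, inferred from the indexing below rather than spelled out in the original.)
// After the size-4 column FFTs, the element at (row r, column c) of this 6x4 view is multiplied by W_24^(r*c) = exp(-2*pi*i*r*c/24),
// conjugated for an inverse FFT. Each __m256d packs two adjacent columns, so `self.twiddles[3*r - 3 + j]` is assumed to hold the pair
// { W_24^(r*2j), W_24^(r*(2j+1)) } for column group j in 0..3 -- e.g. mid2[3] (columns 4..5) gets scaled by twiddles[8] = { W_24^12, W_24^15 }.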
let mut mid0 = AvxVector::column_butterfly4(rows0, self.twiddles_butterfly4); let mut mid1 = AvxVector::column_butterfly4(rows1, self.twiddles_butterfly4); let mut mid2 = AvxVector::column_butterfly4(rows2, self.twiddles_butterfly4); // apply twiddle factors for r in 1..4 { mid0[r] = AvxVector::mul_complex(mid0[r], self.twiddles[3 * r - 3]); mid1[r] = AvxVector::mul_complex(mid1[r], self.twiddles[3 * r - 2]); mid2[r] = AvxVector::mul_complex(mid2[r], self.twiddles[3 * r - 1]); } // Transpose our 6x4 array let (transposed0, transposed1) = avx64_utils::transpose_6x4_to_4x6_f64(mid0, mid1, mid2); // Butterfly 6's down columns of the transposed array let output0 = AvxVector256::column_butterfly6(transposed0, self.twiddles_butterfly3); let output1 = AvxVector256::column_butterfly6(transposed1, self.twiddles_butterfly3); for r in 0..6 { buffer.store_complex(output0[r], 4 * r); buffer.store_complex(output1[r], 4 * r + 2); } } } pub struct Butterfly27Avx64 { twiddles: [__m256d; 8], twiddles_butterfly9: [__m256d; 3], twiddles_butterfly9_lo: [__m256d; 2], twiddles_butterfly3: __m256d, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly27Avx64, 27); impl Butterfly27Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { let twiddle1 = __m128d::broadcast_twiddle(1, 9, direction); let twiddle2 = __m128d::broadcast_twiddle(2, 9, direction); let twiddle4 = __m128d::broadcast_twiddle(4, 9, direction); Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(3, 9, 1, direction), twiddles_butterfly9: [ AvxVector::broadcast_twiddle(1, 9, direction), AvxVector::broadcast_twiddle(2, 9, direction), AvxVector::broadcast_twiddle(4, 9, direction), ], twiddles_butterfly9_lo: [ AvxVector256::merge(twiddle1, twiddle2), AvxVector256::merge(twiddle2, twiddle4), ], twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), direction, _phantom_t: PhantomData, } } } impl Butterfly27Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { // we're going to load our input as a 9x3 array. We have to load 9 columns, which is a little awkward // We can reduce the number of multiplies we do if we load the first column as half-width and the remaining 4 sets of vectors as full. // We can't fit the whole problem into AVX registers at once, so we'll have to spill some things. 
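// (Added note: rough register arithmetic behind that claim, based on generic x86-64 AVX register counts rather than anything stated in the original.)
// Each __m256d holds 2 Complex<f64>, so the 27 inputs alone occupy about 14 vectors, and the cached twiddles (8 + 3 + 2 + 1 here)
// plus loop temporaries push well past the 16 architectural ymm registers, so some values have to live on the stack no matter how the work is ordered.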
// By computing chunks of the problem and then not referencing any of it for a while, we're making it easy for the compiler to decide what to spill let mut rows0 = [AvxVector::zero(); 3]; for n in 0..3 { rows0[n] = buffer.load_partial1_complex(n * 9); } let mid0 = AvxVector::column_butterfly3(rows0, self.twiddles_butterfly3.lo()); // First chunk is done and can be spilled, do 2 more chunks let mut rows1 = [AvxVector::zero(); 3]; let mut rows2 = [AvxVector::zero(); 3]; for n in 0..3 { rows1[n] = buffer.load_complex(n * 9 + 1); rows2[n] = buffer.load_complex(n * 9 + 3); } let mut mid1 = AvxVector::column_butterfly3(rows1, self.twiddles_butterfly3); let mut mid2 = AvxVector::column_butterfly3(rows2, self.twiddles_butterfly3); for r in 1..3 { mid1[r] = AvxVector::mul_complex(mid1[r], self.twiddles[4 * r - 4]); mid2[r] = AvxVector::mul_complex(mid2[r], self.twiddles[4 * r - 3]); } // First 3 chunks are done and can be spilled, do the final 2 chunks let mut rows3 = [AvxVector::zero(); 3]; let mut rows4 = [AvxVector::zero(); 3]; for n in 0..3 { rows3[n] = buffer.load_complex(n * 9 + 5); rows4[n] = buffer.load_complex(n * 9 + 7); } let mut mid3 = AvxVector::column_butterfly3(rows3, self.twiddles_butterfly3); let mut mid4 = AvxVector::column_butterfly3(rows4, self.twiddles_butterfly3); for r in 1..3 { mid3[r] = AvxVector::mul_complex(mid3[r], self.twiddles[4 * r - 2]); mid4[r] = AvxVector::mul_complex(mid4[r], self.twiddles[4 * r - 1]); } // transpose our 9x3 array to a 3x9 array let (transposed0, transposed1) = avx64_utils::transpose_9x3_to_3x9_f64(mid0, mid1, mid2, mid3, mid4); // apply butterfly 9's down the columns. Again, do the work in chunks to make it easier for the compiler to spill let output0 = AvxVector128::column_butterfly9( transposed0, self.twiddles_butterfly9_lo, self.twiddles_butterfly3, ); for r in 0..3 { buffer.store_partial1_complex(output0[r * 3], 9 * r); buffer.store_partial1_complex(output0[r * 3 + 1], 9 * r + 3); buffer.store_partial1_complex(output0[r * 3 + 2], 9 * r + 6); } let output1 = AvxVector256::column_butterfly9( transposed1, self.twiddles_butterfly9, self.twiddles_butterfly3, ); for r in 0..3 { buffer.store_complex(output1[r * 3], 9 * r + 1); buffer.store_complex(output1[r * 3 + 1], 9 * r + 4); buffer.store_complex(output1[r * 3 + 2], 9 * r + 7); } } } pub struct Butterfly32Avx64 { twiddles: [__m256d; 12], twiddles_butterfly4: Rotation90<__m256d>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly32Avx64, 32); impl Butterfly32Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_interleaved_columns!(4, 8, 0, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly32Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { // We're going to treat our input as a 8x4 2d array. First, do 8 butterfly 4's down the columns of that array. // We can't fit the whole problem into AVX registers at once, so we'll have to spill some things. 
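// (Layout note: element (row r, column c) of that 8x4 view lives at buffer index 8*r + c, and each __m256d covers two
// adjacent columns, so the 8 columns are processed as four 2-column sets at offsets 0, 2, 4, and 6.)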
// By computing half of the problem and then not referencing any of it for a while, we're making it easy for the compiler to decide what to spill let mut rows0 = [AvxVector::zero(); 4]; let mut rows1 = [AvxVector::zero(); 4]; for r in 0..4 { rows0[r] = buffer.load_complex(8 * r); rows1[r] = buffer.load_complex(8 * r + 2); } let mut mid0 = AvxVector::column_butterfly4(rows0, self.twiddles_butterfly4); let mut mid1 = AvxVector::column_butterfly4(rows1, self.twiddles_butterfly4); for r in 1..4 { mid0[r] = AvxVector::mul_complex(mid0[r], self.twiddles[4 * r - 4]); mid1[r] = AvxVector::mul_complex(mid1[r], self.twiddles[4 * r - 3]); } // One half is done, so the compiler can spill everything above this. Now do the other set of columns let mut rows2 = [AvxVector::zero(); 4]; let mut rows3 = [AvxVector::zero(); 4]; for r in 0..4 { rows2[r] = buffer.load_complex(8 * r + 4); rows3[r] = buffer.load_complex(8 * r + 6); } let mut mid2 = AvxVector::column_butterfly4(rows2, self.twiddles_butterfly4); let mut mid3 = AvxVector::column_butterfly4(rows3, self.twiddles_butterfly4); for r in 1..4 { mid2[r] = AvxVector::mul_complex(mid2[r], self.twiddles[4 * r - 2]); mid3[r] = AvxVector::mul_complex(mid3[r], self.twiddles[4 * r - 1]); } // Transpose our 8x4 array to a 4x8 array let (transposed0, transposed1) = avx64_utils::transpose_8x4_to_4x8_f64(mid0, mid1, mid2, mid3); // Do 4 butterfly 8's down columns of the transposed array // Same thing as above - Do the half of the butterfly 8's separately to give the compiler a better hint about what to spill let output0 = AvxVector::column_butterfly8(transposed0, self.twiddles_butterfly4); for r in 0..8 { buffer.store_complex(output0[r], 4 * r); } let output1 = AvxVector::column_butterfly8(transposed1, self.twiddles_butterfly4); for r in 0..8 { buffer.store_complex(output1[r], 4 * r + 2); } } } pub struct Butterfly36Avx64 { twiddles: [__m256d; 15], twiddles_butterfly3: __m256d, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly!(Butterfly36Avx64, 36); impl Butterfly36Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_separated_columns!(6, 6, 0, direction), twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, direction), direction, _phantom_t: PhantomData, } } } impl Butterfly36Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_fft_f64(&self, mut buffer: impl AvxArrayMut) { // we're going to load our input as a 6x6 array // We can't fit the whole problem into AVX registers at once, so we'll have to spill some things. 
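// (Layout note: element (row r, column c) sits at index 6*r + c, the 6 columns are handled as three 2-column sets,
// and the 15 twiddles are grouped as three blocks of 5 (indices r-1, r+4, and r+9 below), one block per column set.)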
// By computing chunks of the problem and then not referencing any of it for a while, we're making it easy for the compiler to decide what to spill let mut rows0 = [AvxVector::zero(); 6]; for n in 0..6 { rows0[n] = buffer.load_complex(n * 6); } let mut mid0 = AvxVector256::column_butterfly6(rows0, self.twiddles_butterfly3); for r in 1..6 { mid0[r] = AvxVector::mul_complex(mid0[r], self.twiddles[r - 1]); } // we're going to load our input as a 6x6 array let mut rows1 = [AvxVector::zero(); 6]; for n in 0..6 { rows1[n] = buffer.load_complex(n * 6 + 2); } let mut mid1 = AvxVector256::column_butterfly6(rows1, self.twiddles_butterfly3); for r in 1..6 { mid1[r] = AvxVector::mul_complex(mid1[r], self.twiddles[r + 4]); } // we're going to load our input as a 6x6 array let mut rows2 = [AvxVector::zero(); 6]; for n in 0..6 { rows2[n] = buffer.load_complex(n * 6 + 4); } let mut mid2 = AvxVector256::column_butterfly6(rows2, self.twiddles_butterfly3); for r in 1..6 { mid2[r] = AvxVector::mul_complex(mid2[r], self.twiddles[r + 9]); } // Transpose our 6x6 array let (transposed0, transposed1, transposed2) = avx64_utils::transpose_6x6_f64(mid0, mid1, mid2); // Apply butterfly 6's down the columns. Again, do the work in chunks to make it easier for the compiler to spill let output0 = AvxVector256::column_butterfly6(transposed0, self.twiddles_butterfly3); for r in 0..3 { buffer.store_complex(output0[r * 2], 12 * r); buffer.store_complex(output0[r * 2 + 1], 12 * r + 6); } let output1 = AvxVector256::column_butterfly6(transposed1, self.twiddles_butterfly3); for r in 0..3 { buffer.store_complex(output1[r * 2], 12 * r + 2); buffer.store_complex(output1[r * 2 + 1], 12 * r + 8); } let output2 = AvxVector256::column_butterfly6(transposed2, self.twiddles_butterfly3); for r in 0..3 { buffer.store_complex(output2[r * 2], 12 * r + 4); buffer.store_complex(output2[r * 2 + 1], 12 * r + 10); } } } pub struct Butterfly64Avx64 { twiddles: [__m256d; 28], twiddles_butterfly4: Rotation90<__m256d>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly_with_scratch!(Butterfly64Avx64, 64); impl Butterfly64Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_separated_columns!(8, 8, 0, direction), twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly64Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn column_butterflies_and_transpose( &self, input: &[Complex], mut output: &mut [Complex], ) { // A size-64 FFT is way too big to fit in registers, so instead we're going to compute it in two phases, storing in scratch in between. 
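// As a rough scalar sketch of the same two-phase idea (fft8/twiddle are illustrative names, not part of this crate):
//
//     // phase 1: size-8 FFTs down the columns of an 8x8 view of the input, twiddle, transpose into scratch
//     for c in 0..8 {
//         let mut col: Vec<Complex<f64>> = (0..8).map(|r| input[8 * r + c]).collect();
//         fft8(&mut col);
//         for r in 1..8 { col[r] = col[r] * twiddle(r * c, 64, direction); }
//         for r in 0..8 { scratch[8 * c + r] = col[r]; }
//     }
//     // phase 2: size-8 FFTs down the columns of the transposed view, written back in place
//
// The AVX version below does the same thing two columns at a time, keeping each column-set in registers.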
// First phase is to treat this size-64 array like a 8x8 2D array, and do butterfly 8's down the columns // Then, apply twiddle factors, and finally transpose into the scratch space // But again, we don't have enough registers to load it all at once, so only load one column of AVX vectors at a time for columnset in 0..4 { let mut rows = [AvxVector::zero(); 8]; for r in 0..8 { rows[r] = input.load_complex(columnset * 2 + 8 * r); } // apply butterfly 8 let mut mid = AvxVector::column_butterfly8(rows, self.twiddles_butterfly4); // apply twiddle factors for r in 1..8 { mid[r] = AvxVector::mul_complex(mid[r], self.twiddles[r - 1 + 7 * columnset]); } // transpose let transposed = AvxVector::transpose8_packed(mid); // write out for i in 0..4 { output.store_complex(transposed[i * 2], columnset * 16 + i * 4); output.store_complex(transposed[i * 2 + 1], columnset * 16 + i * 4 + 2); } } } #[target_feature(enable = "avx", enable = "fma")] unsafe fn row_butterflies(&self, mut buffer: impl AvxArrayMut) { // Second phase: Butterfly 8's down the columns of our transposed array. // Thankfully, during the first phase, we set everything up so that all we have to do here is compute the size-8 FFT columns and write them back out where we got them for columnset in 0usize..4 { let mut rows = [AvxVector::zero(); 8]; for r in 0..8 { rows[r] = buffer.load_complex(columnset * 2 + 8 * r); } let mid = AvxVector::column_butterfly8(rows, self.twiddles_butterfly4); for r in 0..8 { buffer.store_complex(mid[r], columnset * 2 + 8 * r); } } } } pub struct Butterfly128Avx64 { twiddles: [__m256d; 56], twiddles_butterfly16: [__m256d; 2], twiddles_butterfly4: Rotation90<__m256d>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly_with_scratch!(Butterfly128Avx64, 128); impl Butterfly128Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_separated_columns!(8, 16, 0, direction), twiddles_butterfly16: [ AvxVector::broadcast_twiddle(1, 16, direction), AvxVector::broadcast_twiddle(3, 16, direction), ], twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly128Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn column_butterflies_and_transpose( &self, input: &[Complex], mut output: &mut [Complex], ) { // A size-128 FFT is way too big to fit in registers, so instead we're going to compute it in two phases, storing in scratch in between. 
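// (Here 128 = 8 * 16: this first phase runs size-8 FFTs over 16 columns, and after the transpose the second phase in
// row_butterflies() runs size-16 FFTs down the new columns with a stride of 8.)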
// First phase is to treat this size-128 array like a 16x8 2D array, and do butterfly 8's down the columns // Then, apply twiddle factors, and finally transpose into the scratch space // But again, we don't have enough registers to load it all at once, so only load one column of AVX vectors at a time for columnset in 0..8 { let mut rows = [AvxVector::zero(); 8]; for r in 0..8 { rows[r] = input.load_complex(columnset * 2 + 16 * r); } // apply butterfly 8 let mut mid = AvxVector::column_butterfly8(rows, self.twiddles_butterfly4); // apply twiddle factors for r in 1..8 { mid[r] = AvxVector::mul_complex(mid[r], self.twiddles[r - 1 + 7 * columnset]); } // transpose let transposed = AvxVector::transpose8_packed(mid); // write out for i in 0..4 { output.store_complex(transposed[i * 2], columnset * 16 + i * 4); output.store_complex(transposed[i * 2 + 1], columnset * 16 + i * 4 + 2); } } } #[target_feature(enable = "avx", enable = "fma")] unsafe fn row_butterflies(&self, mut buffer: impl AvxArrayMut) { // Second phase: Butterfly 16's down the columns of our transposed array. // Thankfully, during the first phase, we set everything up so that all we have to do here is compute the size-16 FFT columns and write them back out where we got them // We're also using a customized butterfly16 function that is smarter about when it loads/stores data, to reduce register spilling for columnset in 0usize..4 { column_butterfly16_loadfn!( |index: usize| buffer.load_complex(columnset * 2 + index * 8), |data, index| buffer.store_complex(data, columnset * 2 + index * 8), self.twiddles_butterfly16, self.twiddles_butterfly4 ); } } } pub struct Butterfly256Avx64 { twiddles: [__m256d; 112], twiddles_butterfly32: [__m256d; 6], twiddles_butterfly4: Rotation90<__m256d>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly_with_scratch!(Butterfly256Avx64, 256); impl Butterfly256Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_separated_columns!(8, 32, 0, direction), twiddles_butterfly32: [ AvxVector::broadcast_twiddle(1, 32, direction), AvxVector::broadcast_twiddle(2, 32, direction), AvxVector::broadcast_twiddle(3, 32, direction), AvxVector::broadcast_twiddle(5, 32, direction), AvxVector::broadcast_twiddle(6, 32, direction), AvxVector::broadcast_twiddle(7, 32, direction), ], twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly256Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn column_butterflies_and_transpose( &self, input: &[Complex], mut output: &mut [Complex], ) { // A size-256 FFT is way too big to fit in registers, so instead we're going to compute it in two phases, storing in scratch in between. 
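// (Here 256 = 8 * 32: size-8 column FFTs over 32 columns first, then size-32 FFTs down the transposed columns via
// column_butterfly32_loadfn!, which takes load/store closures so it can interleave memory access with the math and
// reduce register spilling.)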
// First phase is to treeat this size-256 array like a 32x8 2D array, and do butterfly 8's down the columns // Then, apply twiddle factors, and finally transpose into the scratch space // But again, we don't have enough registers to load it all at once, so only load one column of AVX vectors at a time for columnset in 0..16 { let mut rows = [AvxVector::zero(); 8]; for r in 0..8 { rows[r] = input.load_complex(columnset * 2 + 32 * r); } // apply butterfly 8 let mut mid = AvxVector::column_butterfly8(rows, self.twiddles_butterfly4); // apply twiddle factors for r in 1..8 { mid[r] = AvxVector::mul_complex(mid[r], self.twiddles[r - 1 + 7 * columnset]); } // transpose let transposed = AvxVector::transpose8_packed(mid); // write out for i in 0..4 { output.store_complex(transposed[i * 2], columnset * 16 + i * 4); output.store_complex(transposed[i * 2 + 1], columnset * 16 + i * 4 + 2); } } } #[target_feature(enable = "avx", enable = "fma")] unsafe fn row_butterflies(&self, mut buffer: impl AvxArrayMut) { // Second phase: Butterfly 32's down the columns of our transposed array. // Thankfully, during the first phase, we set everything up so that all we have to do here is compute the size-32 FFT columns and write them back out where we got them // We're also using a customized butterfly32 function that is smarter about when it loads/stores data, to reduce register spilling for columnset in 0usize..4 { column_butterfly32_loadfn!( |index: usize| buffer.load_complex(columnset * 2 + index * 8), |data, index| buffer.store_complex(data, columnset * 2 + index * 8), self.twiddles_butterfly32, self.twiddles_butterfly4 ); } } } pub struct Butterfly512Avx64 { twiddles: [__m256d; 240], twiddles_butterfly32: [__m256d; 6], twiddles_butterfly16: [__m256d; 2], twiddles_butterfly4: Rotation90<__m256d>, direction: FftDirection, _phantom_t: std::marker::PhantomData, } boilerplate_fft_simd_butterfly_with_scratch!(Butterfly512Avx64, 512); impl Butterfly512Avx64 { #[target_feature(enable = "avx")] unsafe fn new_with_avx(direction: FftDirection) -> Self { Self { twiddles: gen_butterfly_twiddles_separated_columns!(16, 32, 0, direction), twiddles_butterfly32: [ AvxVector::broadcast_twiddle(1, 32, direction), AvxVector::broadcast_twiddle(2, 32, direction), AvxVector::broadcast_twiddle(3, 32, direction), AvxVector::broadcast_twiddle(5, 32, direction), AvxVector::broadcast_twiddle(6, 32, direction), AvxVector::broadcast_twiddle(7, 32, direction), ], twiddles_butterfly16: [ AvxVector::broadcast_twiddle(1, 16, direction), AvxVector::broadcast_twiddle(3, 16, direction), ], twiddles_butterfly4: AvxVector::make_rotation90(direction), direction, _phantom_t: PhantomData, } } } impl Butterfly512Avx64 { #[target_feature(enable = "avx", enable = "fma")] unsafe fn column_butterflies_and_transpose( &self, input: &[Complex], mut output: &mut [Complex], ) { // A size-512 FFT is way too big to fit in registers, so instead we're going to compute it in two phases, storing in scratch in between. 
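// (Here 512 = 16 * 32, and the 240 twiddles are stored as 16 column-sets of 15, one twiddle per non-trivial row of the
// size-16 column FFT, which is why the loop below walks self.twiddles with chunks_exact(15).)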
// First phase is to treat this size-512 array like a 32x16 2D array, and do butterfly 16's down the columns // Then, apply twiddle factors, and finally transpose into the scratch space // But again, we don't have enough registers to load it all at once, so only load one column of AVX vectors at a time // We're also using a customized butterfly16 function that is smarter about when it loads/stores data, to reduce register spilling const TWIDDLES_PER_COLUMN: usize = 15; for (columnset, twiddle_chunk) in self.twiddles.chunks_exact(TWIDDLES_PER_COLUMN).enumerate() { // Sadly we have to use MaybeUninit here. If we init an array like normal with AvxVector::Zero(), the compiler can't seem to figure out that it can // eliminate the dead stores of zeroes to the stack. By using uninit here, we avoid those unnecessary writes let mut mid_uninit: [MaybeUninit<__m256d>; 16] = [MaybeUninit::<__m256d>::uninit(); 16]; column_butterfly16_loadfn!( |index: usize| input.load_complex(columnset * 2 + 32 * index), |data, index: usize| { mid_uninit[index].as_mut_ptr().write(data); }, self.twiddles_butterfly16, self.twiddles_butterfly4 ); // Apply twiddle factors, transpose, and store. Traditionally we apply all the twiddle factors at once and then do all the transposes at once, // But our data is pushing the limit of what we can store in registers, so the idea here is to get the data out the door with as few spills to the stack as possible for chunk in 0..4 { let twiddled = [ if chunk > 0 { AvxVector::mul_complex( mid_uninit[4 * chunk].assume_init(), twiddle_chunk[4 * chunk - 1], ) } else { mid_uninit[4 * chunk].assume_init() }, AvxVector::mul_complex( mid_uninit[4 * chunk + 1].assume_init(), twiddle_chunk[4 * chunk], ), AvxVector::mul_complex( mid_uninit[4 * chunk + 2].assume_init(), twiddle_chunk[4 * chunk + 1], ), AvxVector::mul_complex( mid_uninit[4 * chunk + 3].assume_init(), twiddle_chunk[4 * chunk + 2], ), ]; let transposed = AvxVector::transpose4_packed(twiddled); output.store_complex(transposed[0], columnset * 32 + 4 * chunk); output.store_complex(transposed[1], columnset * 32 + 4 * chunk + 2); output.store_complex(transposed[2], columnset * 32 + 4 * chunk + 16); output.store_complex(transposed[3], columnset * 32 + 4 * chunk + 18); } } } #[target_feature(enable = "avx", enable = "fma")] unsafe fn row_butterflies(&self, mut buffer: impl AvxArrayMut) { // Second phase: Butterfly 32's down the columns of our transposed array. // Thankfully, during the first phase, we set everything up so that all we have to do here is compute the size-32 FFT columns and write them back out where we got them // We're also using a customized butterfly32 function that is smarter about when it loads/stores data, to reduce register spilling for columnset in 0usize..8 { column_butterfly32_loadfn!( |index: usize| buffer.load_complex(columnset * 2 + index * 16), |data, index| buffer.store_complex(data, columnset * 2 + index * 16), self.twiddles_butterfly32, self.twiddles_butterfly4 ); } } } #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; macro_rules! 
test_avx_butterfly { ($test_name:ident, $struct_name:ident, $size:expr) => ( #[test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward).expect("Can't run test because this machine doesn't have the required instruction sets"); check_fft_algorithm(&butterfly as &dyn Fft, $size, FftDirection::Forward); let butterfly_inverse = $struct_name::new(FftDirection::Inverse).expect("Can't run test because this machine doesn't have the required instruction sets"); check_fft_algorithm(&butterfly_inverse as &dyn Fft, $size, FftDirection::Inverse); } ) } test_avx_butterfly!(test_avx_butterfly5_f64, Butterfly5Avx64, 5); test_avx_butterfly!(test_avx_butterfly7_f64, Butterfly7Avx64, 7); test_avx_butterfly!(test_avx_mixedradix8_f64, Butterfly8Avx64, 8); test_avx_butterfly!(test_avx_mixedradix9_f64, Butterfly9Avx64, 9); test_avx_butterfly!(test_avx_mixedradix11_f64, Butterfly11Avx64, 11); test_avx_butterfly!(test_avx_mixedradix12_f64, Butterfly12Avx64, 12); test_avx_butterfly!(test_avx_mixedradix16_f64, Butterfly16Avx64, 16); test_avx_butterfly!(test_avx_mixedradix18_f64, Butterfly18Avx64, 18); test_avx_butterfly!(test_avx_mixedradix24_f64, Butterfly24Avx64, 24); test_avx_butterfly!(test_avx_mixedradix27_f64, Butterfly27Avx64, 27); test_avx_butterfly!(test_avx_mixedradix32_f64, Butterfly32Avx64, 32); test_avx_butterfly!(test_avx_mixedradix36_f64, Butterfly36Avx64, 36); test_avx_butterfly!(test_avx_mixedradix64_f64, Butterfly64Avx64, 64); test_avx_butterfly!(test_avx_mixedradix128_f64, Butterfly128Avx64, 128); test_avx_butterfly!(test_avx_mixedradix256_f64, Butterfly256Avx64, 256); test_avx_butterfly!(test_avx_mixedradix512_f64, Butterfly512Avx64, 512); } rustfft-6.2.0/src/avx/avx64_utils.rs000064400000000000000000000262630072674642500155120ustar 00000000000000use std::arch::x86_64::*; // Treat the input like the rows of a 2x2 array, and transpose said rows to the columns #[inline(always)] pub unsafe fn transpose_2x2_f64(rows: [__m256d; 2]) -> [__m256d; 2] { let col0 = _mm256_permute2f128_pd(rows[0], rows[1], 0x20); let col1 = _mm256_permute2f128_pd(rows[0], rows[1], 0x31); [col0, col1] } // Treat the input like the rows of a 4x2 array, and transpose it to a 2x4 array #[inline(always)] pub unsafe fn transpose_4x2_to_2x4_f64(rows0: [__m256d; 2], rows1: [__m256d; 2]) -> [__m256d; 4] { let output00 = transpose_2x2_f64(rows0); let output01 = transpose_2x2_f64(rows1); [output00[0], output00[1], output01[0], output01[1]] } // Treat the input like the rows of a 3x3 array, and transpose it // The assumption here is that it's very likely that the caller wants to do some more AVX operations on the columns of the transposed array, so the output is arranged to make that more convenient #[inline(always)] pub unsafe fn transpose_3x3_f64( rows0: [__m128d; 3], rows1: [__m256d; 3], ) -> ([__m128d; 3], [__m256d; 3]) { // the first column of output will be made up of the first row of input let output0 = [ rows0[0], _mm256_castpd256_pd128(rows1[0]), _mm256_extractf128_pd(rows1[0], 1), ]; // the second column of output will be made of the second 2 rows of input let output10 = _mm256_permute2f128_pd( _mm256_castpd128_pd256(rows0[1]), _mm256_castpd128_pd256(rows0[2]), 0x20, ); let lower_chunk = [rows1[1], rows1[2]]; let lower_transposed = transpose_2x2_f64(lower_chunk); let output1 = [output10, lower_transposed[0], lower_transposed[1]]; (output0, output1) } // Treat the input like the rows of a 3x4 array, and transpose it to a 4x3 array // The assumption here is that the caller wants to do some more AVX 
operations on the columns of the transposed array, so the output is arranged to make that more convenient #[inline(always)] pub unsafe fn transpose_3x4_to_4x3_f64( rows0: [__m128d; 4], rows1: [__m256d; 4], ) -> ([__m256d; 3], [__m256d; 3]) { // the top row of each output array will come from the first column, and the second 2 rows will come from 2x2 transposing the rows1 array let merged0 = _mm256_permute2f128_pd( _mm256_castpd128_pd256(rows0[0]), _mm256_castpd128_pd256(rows0[1]), 0x20, ); let merged1 = _mm256_permute2f128_pd( _mm256_castpd128_pd256(rows0[2]), _mm256_castpd128_pd256(rows0[3]), 0x20, ); let chunk0 = [rows1[0], rows1[1]]; let chunk1 = [rows1[2], rows1[3]]; let lower0 = transpose_2x2_f64(chunk0); let lower1 = transpose_2x2_f64(chunk1); ( [merged0, lower0[0], lower0[1]], [merged1, lower1[0], lower1[1]], ) } // Treat the input like the rows of a 3x6 array, and transpose it to a 6x3 array // The assumption here is that caller wants to do some more AVX operations on the columns of the transposed array, so the output is arranged to make that more convenient #[inline(always)] pub unsafe fn transpose_3x6_to_6x3_f64( rows0: [__m128d; 6], rows1: [__m256d; 6], ) -> ([__m256d; 3], [__m256d; 3], [__m256d; 3]) { let chunk0 = [rows1[0], rows1[1]]; let chunk1 = [rows1[2], rows1[3]]; let chunk2 = [rows1[4], rows1[5]]; let transposed0 = transpose_2x2_f64(chunk0); let transposed1 = transpose_2x2_f64(chunk1); let transposed2 = transpose_2x2_f64(chunk2); let output0 = [ _mm256_insertf128_pd(_mm256_castpd128_pd256(rows0[0]), rows0[1], 1), transposed0[0], transposed0[1], ]; let output1 = [ _mm256_insertf128_pd(_mm256_castpd128_pd256(rows0[2]), rows0[3], 1), transposed1[0], transposed1[1], ]; let output2 = [ _mm256_insertf128_pd(_mm256_castpd128_pd256(rows0[4]), rows0[5], 1), transposed2[0], transposed2[1], ]; (output0, output1, output2) } // Treat the input like the rows of a 9x3 array, and transpose it to a 3x9 array // The assumption here is that caller wants to do some more AVX operations on the columns of the transposed array, so the output is arranged to make that more convenient #[inline(always)] pub unsafe fn transpose_9x3_to_3x9_f64( rows0: [__m128d; 3], rows1: [__m256d; 3], rows2: [__m256d; 3], rows3: [__m256d; 3], rows4: [__m256d; 3], ) -> ([__m128d; 9], [__m256d; 9]) { let chunk1 = [rows1[1], rows1[2]]; let chunk2 = [rows2[1], rows2[2]]; let chunk3 = [rows3[1], rows3[2]]; let chunk4 = [rows4[1], rows4[2]]; let transposed1 = transpose_2x2_f64(chunk1); let transposed2 = transpose_2x2_f64(chunk2); let transposed3 = transpose_2x2_f64(chunk3); let transposed4 = transpose_2x2_f64(chunk4); let output0 = [ rows0[0], _mm256_castpd256_pd128(rows1[0]), _mm256_extractf128_pd(rows1[0], 1), _mm256_castpd256_pd128(rows2[0]), _mm256_extractf128_pd(rows2[0], 1), _mm256_castpd256_pd128(rows3[0]), _mm256_extractf128_pd(rows3[0], 1), _mm256_castpd256_pd128(rows4[0]), _mm256_extractf128_pd(rows4[0], 1), ]; let output1 = [ _mm256_insertf128_pd(_mm256_castpd128_pd256(rows0[1]), rows0[2], 1), transposed1[0], transposed1[1], transposed2[0], transposed2[1], transposed3[0], transposed3[1], transposed4[0], transposed4[1], ]; (output0, output1) } // Treat the input like the rows of a 4x4 array, and transpose said rows to the columns // The assumption here is that the caller wants to do some more AVX operations on the columns of the transposed array, so the output is arranged to make that more convenient #[inline(always)] pub unsafe fn transpose_4x4_f64( rows0: [__m256d; 4], rows1: [__m256d; 4], ) -> ([__m256d; 4], 
[__m256d; 4]) { let chunk00 = [rows0[0], rows0[1]]; let chunk01 = [rows0[2], rows0[3]]; let chunk10 = [rows1[0], rows1[1]]; let chunk11 = [rows1[2], rows1[3]]; let output00 = transpose_2x2_f64(chunk00); let output01 = transpose_2x2_f64(chunk10); let output10 = transpose_2x2_f64(chunk01); let output11 = transpose_2x2_f64(chunk11); ( [output00[0], output00[1], output01[0], output01[1]], [output10[0], output10[1], output11[0], output11[1]], ) } // Treat the input like the rows of a 6x6 array, and transpose said rows to the columns // The assumption here is that the caller wants to do some more AVX operations on the columns of the transposed array, so the output is arranged to make that more convenient #[inline(always)] pub unsafe fn transpose_6x6_f64( rows0: [__m256d; 6], rows1: [__m256d; 6], rows2: [__m256d; 6], ) -> ([__m256d; 6], [__m256d; 6], [__m256d; 6]) { let chunk00 = [rows0[0], rows0[1]]; let chunk01 = [rows0[2], rows0[3]]; let chunk02 = [rows0[4], rows0[5]]; let chunk10 = [rows1[0], rows1[1]]; let chunk11 = [rows1[2], rows1[3]]; let chunk12 = [rows1[4], rows1[5]]; let chunk20 = [rows2[0], rows2[1]]; let chunk21 = [rows2[2], rows2[3]]; let chunk22 = [rows2[4], rows2[5]]; let output00 = transpose_2x2_f64(chunk00); let output01 = transpose_2x2_f64(chunk10); let output02 = transpose_2x2_f64(chunk20); let output10 = transpose_2x2_f64(chunk01); let output11 = transpose_2x2_f64(chunk11); let output12 = transpose_2x2_f64(chunk21); let output20 = transpose_2x2_f64(chunk02); let output21 = transpose_2x2_f64(chunk12); let output22 = transpose_2x2_f64(chunk22); ( [ output00[0], output00[1], output01[0], output01[1], output02[0], output02[1], ], [ output10[0], output10[1], output11[0], output11[1], output12[0], output12[1], ], [ output20[0], output20[1], output21[0], output21[1], output22[0], output22[1], ], ) } // Treat the input like the rows of a 6x4 array, and transpose said rows to the columns // The assumption here is that the caller wants to do some more AVX operations on the columns of the transposed array, so the output is arranged to make that more convenient #[inline(always)] pub unsafe fn transpose_6x4_to_4x6_f64( rows0: [__m256d; 4], rows1: [__m256d; 4], rows2: [__m256d; 4], ) -> ([__m256d; 6], [__m256d; 6]) { let chunk00 = [rows0[0], rows0[1]]; let chunk01 = [rows0[2], rows0[3]]; let chunk10 = [rows1[0], rows1[1]]; let chunk11 = [rows1[2], rows1[3]]; let chunk20 = [rows2[0], rows2[1]]; let chunk21 = [rows2[2], rows2[3]]; let output00 = transpose_2x2_f64(chunk00); let output01 = transpose_2x2_f64(chunk10); let output02 = transpose_2x2_f64(chunk20); let output10 = transpose_2x2_f64(chunk01); let output11 = transpose_2x2_f64(chunk11); let output12 = transpose_2x2_f64(chunk21); ( [ output00[0], output00[1], output01[0], output01[1], output02[0], output02[1], ], [ output10[0], output10[1], output11[0], output11[1], output12[0], output12[1], ], ) } // Treat the input like the rows of a 8x4 array, and transpose it to a 4x8 array // The assumption here is that it's very likely that the caller wants to do some more AVX operations on the columns of the transposed array, so the output is arranged to make that more convenient #[inline(always)] pub unsafe fn transpose_8x4_to_4x8_f64( rows0: [__m256d; 4], rows1: [__m256d; 4], rows2: [__m256d; 4], rows3: [__m256d; 4], ) -> ([__m256d; 8], [__m256d; 8]) { let chunk00 = [rows0[0], rows0[1]]; let chunk01 = [rows0[2], rows0[3]]; let chunk10 = [rows1[0], rows1[1]]; let chunk11 = [rows1[2], rows1[3]]; let chunk20 = [rows2[0], rows2[1]]; let chunk21 = 
[rows2[2], rows2[3]]; let chunk30 = [rows3[0], rows3[1]]; let chunk31 = [rows3[2], rows3[3]]; let output00 = transpose_2x2_f64(chunk00); let output01 = transpose_2x2_f64(chunk10); let output02 = transpose_2x2_f64(chunk20); let output03 = transpose_2x2_f64(chunk30); let output10 = transpose_2x2_f64(chunk01); let output11 = transpose_2x2_f64(chunk11); let output12 = transpose_2x2_f64(chunk21); let output13 = transpose_2x2_f64(chunk31); ( [ output00[0], output00[1], output01[0], output01[1], output02[0], output02[1], output03[0], output03[1], ], [ output10[0], output10[1], output11[0], output11[1], output12[0], output12[1], output13[0], output13[1], ], ) } rustfft-6.2.0/src/avx/avx_bluesteins.rs000064400000000000000000000517340072674642500163560ustar 00000000000000use std::any::TypeId; use std::sync::Arc; use num_complex::Complex; use num_integer::div_ceil; use num_traits::Zero; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::{array_utils, twiddles, FftDirection}; use crate::{Direction, Fft, FftNum, Length}; use super::CommonSimdData; use super::{ avx_vector::{AvxArray, AvxArrayMut, AvxVector, AvxVector128, AvxVector256}, AvxNum, }; /// Implementation of Bluestein's Algorithm /// /// This algorithm computes an arbitrary-sized FFT in O(nlogn) time. It does this by converting this size n FFT into a /// size M where M >= 2N - 1. The most obvious choice for M is a power of two, although that isn't a requirement. /// /// It requires a large scratch space, so it's probably inconvenient to use as an inner FFT to other algorithms. /// /// Bluestein's Algorithm is relatively expensive compared to other FFT algorithms. Benchmarking shows that it is up to /// an order of magnitude slower than similar composite sizes. pub struct BluesteinsAvx { inner_fft_multiplier: Box<[A::VectorType]>, common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(BluesteinsAvx); impl BluesteinsAvx { /// Pairwise multiply the complex numbers in `left` with the complex numbers in `right`. /// This is exactly the same as `mul_complex` in `AvxVector`, but this implementation also conjugates the `left` input before multiplying #[inline(always)] unsafe fn mul_complex_conjugated(left: V, right: V) -> V { // Extract the real and imaginary components from left into 2 separate registers let (left_real, left_imag) = V::duplicate_complex_components(left); // create a shuffled version of right where the imaginary values are swapped with the reals let right_shuffled = V::swap_complex_components(right); // multiply our duplicated imaginary left vector by our shuffled right vector. that will give us the right side of the traditional complex multiplication formula let output_right = V::mul(left_imag, right_shuffled); // use a FMA instruction to multiply together left side of the complex multiplication formula, then alternatingly add and subtract the left side from the right // By using subadd instead of addsub, we can conjugate the left side for free. V::fmsubadd(left_real, right, output_right) } /// Preallocates necessary arrays and precomputes necessary data to efficiently compute the FFT /// Returns Ok() if this machine has the required instruction sets, Err() if some instruction sets are missing #[inline] pub fn new(len: usize, inner_fft: Arc>) -> Result { // Internal sanity check: Make sure that A == T. // This struct has two generic parameters A and T, but they must always be the same, and are only kept separate to help work around the lack of specialization. 
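// The TypeId comparison just below enforces that invariant at construction time.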
// It would be cool if we could do this as a static_assert instead let id_a = TypeId::of::(); let id_t = TypeId::of::(); assert_eq!(id_a, id_t); let has_avx = is_x86_feature_detected!("avx"); let has_fma = is_x86_feature_detected!("fma"); if has_avx && has_fma { // Safety: new_with_avx requires the "avx" feature set. Since we know it's present, we're safe Ok(unsafe { Self::new_with_avx(len, inner_fft) }) } else { Err(()) } } #[target_feature(enable = "avx")] unsafe fn new_with_avx(len: usize, inner_fft: Arc>) -> Self { let inner_fft_len = inner_fft.len(); assert!(len * 2 - 1 <= inner_fft_len, "Bluestein's algorithm requires inner_fft.len() >= self.len() * 2 - 1. Expected >= {}, got {}", len * 2 - 1, inner_fft_len); assert_eq!(inner_fft_len % A::VectorType::COMPLEX_PER_VECTOR, 0, "BluesteinsAvx requires its inner_fft.len() to be a multiple of {} (IE the number of complex numbers in a single vector) inner_fft.len() = {}", A::VectorType::COMPLEX_PER_VECTOR, inner_fft_len); // when computing FFTs, we're going to run our inner multiply pairwise by some precomputed data, then run an inverse inner FFT. We need to precompute that inner data here let inner_fft_scale = A::one() / A::from_usize(inner_fft_len).unwrap(); let direction = inner_fft.fft_direction(); // Compute twiddle factors that we'll run our inner FFT on let mut inner_fft_input = vec![Complex::zero(); inner_fft_len]; twiddles::fill_bluesteins_twiddles( &mut inner_fft_input[..len], direction.opposite_direction(), ); // Scale the computed twiddles and copy them to the end of the array inner_fft_input[0] = inner_fft_input[0] * inner_fft_scale; for i in 1..len { let twiddle = inner_fft_input[i] * inner_fft_scale; inner_fft_input[i] = twiddle; inner_fft_input[inner_fft_len - i] = twiddle; } //Compute the inner fft let mut inner_fft_scratch = vec![Complex::zero(); inner_fft.get_inplace_scratch_len()]; { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_input: &mut [Complex] = array_utils::workaround_transmute_mut(&mut inner_fft_input); inner_fft.process_with_scratch(transmuted_input, &mut inner_fft_scratch); } // When computing the FFT, we'll want this array to be pre-conjugated, so conjugate it now let conjugation_mask = AvxVector256::broadcast_complex_elements(Complex::new(A::zero(), -A::zero())); let inner_fft_multiplier = inner_fft_input .chunks_exact(A::VectorType::COMPLEX_PER_VECTOR) .map(|chunk| { let chunk_vector = chunk.load_complex(0); AvxVector::xor(chunk_vector, conjugation_mask) // compute our conjugation by xoring our data with a precomputed mask }) .collect::>() .into_boxed_slice(); // also compute some more mundane twiddle factors to start and end with. 
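// These are the quadratic-phase "chirp" factors of Bluestein's algorithm; they get applied to the input in
// prepare_bluesteins() and again to the output in finalize_bluesteins(). As a rough outline of the whole computation,
// in illustrative pseudocode rather than this crate's API:
//     a[n] = input[n] * chirp[n]        // prepare_bluesteins
//     A = FFT(a); A[k] *= B[k]          // pairwise multiply with the precomputed FFT of the chirp
//     a = IFFT(A)                       // done here as a forward FFT with conjugated inputs and outputs
//     output[n] = a[n] * chirp[n]       // finalize_bluesteins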
let chunk_count = div_ceil(len, A::VectorType::COMPLEX_PER_VECTOR); let twiddle_count = chunk_count * A::VectorType::COMPLEX_PER_VECTOR; let mut twiddles_scalar: Vec> = vec![Complex::zero(); twiddle_count]; twiddles::fill_bluesteins_twiddles(&mut twiddles_scalar[..len], direction); // We have the twiddles in scalar format, last step is to copy them over to AVX vectors let twiddles: Vec<_> = twiddles_scalar .chunks_exact(A::VectorType::COMPLEX_PER_VECTOR) .map(|chunk| chunk.load_complex(0)) .collect(); let required_scratch = inner_fft_input.len() + inner_fft_scratch.len(); Self { inner_fft_multiplier, common_data: CommonSimdData { inner_fft, twiddles: twiddles.into_boxed_slice(), len, inplace_scratch_len: required_scratch, outofplace_scratch_len: required_scratch, direction, }, _phantom: std::marker::PhantomData, } } // Do the necessary setup for bluestein's algorithm: copy the data to the inner buffers, apply some twiddle factors, zero out the rest of the inner buffer #[target_feature(enable = "avx", enable = "fma")] unsafe fn prepare_bluesteins( &self, input: &[Complex], mut inner_fft_buffer: &mut [Complex], ) { let chunk_count = self.common_data.twiddles.len() - 1; let remainder = self.len() - chunk_count * A::VectorType::COMPLEX_PER_VECTOR; // Copy the buffer into our inner FFT input, applying twiddle factors as we go. the buffer will only fill part of the FFT input, so zero fill the rest for (i, twiddle) in self.common_data.twiddles[..chunk_count].iter().enumerate() { let index = i * A::VectorType::COMPLEX_PER_VECTOR; let input_vector = input.load_complex(index); let product_vector = AvxVector::mul_complex(input_vector, *twiddle); inner_fft_buffer.store_complex(product_vector, index); } // the buffer will almost certainly have a remainder. it's so likely, in fact, that we're just going to apply a remainder unconditionally // it uses a couple more instructions in the rare case when our FFT size is a multiple of 4, but saves instructions when it's not { let remainder_twiddle = self.common_data.twiddles[chunk_count]; let remainder_index = chunk_count * A::VectorType::COMPLEX_PER_VECTOR; let remainder_data = match remainder { 1 => input.load_partial1_complex(remainder_index).zero_extend(), 2 => { if A::VectorType::COMPLEX_PER_VECTOR == 2 { input.load_complex(remainder_index) } else { input.load_partial2_complex(remainder_index).zero_extend() } } 3 => input.load_partial3_complex(remainder_index), 4 => input.load_complex(remainder_index), _ => unreachable!(), }; let twiddled_remainder = AvxVector::mul_complex(remainder_twiddle, remainder_data); inner_fft_buffer.store_complex(twiddled_remainder, remainder_index); } // zero fill the rest of the `inner` array let zerofill_start = chunk_count + 1; for i in zerofill_start..(inner_fft_buffer.len() / A::VectorType::COMPLEX_PER_VECTOR) { let index = i * A::VectorType::COMPLEX_PER_VECTOR; inner_fft_buffer.store_complex(AvxVector::zero(), index); } } // Do the necessary finalization for bluestein's algorithm: Conjugate the inner FFT buffer, apply some twiddle factors, zero out the rest of the inner buffer #[target_feature(enable = "avx", enable = "fma")] unsafe fn finalize_bluesteins( &self, inner_fft_buffer: &[Complex], mut output: &mut [Complex], ) { let chunk_count = self.common_data.twiddles.len() - 1; let remainder = self.len() - chunk_count * A::VectorType::COMPLEX_PER_VECTOR; // copy our data to the output, applying twiddle factors again as we go. 
Also conjugate inner_fft_buffer to complete the inverse FFT for (i, twiddle) in self.common_data.twiddles[..chunk_count].iter().enumerate() { let index = i * A::VectorType::COMPLEX_PER_VECTOR; let inner_vector = inner_fft_buffer.load_complex(index); let product_vector = Self::mul_complex_conjugated(inner_vector, *twiddle); output.store_complex(product_vector, index); } // again, unconditionally apply a remainder { let remainder_twiddle = self.common_data.twiddles[chunk_count]; let remainder_index = chunk_count * A::VectorType::COMPLEX_PER_VECTOR; let inner_vector = inner_fft_buffer.load_complex(remainder_index); let product_vector = Self::mul_complex_conjugated(inner_vector, remainder_twiddle); match remainder { 1 => output.store_partial1_complex(product_vector.lo(), remainder_index), 2 => { if A::VectorType::COMPLEX_PER_VECTOR == 2 { output.store_complex(product_vector, remainder_index) } else { output.store_partial2_complex(product_vector.lo(), remainder_index) } } 3 => output.store_partial3_complex(product_vector, remainder_index), 4 => output.store_complex(product_vector, remainder_index), _ => unreachable!(), }; } } // compute buffer[i] = buffer[i].conj() * multiplier[i] pairwise complex multiplication for each element. #[target_feature(enable = "avx", enable = "fma")] unsafe fn pairwise_complex_multiply_conjugated( mut buffer: impl AvxArrayMut, multiplier: &[A::VectorType], ) { for (i, right) in multiplier.iter().enumerate() { let left = buffer.load_complex(i * A::VectorType::COMPLEX_PER_VECTOR); // Do a complex multiplication between `left` and `right` let product = Self::mul_complex_conjugated(left, *right); // Store the result buffer.store_complex(product, i * A::VectorType::COMPLEX_PER_VECTOR); } } fn perform_fft_inplace(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { let (inner_input, inner_scratch) = scratch .split_at_mut(self.inner_fft_multiplier.len() * A::VectorType::COMPLEX_PER_VECTOR); // do the necessary setup for bluestein's algorithm: copy the data to the inner buffers, apply some twiddle factors, zero out the rest of the inner buffer unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_buffer: &mut [Complex] = array_utils::workaround_transmute_mut(buffer); let transmuted_inner_input: &mut [Complex] = array_utils::workaround_transmute_mut(inner_input); self.prepare_bluesteins(transmuted_buffer, transmuted_inner_input); } // run our inner forward FFT self.common_data .inner_fft .process_with_scratch(inner_input, inner_scratch); // Multiply our inner FFT output by our precomputed data. Then, conjugate the result to set up for an inverse FFT. // We can conjugate the result of multiplication by conjugating both inputs. We pre-conjugated the multiplier array, // so we just need to conjugate inner_input, which the pairwise_complex_multiply_conjugated function will handle unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_inner_input: &mut [Complex] = array_utils::workaround_transmute_mut(inner_input); Self::pairwise_complex_multiply_conjugated( transmuted_inner_input, &self.inner_fft_multiplier, ) }; // inverse FFT. 
we're computing a forward but we're massaging it into an inverse by conjugating the inputs and outputs self.common_data .inner_fft .process_with_scratch(inner_input, inner_scratch); // finalize the result unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_buffer: &mut [Complex] = array_utils::workaround_transmute_mut(buffer); let transmuted_inner_input: &mut [Complex] = array_utils::workaround_transmute_mut(inner_input); self.finalize_bluesteins(transmuted_inner_input, transmuted_buffer); } } fn perform_fft_out_of_place( &self, input: &[Complex], output: &mut [Complex], scratch: &mut [Complex], ) { let (inner_input, inner_scratch) = scratch .split_at_mut(self.inner_fft_multiplier.len() * A::VectorType::COMPLEX_PER_VECTOR); // do the necessary setup for bluestein's algorithm: copy the data to the inner buffers, apply some twiddle factors, zero out the rest of the inner buffer unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_input: &[Complex] = array_utils::workaround_transmute(input); let transmuted_inner_input: &mut [Complex] = array_utils::workaround_transmute_mut(inner_input); self.prepare_bluesteins(transmuted_input, transmuted_inner_input) } // run our inner forward FFT self.common_data .inner_fft .process_with_scratch(inner_input, inner_scratch); // Multiply our inner FFT output by our precomputed data. Then, conjugate the result to set up for an inverse FFT. // We can conjugate the result of multiplication by conjugating both inputs. We pre-conjugated the multiplier array, // so we just need to conjugate inner_input, which the pairwise_complex_multiply_conjugated function will handle unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_inner_input: &mut [Complex] = array_utils::workaround_transmute_mut(inner_input); Self::pairwise_complex_multiply_conjugated( transmuted_inner_input, &self.inner_fft_multiplier, ) }; // inverse FFT. we're computing a forward but we're massaging it into an inverse by conjugating the inputs and outputs self.common_data .inner_fft .process_with_scratch(inner_input, inner_scratch); // finalize the result unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_output: &mut [Complex] = array_utils::workaround_transmute_mut(output); let transmuted_inner_input: &mut [Complex] = array_utils::workaround_transmute_mut(inner_input); self.finalize_bluesteins(transmuted_inner_input, transmuted_output) } } } #[cfg(test)] mod unit_tests { use num_traits::Float; use rand::distributions::uniform::SampleUniform; use super::*; use crate::algorithm::Dft; use crate::test_utils::check_fft_algorithm; use std::sync::Arc; #[test] fn test_bluesteins_avx_f32() { for len in 2..16 { // for this len, compute the range of inner FFT lengths we'll use. // Bluesteins AVX f32 requires a multiple of 4 for the inner FFT, so we need to go up to the next multiple of 4 from the minimum let minimum_inner: usize = len * 2 - 1; let remainder = minimum_inner % 4; // remainder will never be 0, because "n * 2 - 1" is guaranteed to be odd. so we can just subtract the remainder and add 4. 
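// e.g. len = 5: minimum_inner = 9, remainder = 1, so the first inner length tried is 9 - 1 + 4 = 12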
let next_multiple_of_4 = minimum_inner - remainder + 4; let maximum_inner = minimum_inner.checked_next_power_of_two().unwrap() + 1; // start at the next multiple of 4, and increment by 4 unti lwe get to the next power of 2. for inner_len in (next_multiple_of_4..maximum_inner).step_by(4) { test_bluesteins_avx_with_length::(len, inner_len, FftDirection::Forward); test_bluesteins_avx_with_length::(len, inner_len, FftDirection::Inverse); } } } #[test] fn test_bluesteins_avx_f64() { for len in 2..16 { // for this len, compute the range of inner FFT lengths we'll use. // Bluesteins AVX f64 requires a multiple of 2 for the inner FFT, so we need to go up to the next multiple of 2 from the minimum let minimum_inner: usize = len * 2 - 1; let remainder = minimum_inner % 2; let next_multiple_of_2 = minimum_inner + remainder; let maximum_inner = minimum_inner.checked_next_power_of_two().unwrap() + 1; // start at the next multiple of 2, and increment by 2 unti lwe get to the next power of 2. for inner_len in (next_multiple_of_2..maximum_inner).step_by(2) { test_bluesteins_avx_with_length::(len, inner_len, FftDirection::Forward); test_bluesteins_avx_with_length::(len, inner_len, FftDirection::Inverse); } } } fn test_bluesteins_avx_with_length( len: usize, inner_len: usize, direction: FftDirection, ) { let inner_fft = Arc::new(Dft::new(inner_len, direction)); let fft: BluesteinsAvx = BluesteinsAvx::new(len, inner_fft).expect( "Can't run test because this machine doesn't have the required instruction sets", ); check_fft_algorithm::(&fft, len, direction); } } rustfft-6.2.0/src/avx/avx_mixed_radix.rs000064400000000000000000001222320072674642500164660ustar 00000000000000use std::any::TypeId; use std::sync::Arc; use num_complex::Complex; use num_integer::div_ceil; use crate::array_utils; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::{Direction, Fft, FftDirection, FftNum, Length}; use super::{AvxNum, CommonSimdData}; use super::avx_vector; use super::avx_vector::{AvxArray, AvxArrayMut, AvxVector, AvxVector128, AvxVector256, Rotation90}; macro_rules! boilerplate_mixedradix { () => { /// Preallocates necessary arrays and precomputes necessary data to efficiently compute the FFT /// Returns Ok() if this machine has the required instruction sets, Err() if some instruction sets are missing #[inline] pub fn new(inner_fft: Arc>) -> Result { // Internal sanity check: Make sure that A == T. // This struct has two generic parameters A and T, but they must always be the same, and are only kept separate to help work around the lack of specialization. // It would be cool if we could do this as a static_assert instead let id_a = TypeId::of::(); let id_t = TypeId::of::(); assert_eq!(id_a, id_t); let has_avx = is_x86_feature_detected!("avx"); let has_fma = is_x86_feature_detected!("fma"); if has_avx && has_fma { // Safety: new_with_avx requires the "avx" feature set. 
Since we know it's present, we're safe Ok(unsafe { Self::new_with_avx(inner_fft) }) } else { Err(()) } } #[inline] fn perform_fft_inplace(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { // Perform the column FFTs // Safety: self.perform_column_butterflies() requres the "avx" and "fma" instruction sets, and we return Err() in our constructor if the instructions aren't available unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_buffer: &mut [Complex] = array_utils::workaround_transmute_mut(buffer); self.perform_column_butterflies(transmuted_buffer) } // process the row FFTs let (scratch, inner_scratch) = scratch.split_at_mut(self.len()); self.common_data.inner_fft.process_outofplace_with_scratch( buffer, scratch, inner_scratch, ); // Transpose // Safety: self.transpose() requres the "avx" instruction set, and we return Err() in our constructor if the instructions aren't available unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_scratch: &mut [Complex] = array_utils::workaround_transmute_mut(scratch); let transmuted_buffer: &mut [Complex] = array_utils::workaround_transmute_mut(buffer); self.transpose(transmuted_scratch, transmuted_buffer) } } #[inline] fn perform_fft_out_of_place( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ) { // Perform the column FFTs // Safety: self.perform_column_butterflies() requires the "avx" and "fma" instruction sets, and we return Err() in our constructor if the instructions aren't available unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_input: &mut [Complex] = array_utils::workaround_transmute_mut(input); self.perform_column_butterflies(transmuted_input); } // process the row FFTs. If extra scratch was provided, pass it in. Otherwise, use the output. let inner_scratch = if scratch.len() > 0 { scratch } else { &mut output[..] }; self.common_data .inner_fft .process_with_scratch(input, inner_scratch); // Transpose // Safety: self.transpose() requires the "avx" instruction set, and we return Err() in our constructor if the instructions aren't available unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_input: &mut [Complex] = array_utils::workaround_transmute_mut(input); let transmuted_output: &mut [Complex] = array_utils::workaround_transmute_mut(output); self.transpose(transmuted_input, transmuted_output) } } }; } macro_rules! mixedradix_gen_data { ($row_count: expr, $inner_fft:expr) => {{ // Important constants const ROW_COUNT : usize = $row_count; const TWIDDLES_PER_COLUMN : usize = ROW_COUNT - 1; // derive some info from our inner FFT let direction = $inner_fft.fft_direction(); let len_per_row = $inner_fft.len(); let len = len_per_row * ROW_COUNT; // We're going to process each row of the FFT one AVX register at a time. 
We need to know how many AVX registers each row can fit, // and if the last register in each row going to have partial data (ie a remainder) let quotient = len_per_row / A::VectorType::COMPLEX_PER_VECTOR; let remainder = len_per_row % A::VectorType::COMPLEX_PER_VECTOR; // Compute our twiddle factors, and arrange them so that we can access them one column of AVX vectors at a time let num_twiddle_columns = quotient + div_ceil(remainder, A::VectorType::COMPLEX_PER_VECTOR); let mut twiddles = Vec::with_capacity(num_twiddle_columns * TWIDDLES_PER_COLUMN); for x in 0..num_twiddle_columns { for y in 1..ROW_COUNT { twiddles.push(AvxVector::make_mixedradix_twiddle_chunk(x * A::VectorType::COMPLEX_PER_VECTOR, y, len, direction)); } } let inner_outofplace_scratch = $inner_fft.get_outofplace_scratch_len(); let inner_inplace_scratch = $inner_fft.get_inplace_scratch_len(); CommonSimdData { twiddles: twiddles.into_boxed_slice(), inplace_scratch_len: len + inner_outofplace_scratch, outofplace_scratch_len: if inner_inplace_scratch > len { inner_inplace_scratch } else { 0 }, inner_fft: $inner_fft, len, direction, } }} } macro_rules! mixedradix_column_butterflies { ($row_count: expr, $butterfly_fn: expr, $butterfly_fn_lo: expr) => { #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_column_butterflies(&self, mut buffer: impl AvxArrayMut) { // How many rows this FFT has, ie 2 for 2xn, 4 for 4xn, etc const ROW_COUNT: usize = $row_count; const TWIDDLES_PER_COLUMN: usize = ROW_COUNT - 1; let len_per_row = self.len() / ROW_COUNT; let chunk_count = len_per_row / A::VectorType::COMPLEX_PER_VECTOR; // process the column FFTs for (c, twiddle_chunk) in self .common_data .twiddles .chunks_exact(TWIDDLES_PER_COLUMN) .take(chunk_count) .enumerate() { let index_base = c * A::VectorType::COMPLEX_PER_VECTOR; // Load columns from the buffer into registers let mut columns = [AvxVector::zero(); ROW_COUNT]; for i in 0..ROW_COUNT { columns[i] = buffer.load_complex(index_base + len_per_row * i); } // apply our butterfly function down the columns let output = $butterfly_fn(columns, self); // always write the first row directly back without twiddles buffer.store_complex(output[0], index_base); // for every other row, apply twiddle factors and then write back to memory for i in 1..ROW_COUNT { let twiddle = twiddle_chunk[i - 1]; let output = AvxVector::mul_complex(twiddle, output[i]); buffer.store_complex(output, index_base + len_per_row * i); } } // finally, we might have a remainder chunk // Normally, we can fit COMPLEX_PER_VECTOR complex numbers into an AVX register, but we only have `partial_remainder` columns left, so we need special logic to handle these final columns let partial_remainder = len_per_row % A::VectorType::COMPLEX_PER_VECTOR; if partial_remainder > 0 { let partial_remainder_base = chunk_count * A::VectorType::COMPLEX_PER_VECTOR; let partial_remainder_twiddle_base = self.common_data.twiddles.len() - TWIDDLES_PER_COLUMN; let final_twiddle_chunk = &self.common_data.twiddles[partial_remainder_twiddle_base..]; if partial_remainder > 2 { // Load 3 columns into full AVX vectors to process our remainder let mut columns = [AvxVector::zero(); ROW_COUNT]; for i in 0..ROW_COUNT { columns[i] = buffer.load_partial3_complex(partial_remainder_base + len_per_row * i); } // apply our butterfly function down the columns let mid = $butterfly_fn(columns, self); // always write the first row without twiddles buffer.store_partial3_complex(mid[0], partial_remainder_base); // for the remaining rows, apply twiddle factors 
and then write back to memory for i in 1..ROW_COUNT { let twiddle = final_twiddle_chunk[i - 1]; let output = AvxVector::mul_complex(twiddle, mid[i]); buffer.store_partial3_complex( output, partial_remainder_base + len_per_row * i, ); } } else { // Load 1 or 2 columns into half vectors to process our remainder. Thankfully, the compiler is smart enough to eliminate this branch on f64, since the partial remainder can only possibly be 1 let mut columns = [AvxVector::zero(); ROW_COUNT]; if partial_remainder == 1 { for i in 0..ROW_COUNT { columns[i] = buffer .load_partial1_complex(partial_remainder_base + len_per_row * i); } } else { for i in 0..ROW_COUNT { columns[i] = buffer .load_partial2_complex(partial_remainder_base + len_per_row * i); } } // apply our butterfly function down the columns let mut mid = $butterfly_fn_lo(columns, self); // apply twiddle factors for i in 1..ROW_COUNT { mid[i] = AvxVector::mul_complex(final_twiddle_chunk[i - 1].lo(), mid[i]); } // store output if partial_remainder == 1 { for i in 0..ROW_COUNT { buffer.store_partial1_complex( mid[i], partial_remainder_base + len_per_row * i, ); } } else { for i in 0..ROW_COUNT { buffer.store_partial2_complex( mid[i], partial_remainder_base + len_per_row * i, ); } } } } } }; } macro_rules! mixedradix_transpose{ ($row_count: expr, $transpose_fn: path, $transpose_fn_lo: path, $($unroll_workaround_index:expr);*, $($remainder3_unroll_workaround_index:expr);*) => ( // Transpose the input (treated as a nxc array) into the output (as a cxn array) #[target_feature(enable = "avx")] unsafe fn transpose(&self, input: &[Complex], mut output: &mut [Complex]) { const ROW_COUNT : usize = $row_count; let len_per_row = self.len() / ROW_COUNT; let chunk_count = len_per_row / A::VectorType::COMPLEX_PER_VECTOR; // transpose the scratch as a nx2 array into the buffer as an 2xn array for c in 0..chunk_count { let input_index_base = c*A::VectorType::COMPLEX_PER_VECTOR; let output_index_base = input_index_base * ROW_COUNT; // Load rows from the input into registers let mut rows : [A::VectorType; ROW_COUNT] = [AvxVector::zero(); ROW_COUNT]; for i in 0..ROW_COUNT { rows[i] = input.load_complex(input_index_base + len_per_row*i); } // transpose the rows to the columns let transposed = $transpose_fn(rows); // store the transposed rows contiguously // IE, unlike the way we loaded the data, which was to load it strided across each of our rows // we will not output it strided, but instead writing it out as a contiguous block // we are using a macro hack to manually unroll the loop, to work around this rustc bug: // https://github.com/rust-lang/rust/issues/71025 // if we don't manually unroll the loop, the compiler will insert unnecessary writes+reads to the stack which tank performance // once the compiler bug is fixed, this can be replaced by a "for i in 0..ROW_COUNT" loop $( output.store_complex(transposed[$unroll_workaround_index], output_index_base + A::VectorType::COMPLEX_PER_VECTOR * $unroll_workaround_index); )* } // transpose the remainder let input_index_base = chunk_count * A::VectorType::COMPLEX_PER_VECTOR; let output_index_base = input_index_base * ROW_COUNT; let partial_remainder = len_per_row % A::VectorType::COMPLEX_PER_VECTOR; if partial_remainder == 1 { // If the partial remainder is 1, there's no transposing to do - just gather from across the rows and store contiguously for i in 0..ROW_COUNT { let input_cell = input.get_unchecked(input_index_base + len_per_row*i); let output_cell = output.get_unchecked_mut(output_index_base + i); *output_cell 
= *input_cell; } } else if partial_remainder == 2 { // If the partial remainder is 2, use the provided transpose_lo function to do a transpose on half-vectors let mut rows = [AvxVector::zero(); ROW_COUNT]; for i in 0..ROW_COUNT { rows[i] = input.load_partial2_complex(input_index_base + len_per_row*i); } let transposed = $transpose_fn_lo(rows); // use the same macro hack as above to unroll the loop $( output.store_partial2_complex(transposed[$unroll_workaround_index], output_index_base + ::HalfVector::COMPLEX_PER_VECTOR * $unroll_workaround_index); )* } else if partial_remainder == 3 { // If the partial remainder is 3, we have to load full vectors, use the full transpose, and then write out a variable number of outputs let mut rows = [AvxVector::zero(); ROW_COUNT]; for i in 0..ROW_COUNT { rows[i] = input.load_partial3_complex(input_index_base + len_per_row*i); } // transpose the rows to the columns let transposed = $transpose_fn(rows); // We're going to write constant number of full vectors, and then some constant-sized partial vector // Sadly, because of rust limitations, we can't make full_vector_count a const, so we have to cross our fingers that the compiler optimizes it to a constant let element_count = 3*ROW_COUNT; let full_vector_count = element_count / A::VectorType::COMPLEX_PER_VECTOR; let final_remainder_count = element_count % A::VectorType::COMPLEX_PER_VECTOR; // write out our full vectors // we are using a macro hack to manually unroll the loop, to work around this rustc bug: // https://github.com/rust-lang/rust/issues/71025 // if we don't manually unroll the loop, the compiler will insert unnecessary writes+reads to the stack which tank performance // once the compiler bug is fixed, this can be replaced by a "for i in 0..full_vector_count" loop $( output.store_complex(transposed[$remainder3_unroll_workaround_index], output_index_base + A::VectorType::COMPLEX_PER_VECTOR * $remainder3_unroll_workaround_index); )* // write out our partial vector. 
again, this is a compile-time constant, even if we can't represent that within rust yet match final_remainder_count { 0 => {}, 1 => output.store_partial1_complex(transposed[full_vector_count].lo(), output_index_base + full_vector_count * A::VectorType::COMPLEX_PER_VECTOR), 2 => output.store_partial2_complex(transposed[full_vector_count].lo(), output_index_base + full_vector_count * A::VectorType::COMPLEX_PER_VECTOR), 3 => output.store_partial3_complex(transposed[full_vector_count], output_index_base + full_vector_count * A::VectorType::COMPLEX_PER_VECTOR), _ => unreachable!(), } } } )} pub struct MixedRadix2xnAvx { common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(MixedRadix2xnAvx); impl MixedRadix2xnAvx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(inner_fft: Arc>) -> Self { Self { common_data: mixedradix_gen_data!(2, inner_fft), _phantom: std::marker::PhantomData, } } mixedradix_column_butterflies!( 2, |columns, _: _| AvxVector::column_butterfly2(columns), |columns, _: _| AvxVector::column_butterfly2(columns) ); mixedradix_transpose!(2, AvxVector::transpose2_packed, AvxVector::transpose2_packed, 0;1, 0 ); boilerplate_mixedradix!(); } pub struct MixedRadix3xnAvx { twiddles_butterfly3: A::VectorType, common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(MixedRadix3xnAvx); impl MixedRadix3xnAvx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(inner_fft: Arc>) -> Self { Self { twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, inner_fft.fft_direction()), common_data: mixedradix_gen_data!(3, inner_fft), _phantom: std::marker::PhantomData, } } mixedradix_column_butterflies!( 3, |columns, this: &Self| AvxVector::column_butterfly3(columns, this.twiddles_butterfly3), |columns, this: &Self| AvxVector::column_butterfly3(columns, this.twiddles_butterfly3.lo()) ); mixedradix_transpose!(3, AvxVector::transpose3_packed, AvxVector::transpose3_packed, 0;1;2, 0;1 ); boilerplate_mixedradix!(); } pub struct MixedRadix4xnAvx { twiddles_butterfly4: Rotation90, common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(MixedRadix4xnAvx); impl MixedRadix4xnAvx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(inner_fft: Arc>) -> Self { Self { twiddles_butterfly4: AvxVector::make_rotation90(inner_fft.fft_direction()), common_data: mixedradix_gen_data!(4, inner_fft), _phantom: std::marker::PhantomData, } } mixedradix_column_butterflies!( 4, |columns, this: &Self| AvxVector::column_butterfly4(columns, this.twiddles_butterfly4), |columns, this: &Self| AvxVector::column_butterfly4(columns, this.twiddles_butterfly4.lo()) ); mixedradix_transpose!(4, AvxVector::transpose4_packed, AvxVector::transpose4_packed, 0;1;2;3, 0;1;2 ); boilerplate_mixedradix!(); } pub struct MixedRadix5xnAvx { twiddles_butterfly5: [A::VectorType; 2], common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(MixedRadix5xnAvx); impl MixedRadix5xnAvx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(inner_fft: Arc>) -> Self { Self { twiddles_butterfly5: [ AvxVector::broadcast_twiddle(1, 5, inner_fft.fft_direction()), AvxVector::broadcast_twiddle(2, 5, inner_fft.fft_direction()), ], common_data: mixedradix_gen_data!(5, inner_fft), _phantom: std::marker::PhantomData, } } mixedradix_column_butterflies!( 5, |columns, this: &Self| AvxVector::column_butterfly5(columns, this.twiddles_butterfly5), |columns, this: &Self| 
AvxVector::column_butterfly5( columns, [ this.twiddles_butterfly5[0].lo(), this.twiddles_butterfly5[1].lo() ] ) ); mixedradix_transpose!(5, AvxVector::transpose5_packed, AvxVector::transpose5_packed, 0;1;2;3;4, 0;1;2 ); boilerplate_mixedradix!(); } pub struct MixedRadix6xnAvx { twiddles_butterfly3: A::VectorType, common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(MixedRadix6xnAvx); impl MixedRadix6xnAvx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(inner_fft: Arc>) -> Self { Self { twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, inner_fft.fft_direction()), common_data: mixedradix_gen_data!(6, inner_fft), _phantom: std::marker::PhantomData, } } mixedradix_column_butterflies!( 6, |columns, this: &Self| AvxVector256::column_butterfly6(columns, this.twiddles_butterfly3), |columns, this: &Self| AvxVector128::column_butterfly6(columns, this.twiddles_butterfly3) ); mixedradix_transpose!(6, AvxVector::transpose6_packed, AvxVector::transpose6_packed, 0;1;2;3;4;5, 0;1;2;3 ); boilerplate_mixedradix!(); } pub struct MixedRadix7xnAvx { twiddles_butterfly7: [A::VectorType; 3], common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(MixedRadix7xnAvx); impl MixedRadix7xnAvx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(inner_fft: Arc>) -> Self { Self { twiddles_butterfly7: [ AvxVector::broadcast_twiddle(1, 7, inner_fft.fft_direction()), AvxVector::broadcast_twiddle(2, 7, inner_fft.fft_direction()), AvxVector::broadcast_twiddle(3, 7, inner_fft.fft_direction()), ], common_data: mixedradix_gen_data!(7, inner_fft), _phantom: std::marker::PhantomData, } } mixedradix_column_butterflies!( 7, |columns, this: &Self| AvxVector::column_butterfly7(columns, this.twiddles_butterfly7), |columns, this: &Self| AvxVector::column_butterfly7( columns, [ this.twiddles_butterfly7[0].lo(), this.twiddles_butterfly7[1].lo(), this.twiddles_butterfly7[2].lo() ] ) ); mixedradix_transpose!(7, AvxVector::transpose7_packed, AvxVector::transpose7_packed, 0;1;2;3;4;5;6, 0;1;2;3;4 ); boilerplate_mixedradix!(); } pub struct MixedRadix8xnAvx { twiddles_butterfly4: Rotation90, common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(MixedRadix8xnAvx); impl MixedRadix8xnAvx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(inner_fft: Arc>) -> Self { Self { twiddles_butterfly4: AvxVector::make_rotation90(inner_fft.fft_direction()), common_data: mixedradix_gen_data!(8, inner_fft), _phantom: std::marker::PhantomData, } } mixedradix_column_butterflies!( 8, |columns, this: &Self| AvxVector::column_butterfly8(columns, this.twiddles_butterfly4), |columns, this: &Self| AvxVector::column_butterfly8(columns, this.twiddles_butterfly4.lo()) ); mixedradix_transpose!(8, AvxVector::transpose8_packed, AvxVector::transpose8_packed, 0;1;2;3;4;5;6;7, 0;1;2;3;4;5 ); boilerplate_mixedradix!(); } pub struct MixedRadix9xnAvx { twiddles_butterfly9: [A::VectorType; 3], twiddles_butterfly9_lo: [A::VectorType; 2], twiddles_butterfly3: A::VectorType, common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(MixedRadix9xnAvx); impl MixedRadix9xnAvx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(inner_fft: Arc>) -> Self { let inverse = inner_fft.fft_direction(); let twiddle1 = AvxVector::broadcast_twiddle(1, 9, inner_fft.fft_direction()); let twiddle2 = AvxVector::broadcast_twiddle(2, 9, inner_fft.fft_direction()); let twiddle4 = 
AvxVector::broadcast_twiddle(4, 9, inner_fft.fft_direction()); Self { twiddles_butterfly9: [ AvxVector::broadcast_twiddle(1, 9, inverse), AvxVector::broadcast_twiddle(2, 9, inverse), AvxVector::broadcast_twiddle(4, 9, inverse), ], twiddles_butterfly9_lo: [ AvxVector256::merge(twiddle1, twiddle2), AvxVector256::merge(twiddle2, twiddle4), ], twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, inner_fft.fft_direction()), common_data: mixedradix_gen_data!(9, inner_fft), _phantom: std::marker::PhantomData, } } mixedradix_column_butterflies!( 9, |columns, this: &Self| AvxVector256::column_butterfly9( columns, this.twiddles_butterfly9, this.twiddles_butterfly3 ), |columns, this: &Self| AvxVector128::column_butterfly9( columns, this.twiddles_butterfly9_lo, this.twiddles_butterfly3 ) ); mixedradix_transpose!(9, AvxVector::transpose9_packed, AvxVector::transpose9_packed, 0;1;2;3;4;5;6;7;8, 0;1;2;3;4;5 ); boilerplate_mixedradix!(); } pub struct MixedRadix11xnAvx { twiddles_butterfly11: [A::VectorType; 5], common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(MixedRadix11xnAvx); impl MixedRadix11xnAvx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(inner_fft: Arc>) -> Self { Self { twiddles_butterfly11: [ AvxVector::broadcast_twiddle(1, 11, inner_fft.fft_direction()), AvxVector::broadcast_twiddle(2, 11, inner_fft.fft_direction()), AvxVector::broadcast_twiddle(3, 11, inner_fft.fft_direction()), AvxVector::broadcast_twiddle(4, 11, inner_fft.fft_direction()), AvxVector::broadcast_twiddle(5, 11, inner_fft.fft_direction()), ], common_data: mixedradix_gen_data!(11, inner_fft), _phantom: std::marker::PhantomData, } } mixedradix_column_butterflies!( 11, |columns, this: &Self| AvxVector::column_butterfly11(columns, this.twiddles_butterfly11), |columns, this: &Self| AvxVector::column_butterfly11( columns, [ this.twiddles_butterfly11[0].lo(), this.twiddles_butterfly11[1].lo(), this.twiddles_butterfly11[2].lo(), this.twiddles_butterfly11[3].lo(), this.twiddles_butterfly11[4].lo() ] ) ); mixedradix_transpose!(11, AvxVector::transpose11_packed, AvxVector::transpose11_packed, 0;1;2;3;4;5;6;7;8;9;10, 0;1;2;3;4;5;6;7 ); boilerplate_mixedradix!(); } pub struct MixedRadix12xnAvx { twiddles_butterfly4: Rotation90, twiddles_butterfly3: A::VectorType, common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(MixedRadix12xnAvx); impl MixedRadix12xnAvx { #[target_feature(enable = "avx")] unsafe fn new_with_avx(inner_fft: Arc>) -> Self { let inverse = inner_fft.fft_direction(); Self { twiddles_butterfly4: AvxVector::make_rotation90(inverse), twiddles_butterfly3: AvxVector::broadcast_twiddle(1, 3, inverse), common_data: mixedradix_gen_data!(12, inner_fft), _phantom: std::marker::PhantomData, } } mixedradix_column_butterflies!( 12, |columns, this: &Self| AvxVector256::column_butterfly12( columns, this.twiddles_butterfly3, this.twiddles_butterfly4 ), |columns, this: &Self| AvxVector128::column_butterfly12( columns, this.twiddles_butterfly3, this.twiddles_butterfly4 ) ); mixedradix_transpose!(12, AvxVector::transpose12_packed, AvxVector::transpose12_packed, 0;1;2;3;4;5;6;7;8;9;10;11, 0;1;2;3;4;5;6;7;8 ); boilerplate_mixedradix!(); } pub struct MixedRadix16xnAvx { twiddles_butterfly4: Rotation90, twiddles_butterfly16: [A::VectorType; 2], common_data: CommonSimdData, _phantom: std::marker::PhantomData, } boilerplate_avx_fft_commondata!(MixedRadix16xnAvx); impl MixedRadix16xnAvx { #[target_feature(enable = "avx")] unsafe fn 
new_with_avx(inner_fft: Arc>) -> Self { let inverse = inner_fft.fft_direction(); Self { twiddles_butterfly4: AvxVector::make_rotation90(inner_fft.fft_direction()), twiddles_butterfly16: [ AvxVector::broadcast_twiddle(1, 16, inverse), AvxVector::broadcast_twiddle(3, 16, inverse), ], common_data: mixedradix_gen_data!(16, inner_fft), _phantom: std::marker::PhantomData, } } #[target_feature(enable = "avx", enable = "fma")] unsafe fn perform_column_butterflies(&self, mut buffer: impl AvxArrayMut) { // How many rows this FFT has, ie 2 for 2xn, 4 for 4xn, etc const ROW_COUNT: usize = 16; const TWIDDLES_PER_COLUMN: usize = ROW_COUNT - 1; let len_per_row = self.len() / ROW_COUNT; let chunk_count = len_per_row / A::VectorType::COMPLEX_PER_VECTOR; // process the column FFTs for (c, twiddle_chunk) in self .common_data .twiddles .chunks_exact(TWIDDLES_PER_COLUMN) .take(chunk_count) .enumerate() { let index_base = c * A::VectorType::COMPLEX_PER_VECTOR; column_butterfly16_loadfn!( |index| buffer.load_complex(index_base + len_per_row * index), |mut data, index| { if index > 0 { data = AvxVector::mul_complex(data, twiddle_chunk[index - 1]); } buffer.store_complex(data, index_base + len_per_row * index) }, self.twiddles_butterfly16, self.twiddles_butterfly4 ); } // finally, we might have a single partial chunk. // Normally, we can fit 4 complex numbers into an AVX register, but we only have `partial_remainder` columns left, so we need special logic to handle these final columns let partial_remainder = len_per_row % A::VectorType::COMPLEX_PER_VECTOR; if partial_remainder > 0 { let partial_remainder_base = chunk_count * A::VectorType::COMPLEX_PER_VECTOR; let partial_remainder_twiddle_base = self.common_data.twiddles.len() - TWIDDLES_PER_COLUMN; let final_twiddle_chunk = &self.common_data.twiddles[partial_remainder_twiddle_base..]; match partial_remainder { 1 => { column_butterfly16_loadfn!( |index| buffer .load_partial1_complex(partial_remainder_base + len_per_row * index), |mut data, index| { if index > 0 { let twiddle: A::VectorType = final_twiddle_chunk[index - 1]; data = AvxVector::mul_complex(data, twiddle.lo()); } buffer.store_partial1_complex( data, partial_remainder_base + len_per_row * index, ) }, [ self.twiddles_butterfly16[0].lo(), self.twiddles_butterfly16[1].lo() ], self.twiddles_butterfly4.lo() ); } 2 => { column_butterfly16_loadfn!( |index| buffer .load_partial2_complex(partial_remainder_base + len_per_row * index), |mut data, index| { if index > 0 { let twiddle: A::VectorType = final_twiddle_chunk[index - 1]; data = AvxVector::mul_complex(data, twiddle.lo()); } buffer.store_partial2_complex( data, partial_remainder_base + len_per_row * index, ) }, [ self.twiddles_butterfly16[0].lo(), self.twiddles_butterfly16[1].lo() ], self.twiddles_butterfly4.lo() ); } 3 => { column_butterfly16_loadfn!( |index| buffer .load_partial3_complex(partial_remainder_base + len_per_row * index), |mut data, index| { if index > 0 { data = AvxVector::mul_complex(data, final_twiddle_chunk[index - 1]); } buffer.store_partial3_complex( data, partial_remainder_base + len_per_row * index, ) }, self.twiddles_butterfly16, self.twiddles_butterfly4 ); } _ => unreachable!(), } } } mixedradix_transpose!(16, AvxVector::transpose16_packed, AvxVector::transpose16_packed, 0;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15, 0;1;2;3;4;5;6;7;8;9;10;11 ); boilerplate_mixedradix!(); } #[cfg(test)] mod unit_tests { use super::*; use crate::algorithm::*; use crate::test_utils::check_fft_algorithm; use std::sync::Arc; macro_rules! 
test_avx_mixed_radix { ($f32_test_name:ident, $f64_test_name:ident, $struct_name:ident, $inner_count:expr) => ( #[test] fn $f32_test_name() { for inner_fft_len in 1..32 { let len = inner_fft_len * $inner_count; let inner_fft_forward = Arc::new(Dft::new(inner_fft_len, FftDirection::Forward)) as Arc>; let fft_forward = $struct_name::::new(inner_fft_forward).expect("Can't run test because this machine doesn't have the required instruction sets"); check_fft_algorithm(&fft_forward, len, FftDirection::Forward); let inner_fft_inverse = Arc::new(Dft::new(inner_fft_len, FftDirection::Inverse)) as Arc>; let fft_inverse = $struct_name::::new(inner_fft_inverse).expect("Can't run test because this machine doesn't have the required instruction sets"); check_fft_algorithm(&fft_inverse, len, FftDirection::Inverse); } } #[test] fn $f64_test_name() { for inner_fft_len in 1..32 { let len = inner_fft_len * $inner_count; let inner_fft_forward = Arc::new(Dft::new(inner_fft_len, FftDirection::Forward)) as Arc>; let fft_forward = $struct_name::::new(inner_fft_forward).expect("Can't run test because this machine doesn't have the required instruction sets"); check_fft_algorithm(&fft_forward, len, FftDirection::Forward); let inner_fft_inverse = Arc::new(Dft::new(inner_fft_len, FftDirection::Inverse)) as Arc>; let fft_inverse = $struct_name::::new(inner_fft_inverse).expect("Can't run test because this machine doesn't have the required instruction sets"); check_fft_algorithm(&fft_inverse, len, FftDirection::Inverse); } } ) } test_avx_mixed_radix!( test_mixedradix_2xn_avx_f32, test_mixedradix_2xn_avx_f64, MixedRadix2xnAvx, 2 ); test_avx_mixed_radix!( test_mixedradix_3xn_avx_f32, test_mixedradix_3xn_avx_f64, MixedRadix3xnAvx, 3 ); test_avx_mixed_radix!( test_mixedradix_4xn_avx_f32, test_mixedradix_4xn_avx_f64, MixedRadix4xnAvx, 4 ); test_avx_mixed_radix!( test_mixedradix_5xn_avx_f32, test_mixedradix_5xn_avx_f64, MixedRadix5xnAvx, 5 ); test_avx_mixed_radix!( test_mixedradix_6xn_avx_f32, test_mixedradix_6xn_avx_f64, MixedRadix6xnAvx, 6 ); test_avx_mixed_radix!( test_mixedradix_7xn_avx_f32, test_mixedradix_7xn_avx_f64, MixedRadix7xnAvx, 7 ); test_avx_mixed_radix!( test_mixedradix_8xn_avx_f32, test_mixedradix_8xn_avx_f64, MixedRadix8xnAvx, 8 ); test_avx_mixed_radix!( test_mixedradix_9xn_avx_f32, test_mixedradix_9xn_avx_f64, MixedRadix9xnAvx, 9 ); test_avx_mixed_radix!( test_mixedradix_11xn_avx_f32, test_mixedradix_11xn_avx_f64, MixedRadix11xnAvx, 11 ); test_avx_mixed_radix!( test_mixedradix_12xn_avx_f32, test_mixedradix_12xn_avx_f64, MixedRadix12xnAvx, 12 ); test_avx_mixed_radix!( test_mixedradix_16xn_avx_f32, test_mixedradix_16xn_avx_f64, MixedRadix16xnAvx, 16 ); } rustfft-6.2.0/src/avx/avx_planner.rs000064400000000000000000001523340072674642500156360ustar 00000000000000use std::sync::Arc; use std::{any::TypeId, cmp::min}; use primal_check::miller_rabin; use crate::algorithm::*; use crate::common::FftNum; use crate::math_utils::PartialFactors; use crate::Fft; use crate::{algorithm::butterflies::*, fft_cache::FftCache}; use super::*; fn wrap_fft(butterfly: impl Fft + 'static) -> Arc> { Arc::new(butterfly) as Arc> } #[derive(Debug)] enum MixedRadixBase { // The base will be a butterfly algorithm ButterflyBase(usize), // The base will be an instance of Rader's Algorithm. That will require its own plan for the internal FFT, which we'll handle separately RadersBase(usize), // The base will be an instance of Bluestein's Algorithm. That will require its own plan for the internal FFT, which we'll handle separately. 
// First usize is the base length, second usize is the inner FFT length BluesteinsBase(usize, usize), // The "base" is a FFT instance we already have cached CacheBase(usize), } impl MixedRadixBase { fn base_len(&self) -> usize { match self { Self::ButterflyBase(len) => *len, Self::RadersBase(len) => *len, Self::BluesteinsBase(len, _) => *len, Self::CacheBase(len) => *len, } } } /// Represents a FFT plan, stored as a base FFT and a stack of MixedRadix*xn on top of it. #[derive(Debug)] pub struct MixedRadixPlan { len: usize, // product of base and radixes radixes: Vec, // stored from innermost to outermost base: MixedRadixBase, } impl MixedRadixPlan { fn new(base: MixedRadixBase, radixes: Vec) -> Self { Self { len: base.base_len() * radixes.iter().map(|r| *r as usize).product::(), base, radixes, } } fn cached(cached_len: usize) -> Self { Self { len: cached_len, base: MixedRadixBase::CacheBase(cached_len), radixes: Vec::new(), } } fn butterfly(butterfly_len: usize, radixes: Vec) -> Self { Self::new(MixedRadixBase::ButterflyBase(butterfly_len), radixes) } fn push_radix(&mut self, radix: u8) { self.radixes.push(radix); self.len *= radix as usize; } fn push_radix_power(&mut self, radix: u8, power: u32) { self.radixes .extend(std::iter::repeat(radix).take(power as usize)); self.len *= (radix as usize).pow(power); } } /// The AVX FFT planner creates new FFT algorithm instances which take advantage of the AVX instruction set. /// /// Creating an instance of `FftPlannerAvx` requires the `avx` and `fma` instructions to be available on the current machine, and it requires RustFFT's /// `avx` feature flag to be set. A few algorithms will use `avx2` if it's available, but it isn't required. /// /// For the time being, AVX acceleration is a black box, and AVX accelerated algorithms are not available without a planner. This may change in the future. /// /// ~~~ /// // Perform a forward FFT of size 1234, accelerated by AVX /// use std::sync::Arc; /// use rustfft::{FftPlannerAvx, num_complex::Complex}; /// /// // If FftPlannerAvx::new() returns Ok(), we'll know AVX algorithms are available /// // on this machine, and that RustFFT was compiled with the `avx` feature flag /// if let Ok(mut planner) = FftPlannerAvx::new() { /// let fft = planner.plan_fft_forward(1234); /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234]; /// fft.process(&mut buffer); /// /// // The FFT instance returned by the planner has the type `Arc>`, /// // where T is the numeric type, ie f32 or f64, so it's cheap to clone /// let fft_clone = Arc::clone(&fft); /// } /// ~~~ /// /// If you plan on creating multiple FFT instances, it is recommended to re-use the same planner for all of them. This /// is because the planner re-uses internal data across FFT instances wherever possible, saving memory and reducing /// setup time. (FFT instances created with one planner will never re-use data and buffers with FFT instances created /// by a different planner) /// /// Each FFT instance owns [`Arc`s](std::sync::Arc) to its internal data, rather than borrowing it from the planner, so it's perfectly /// safe to drop the planner after creating Fft instances. pub struct FftPlannerAvx { internal_planner: Box>, } impl FftPlannerAvx { /// Constructs a new `FftPlannerAvx` instance. /// /// Returns `Ok(planner_instance)` if we're compiling for X86_64, AVX support was enabled in feature flags, and the current CPU supports the `avx` and `fma` CPU features. /// Returns `Err(())` if AVX support is not available.
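///
/// Below is a minimal usage sketch for `new()` (illustrative only, modeled on the struct-level example
/// above): the caller decides what to do when AVX isn't available, for example by falling back to the
/// scalar `FftPlanner`.
///
/// ~~~
/// use rustfft::{FftPlannerAvx, num_complex::Complex};
///
/// if let Ok(mut planner) = FftPlannerAvx::new() {
///     // AVX + FMA are available, so this FFT will be AVX-accelerated
///     let fft = planner.plan_fft_forward(100);
///
///     let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 100];
///     fft.process(&mut buffer);
/// } else {
///     // AVX + FMA are not available; fall back to a different planner here
/// }
/// ~~~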
pub fn new() -> Result { // Eventually we might make AVX algorithms that don't also require FMA. // If that happens, we could check for only AVX here, but that seems like a pretty low-priority addition let has_avx = is_x86_feature_detected!("avx"); let has_fma = is_x86_feature_detected!("fma"); if has_avx && has_fma { // Ideally, we would implement the planner with specialization. // Specialization won't be on stable rust for a long time though, so in the meantime, we can hack around it. // // The first step of the hack is to use TypeId to determine if T is f32, f64, or neither. If neither, we don't want to do any AVX acceleration // If it's f32 or f64, then construct an internal type that has two generic parameters, one bounded on AvxNum, the other bounded on FftNum // // - A is bounded on the AvxNum trait, and is the type we use for any AVX computations. It has associated types for AVX vectors, // associated constants for the number of elements per vector, etc. // - T is bounded on the FftNum trait, and thus is the type that every FFT algorithm will receive its input/output buffers in. // // An important snag relevant to the planner is that we have to box and type-erase the AvxNum bound, // since the only other option is making the AvxNum bound a part of this struct's external API // // Another annoying snag with this setup is that we frequently have to transmute buffers from &mut [Complex] to &mut [Complex] or vice versa. // We know this is safe because we assert everywhere that Type(A)==Type(T), so it's just a matter of "doing it right" every time. // These transmutes are required because the FFT algorithm's input will come through the FFT trait, which may only be bounded by FftNum. // So the buffers will have the type &mut [Complex]. The problem comes in that all of our AVX computation tools are on the AvxNum trait. // // If we had specialization, we could easily convince the compiler that AvxNum and FftNum were different bounds on the same underlying type (IE f32 or f64) // but without it, the compiler is convinced that they are different. So we use the transmute as a last-resort way to overcome this limitation. // // We keep both the A and T types around in all of our AVX-related structs so that we can cast between A and T whenever necessary. let id_f32 = TypeId::of::(); let id_f64 = TypeId::of::(); let id_t = TypeId::of::(); if id_t == id_f32 { return Ok(Self { internal_planner: Box::new(AvxPlannerInternal::::new()), }); } else if id_t == id_f64 { return Ok(Self { internal_planner: Box::new(AvxPlannerInternal::::new()), }); } } Err(()) } /// Returns a `Fft` instance which uses AVX instructions to compute FFTs of size `len`. /// /// If the provided `direction` is `FftDirection::Forward`, the returned instance will compute forward FFTs. If it's `FftDirection::Inverse`, it will compute inverse FFTs. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft(&mut self, len: usize, direction: FftDirection) -> Arc> { self.internal_planner.plan_and_construct_fft(len, direction) } /// Returns a `Fft` instance which uses AVX instructions to compute forward FFTs of size `len`. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time.
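///
/// As a small illustrative sketch (not additional API surface), planning several sizes through the same
/// planner lets the resulting instances share cached internal data where possible:
///
/// ~~~
/// use rustfft::FftPlannerAvx;
///
/// if let Ok(mut planner) = FftPlannerAvx::<f32>::new() {
///     // Both instances come from the same planner, so internal data can be re-used between them
///     let _fft_a = planner.plan_fft_forward(1200);
///     let _fft_b = planner.plan_fft_forward(2400);
/// }
/// ~~~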
pub fn plan_fft_forward(&mut self, len: usize) -> Arc> { self.plan_fft(len, FftDirection::Forward) } /// Returns a `Fft` instance which uses AVX instructions to compute inverse FFTs of size `len. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_inverse(&mut self, len: usize) -> Arc> { self.plan_fft(len, FftDirection::Inverse) } /// Returns a FFT plan without constructing it #[allow(unused)] pub(crate) fn debug_plan_fft(&self, len: usize, direction: FftDirection) -> MixedRadixPlan { self.internal_planner.debug_plan_fft(len, direction) } } trait AvxPlannerInternalAPI: Send { fn plan_and_construct_fft(&mut self, len: usize, direction: FftDirection) -> Arc>; fn debug_plan_fft(&self, len: usize, direction: FftDirection) -> MixedRadixPlan; } struct AvxPlannerInternal { cache: FftCache, _phantom: std::marker::PhantomData, } impl AvxPlannerInternalAPI for AvxPlannerInternal { fn plan_and_construct_fft(&mut self, len: usize, direction: FftDirection) -> Arc> { // Step 1: Create a plan for this FFT length. let plan = self.plan_fft(len, direction, Self::plan_mixed_radix_base); // Step 2: Construct the plan. If the base is rader's algorithm or bluestein's algorithm, this may call self.plan_and_construct_fft recursively! self.construct_plan( plan, direction, Self::construct_butterfly, Self::plan_and_construct_fft, ) } fn debug_plan_fft(&self, len: usize, direction: FftDirection) -> MixedRadixPlan { self.plan_fft(len, direction, Self::plan_mixed_radix_base) } } impl AvxPlannerInternalAPI for AvxPlannerInternal { fn plan_and_construct_fft(&mut self, len: usize, direction: FftDirection) -> Arc> { // Step 1: Create a plan for this FFT length. let plan = self.plan_fft(len, direction, Self::plan_mixed_radix_base); // Step 2: Construct the plan. If the base is rader's algorithm or bluestein's algorithm, this may call self.plan_and_construct_fft recursively! self.construct_plan( plan, direction, Self::construct_butterfly, Self::plan_and_construct_fft, ) } fn debug_plan_fft(&self, len: usize, direction: FftDirection) -> MixedRadixPlan { self.plan_fft(len, direction, Self::plan_mixed_radix_base) } } //------------------------------------------------------------------- // f32-specific planning stuff //------------------------------------------------------------------- impl AvxPlannerInternal { pub fn new() -> Self { // Internal sanity check: Make sure that T == f32. // This struct has two generic parameters A and T, but they must always be the same, and are only kept separate to help work around the lack of specialization. // It would be cool if we could do this as a static_assert instead let id_f32 = TypeId::of::(); let id_t = TypeId::of::(); assert_eq!(id_f32, id_t); Self { cache: FftCache::new(), _phantom: std::marker::PhantomData, } } fn plan_mixed_radix_base(&self, len: usize, factors: &PartialFactors) -> MixedRadixPlan { // if we have non-fast-path factors, use them as our base FFT length, and we will have to use either rader's algorithm or bluestein's algorithm as our base if factors.get_other_factors() > 1 { let other_factors = factors.get_other_factors(); // First, if the "other factors" are a butterfly, use that as the butterfly if self.is_butterfly(other_factors) { return MixedRadixPlan::butterfly(other_factors, vec![]); } // We can only use rader's if `other_factors` is prime if miller_rabin(other_factors as u64) { // len is prime, so we can use Rader's Algorithm as a base. 
Whether or not that's a good idea is a different story // Rader's Algorithm is only faster in a few narrow cases. // as a heuristic, only use rader's algorithm if its inner FFT can be computed entirely without bluestein's or rader's // We're intentionally being too conservative here. Otherwise we'd be recursively applying a heuristic, and repeated heuristic failures could stack to make a rader's chain significantly slower. // If we were writing a measuring planner, expanding this heuristic and measuring its effectiveness would be an opportunity for up to 2x performance gains. let inner_factors = PartialFactors::compute(other_factors - 1); if self.is_butterfly(inner_factors.get_other_factors()) { // We only have factors of 2,3,5,7, and 11. If we don't have AVX2, we also have to exclude factors of 5 and 7 and 11, because avx2 gives us enough headroom for the overhead of those to not be a problem if is_x86_feature_detected!("avx2") || (inner_factors.product_power2power3() == len - 1) { return MixedRadixPlan::new( MixedRadixBase::RadersBase(other_factors), vec![], ); } } // At this point, we know we're using bluestein's algorithm for the base. Next step is to plan the inner size we'll use for bluestein's algorithm. let inner_bluesteins_len = self.plan_bluesteins(other_factors, |(_len, factor2, factor3)| { if *factor2 > 16 && *factor3 < 3 { // surprisingly, pure powers of 2 have a pretty steep dropoff in speed after 65536. // the algorithm is designed to generate candidates larger than baseline_candidate, so if we hit a large power of 2, there should be more after it that we can skip to. return false; } true }); return MixedRadixPlan::new( MixedRadixBase::BluesteinsBase(other_factors, inner_bluesteins_len), vec![], ); } // If this FFT size is a butterfly, use that if self.is_butterfly(len) { return MixedRadixPlan::butterfly(len, vec![]); } // If the power2 * power3 component of this FFT is a butterfly and not too small, return that let power2power3 = factors.product_power2power3(); if power2power3 > 4 && self.is_butterfly(power2power3) { return MixedRadixPlan::butterfly(power2power3, vec![]); } // most of this code is heuristics assuming FFTs of a minimum size. if the FFT is below that minimum size, the heuristics break down.
// so the first thing we're going to do is hardcode the plan for some specific sizes where we know the heuristics won't be enough let hardcoded_base = match power2power3 { // 3 * 2^n special cases 96 => Some(MixedRadixPlan::butterfly(32, vec![3])), // 2^5 * 3 192 => Some(MixedRadixPlan::butterfly(48, vec![4])), // 2^6 * 3 1536 => Some(MixedRadixPlan::butterfly(48, vec![8, 4])), // 2^9 * 3 // 9 * 2^n special cases 18 => Some(MixedRadixPlan::butterfly(3, vec![6])), // 2 * 3^2 144 => Some(MixedRadixPlan::butterfly(36, vec![4])), // 2^4 * 3^2 _ => None, }; if let Some(hardcoded) = hardcoded_base { return hardcoded; } if factors.get_power2() >= 5 { match factors.get_power3() { // if this FFT is a power of 2, our strategy here is to tweak the butterfly to free us up to do an 8xn chain 0 => match factors.get_power2() % 3 { 0 => MixedRadixPlan::butterfly(512, vec![]), 1 => MixedRadixPlan::butterfly(256, vec![]), 2 => MixedRadixPlan::butterfly(256, vec![]), _ => unreachable!(), }, // if this FFT is 3 times a power of 2, our strategy here is to tweak butterflies to make it easier to set up an 8xn chain 1 => match factors.get_power2() % 3 { 0 => MixedRadixPlan::butterfly(64, vec![12, 16]), 1 => MixedRadixPlan::butterfly(48, vec![]), 2 => MixedRadixPlan::butterfly(64, vec![]), _ => unreachable!(), }, // if this FFT is 9 or greater times a power of 2, just use 72. As you might expect, in this vast field of options, what is optimal becomes a lot more muddy and situational // but across all the benchmarking i've done, 72 seems like the best default that will get us the best plan in 95% of the cases // 64, 54, and 48 are occasionally faster, although i haven't been able to discern a pattern.
_ => MixedRadixPlan::butterfly(72, vec![]), } } // If this FFT has powers of 11, 7, or 5, use that else if factors.get_power11() > 0 { MixedRadixPlan::butterfly(11, vec![]) } else if factors.get_power7() > 0 { MixedRadixPlan::butterfly(7, vec![]) } else if factors.get_power5() > 0 { MixedRadixPlan::butterfly(5, vec![]) } else { panic!( "Couldn't find a base for FFT size {}, factors={:?}", len, factors ) } } fn is_butterfly(&self, len: usize) -> bool { [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 16, 17, 19, 23, 24, 27, 29, 31, 32, 36, 48, 54, 64, 72, 128, 256, 512, ] .contains(&len) } fn construct_butterfly(&self, len: usize, direction: FftDirection) -> Arc> { match len { 0 | 1 => wrap_fft(Dft::new(len, direction)), 2 => wrap_fft(Butterfly2::new(direction)), 3 => wrap_fft(Butterfly3::new(direction)), 4 => wrap_fft(Butterfly4::new(direction)), 5 => wrap_fft(Butterfly5Avx::new(direction).unwrap()), 6 => wrap_fft(Butterfly6::new(direction)), 7 => wrap_fft(Butterfly7Avx::new(direction).unwrap()), 8 => wrap_fft(Butterfly8Avx::new(direction).unwrap()), 9 => wrap_fft(Butterfly9Avx::new(direction).unwrap()), 11 => wrap_fft(Butterfly11Avx::new(direction).unwrap()), 12 => wrap_fft(Butterfly12Avx::new(direction).unwrap()), 13 => wrap_fft(Butterfly13::new(direction)), 16 => wrap_fft(Butterfly16Avx::new(direction).unwrap()), 17 => wrap_fft(Butterfly17::new(direction)), 19 => wrap_fft(Butterfly19::new(direction)), 23 => wrap_fft(Butterfly23::new(direction)), 24 => wrap_fft(Butterfly24Avx::new(direction).unwrap()), 27 => wrap_fft(Butterfly27Avx::new(direction).unwrap()), 29 => wrap_fft(Butterfly29::new(direction)), 31 => wrap_fft(Butterfly31::new(direction)), 32 => wrap_fft(Butterfly32Avx::new(direction).unwrap()), 36 => wrap_fft(Butterfly36Avx::new(direction).unwrap()), 48 => wrap_fft(Butterfly48Avx::new(direction).unwrap()), 54 => wrap_fft(Butterfly54Avx::new(direction).unwrap()), 64 => wrap_fft(Butterfly64Avx::new(direction).unwrap()), 72 => wrap_fft(Butterfly72Avx::new(direction).unwrap()), 128 => wrap_fft(Butterfly128Avx::new(direction).unwrap()), 256 => wrap_fft(Butterfly256Avx::new(direction).unwrap()), 512 => wrap_fft(Butterfly512Avx::new(direction).unwrap()), _ => panic!("Invalid butterfly len: {}", len), } } } //------------------------------------------------------------------- // f64-specific planning stuff //------------------------------------------------------------------- impl AvxPlannerInternal { pub fn new() -> Self { // Internal sanity check: Make sure that T == f64. // This struct has two generic parameters A and T, but they must always be the same, and are only kept separate to help work around the lack of specialization. // It would be cool if we could do this as a static_assert instead let id_f64 = TypeId::of::(); let id_t = TypeId::of::(); assert_eq!(id_f64, id_t); Self { cache: FftCache::new(), _phantom: std::marker::PhantomData, } } fn plan_mixed_radix_base(&self, len: usize, factors: &PartialFactors) -> MixedRadixPlan { // if we have a factor that can't be computed with 2xn 3xn etc, we'll have to compute it with bluestein's or rader's, so use that as the base if factors.get_other_factors() > 1 { let other_factors = factors.get_other_factors(); // First, if the "other factors" are a butterfly, use that as the butterfly if self.is_butterfly(other_factors) { return MixedRadixPlan::butterfly(other_factors, vec![]); } // We can only use rader's if `other_factors` is prime if miller_rabin(other_factors as u64) { // len is prime, so we can use Rader's Algorithm as a base. 
Whether or not that's a good idea is a different story // Rader's Algorithm is only faster in a few narrow cases. // as a heuristic, only use rader's algorithm if its inner FFT can be computed entirely without bluestein's or rader's // We're intentionally being too conservative here. Otherwise we'd be recursively applying a heuristic, and repeated heuristic failures could stack to make a rader's chain significantly slower. // If we were writing a measuring planner, expanding this heuristic and measuring its effectiveness would be an opportunity for up to 2x performance gains. let inner_factors = PartialFactors::compute(other_factors - 1); if self.is_butterfly(inner_factors.get_other_factors()) { // We only have factors of 2,3,5,7, and 11. If we don't have AVX2, we also have to exclude factors of 5 and 7 and 11, because avx2 gives us enough headroom for the overhead of those to not be a problem if is_x86_feature_detected!("avx2") || (inner_factors.product_power2power3() == len - 1) { return MixedRadixPlan::new( MixedRadixBase::RadersBase(other_factors), vec![], ); } } } // At this point, we know we're using bluestein's algorithm for the base. Next step is to plan the inner size we'll use for bluestein's algorithm. let inner_bluesteins_len = self.plan_bluesteins(other_factors, |(_len, factor2, factor3)| { if *factor3 < 1 && *factor2 > 13 { return false; } if *factor3 < 4 && *factor2 > 14 { return false; } true }); return MixedRadixPlan::new( MixedRadixBase::BluesteinsBase(other_factors, inner_bluesteins_len), vec![], ); } // If this FFT size is a butterfly, use that if self.is_butterfly(len) { return MixedRadixPlan::butterfly(len, vec![]); } // If the power2 * power3 component of this FFT is a butterfly and not too small, return that let power2power3 = factors.product_power2power3(); if power2power3 > 4 && self.is_butterfly(power2power3) { return MixedRadixPlan::butterfly(power2power3, vec![]); } // most of this code is heuristics assuming FFTs of a minimum size. if the FFT is below that minimum size, the heuristics break down. 
// so the first thing we're going to do is hardcode the plan for some specific sizes where we know the heuristics won't be enough let hardcoded_base = match power2power3 { // 2^n special cases 64 => Some(MixedRadixPlan::butterfly(16, vec![4])), // 2^6 // 3 * 2^n special cases 48 => Some(MixedRadixPlan::butterfly(12, vec![4])), // 3 * 2^4 96 => Some(MixedRadixPlan::butterfly(12, vec![8])), // 3 * 2^5 768 => Some(MixedRadixPlan::butterfly(12, vec![8, 8])), // 3 * 2^8 // 9 * 2^n special cases 72 => Some(MixedRadixPlan::butterfly(24, vec![3])), // 2^3 * 3^2 288 => Some(MixedRadixPlan::butterfly(32, vec![9])), // 2^5 * 3^2 // 4 * 3^n special cases 108 => Some(MixedRadixPlan::butterfly(18, vec![6])), // 2^2 * 3^3 _ => None, }; if let Some(hardcoded) = hardcoded_base { return hardcoded; } if factors.get_power2() >= 4 { match factors.get_power3() { // if this FFT is a power of 2, our strategy here is to tweak the butterfly to free us up to do an 8xn chain 0 => match factors.get_power2() % 3 { 0 => MixedRadixPlan::butterfly(512, vec![]), 1 => MixedRadixPlan::butterfly(128, vec![]), 2 => MixedRadixPlan::butterfly(256, vec![]), _ => unreachable!(), }, // if this FFT is 3 times a power of 2, our strategy here is to tweak butterflies to make it easier to set up an 8xn chain 1 => match factors.get_power2() % 3 { 0 => MixedRadixPlan::butterfly(24, vec![]), 1 => MixedRadixPlan::butterfly(32, vec![12]), 2 => MixedRadixPlan::butterfly(32, vec![12, 16]), _ => unreachable!(), }, // if this FFT is 9 times a power of 2, our strategy here is to tweak butterflies to make it easier to set up an 8xn chain 2 => match factors.get_power2() % 3 { 0 => MixedRadixPlan::butterfly(36, vec![16]), 1 => MixedRadixPlan::butterfly(36, vec![]), 2 => MixedRadixPlan::butterfly(18, vec![]), _ => unreachable!(), }, // this FFT is 27 or greater times a power of two. As you might expect, in this vast field of options, what is optimal becomes a lot more muddy and situational // but across all the benchmarking i've done, 36 seems like the best default that will get us the best plan in 95% of the cases // 32 is rarely faster, although i haven't been able to discern a pattern.
_ => MixedRadixPlan::butterfly(36, vec![]), } } // If this FFT has powers of 11, 7, or 5, use that else if factors.get_power11() > 0 { MixedRadixPlan::butterfly(11, vec![]) } else if factors.get_power7() > 0 { MixedRadixPlan::butterfly(7, vec![]) } else if factors.get_power5() > 0 { MixedRadixPlan::butterfly(5, vec![]) } else { panic!( "Couldn't find a base for FFT size {}, factors={:?}", len, factors ) } } fn is_butterfly(&self, len: usize) -> bool { [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 16, 17, 18, 19, 23, 24, 27, 29, 31, 32, 36, 64, 128, 256, 512, ] .contains(&len) } fn construct_butterfly(&self, len: usize, direction: FftDirection) -> Arc> { match len { 0 | 1 => wrap_fft(Dft::new(len, direction)), 2 => wrap_fft(Butterfly2::new(direction)), 3 => wrap_fft(Butterfly3::new(direction)), 4 => wrap_fft(Butterfly4::new(direction)), 5 => wrap_fft(Butterfly5Avx64::new(direction).unwrap()), 6 => wrap_fft(Butterfly6::new(direction)), 7 => wrap_fft(Butterfly7Avx64::new(direction).unwrap()), 8 => wrap_fft(Butterfly8Avx64::new(direction).unwrap()), 9 => wrap_fft(Butterfly9Avx64::new(direction).unwrap()), 11 => wrap_fft(Butterfly11Avx64::new(direction).unwrap()), 12 => wrap_fft(Butterfly12Avx64::new(direction).unwrap()), 13 => wrap_fft(Butterfly13::new(direction)), 16 => wrap_fft(Butterfly16Avx64::new(direction).unwrap()), 17 => wrap_fft(Butterfly17::new(direction)), 18 => wrap_fft(Butterfly18Avx64::new(direction).unwrap()), 19 => wrap_fft(Butterfly19::new(direction)), 23 => wrap_fft(Butterfly23::new(direction)), 24 => wrap_fft(Butterfly24Avx64::new(direction).unwrap()), 27 => wrap_fft(Butterfly27Avx64::new(direction).unwrap()), 29 => wrap_fft(Butterfly29::new(direction)), 31 => wrap_fft(Butterfly31::new(direction)), 32 => wrap_fft(Butterfly32Avx64::new(direction).unwrap()), 36 => wrap_fft(Butterfly36Avx64::new(direction).unwrap()), 64 => wrap_fft(Butterfly64Avx64::new(direction).unwrap()), 128 => wrap_fft(Butterfly128Avx64::new(direction).unwrap()), 256 => wrap_fft(Butterfly256Avx64::new(direction).unwrap()), 512 => wrap_fft(Butterfly512Avx64::new(direction).unwrap()), _ => panic!("Invalid butterfly len: {}", len), } } } //------------------------------------------------------------------- // type-agnostic planning stuff //------------------------------------------------------------------- impl AvxPlannerInternal { // Given a length, return a plan for how this FFT should be computed fn plan_fft( &self, len: usize, direction: FftDirection, base_fn: impl FnOnce(&Self, usize, &PartialFactors) -> MixedRadixPlan, ) -> MixedRadixPlan { // First step: If this size is already cached, return it directly if self.cache.contains_fft(len, direction) { return MixedRadixPlan::cached(len); } // We have butterflies for everything below 10, so if it's below 10, just skip the factorization etc // Notably, this step is *required* if the len is 0, since we can't compute a prime factorization for zero if len < 10 { return MixedRadixPlan::butterfly(len, Vec::new()); } // This length is not cached, so we have to come up with a new plan. The first step is to find a suitable base. let factors = PartialFactors::compute(len); let base = base_fn(self, len, &factors); // it's possible that the base planner plans out the whole FFT. it's guaranteed if `len` is a prime number, or if it's a butterfly, for example let uncached_plan = if base.len == len { base } else { // We have some mixed radix steps to compute! 
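// An illustrative (hypothetical) example of this branch: suppose the base planner picked a 72-point
// butterfly base and left radix factors of 2^7 * 3^2 = 1152 for the chain. Since 12 * 12 * 8 = 1152,
// the radix planner below can cover those factors with a [12, 12, 8] chain on top of the base, using
// only the fast 12xn and 8xn steps and leaving no slow 2xn or 3xn remainder.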
// Compute the factors that need to be computed by mixed radix steps, let radix_factors = factors .divide_by(&PartialFactors::compute(base.len)) .unwrap_or_else(|| { panic!( "Invalid base for FFT length={}, base={:?}, base radixes={:?}", len, base.base, base.radixes ) }); self.plan_mixed_radix(radix_factors, base) }; // Last step: We have a full FFT plan, but some of the steps of that plan may have been cached. If they have, use the largest cached step as the base. self.replan_with_cache(uncached_plan, direction) } // Takes a plan and an algorithm cache, and replaces steps of the plan with cached steps, if possible fn replan_with_cache(&self, plan: MixedRadixPlan, direction: FftDirection) -> MixedRadixPlan { enum CacheLocation { None, Base, Radix(usize, usize), // First value is the length of the cached FFT, and second value is the index in the radix array } let mut largest_cached_len = CacheLocation::None; let base_len = plan.base.base_len(); let mut current_len = base_len; // Check if the cache contains the base if self.cache.contains_fft(current_len, direction) { largest_cached_len = CacheLocation::Base; } // Walk up the radix chain, checking if the cache contains each step for (i, radix) in plan.radixes.iter().enumerate() { current_len *= *radix as usize; if self.cache.contains_fft(current_len, direction) { largest_cached_len = CacheLocation::Radix(current_len, i); } } // If we found a cached length within the plan, update the plan to account for the cache match largest_cached_len { CacheLocation::None => plan, CacheLocation::Base => { MixedRadixPlan::new(MixedRadixBase::CacheBase(base_len), plan.radixes) } CacheLocation::Radix(cached_len, cached_index) => { // We know that `plan.radixes[cached_index]` is the largest cache value, and `cached_len` will be our new base length // Drop every element from `plan.radixes` up to and including cached_index let mut chain = plan.radixes; chain.drain(0..=cached_index); MixedRadixPlan::new(MixedRadixBase::CacheBase(cached_len), chain) } } } // given a set of factors, compute how many iterations of 12xn and 6xn we should plan for. Returns (k, j) for 12^k and 6^j fn plan_power12_power6(radix_factors: &PartialFactors) -> (u32, u32) { // it's helpful to think of this process as rewriting the FFT length as powers of our radixes // the fastest FFT we could possibly compute is 8^n, because the 8xn algorithm is blazing fast. 9xn and 12xn are also in the top tier for speed, so those 3 algorithms are what we will aim for // Specifically, we want to find a combination of 8, 9, and 12, that will "consume" all factors of 2 and 3, without having any leftovers // Unfortunately, most FFTs don't come in the form 8^n * 9^m * 12^k // Thankfully, 6xn is also reasonably fast, so we can use 6xn to strip away factors. // This function's job will be to divide radix_factors into 8^n * 9^m * 12^k * 6^j, which minimizes j, then maximizes k // we're going to hypothetically add as many 12's to our plan as possible, keeping track of how many 6's were required to balance things out // we can also compute this analytically with modular arithmetic, but that technique only works when the FFT is above a minimum size, whereas this loop+array technique always works let max_twelves = min(radix_factors.get_power2() / 2, radix_factors.get_power3()); let mut required_sixes = [None; 4]; // only track 6^0 through 6^3.
6^4 can be converted into 12^2 * 9, and 6^5 can be converted into 12 * 8 * 9 * 9 for hypothetical_twelve_power in 0..(max_twelves + 1) { let hypothetical_twos = radix_factors.get_power2() - hypothetical_twelve_power * 2; let hypothetical_threes = radix_factors.get_power3() - hypothetical_twelve_power; // figure out how many sixes we would need to leave our FFT at 8^n * 9^m via modular arithmetic, and write to that index of our twelves_per_sixes array let sixes = match (hypothetical_twos % 3, hypothetical_threes % 2) { (0, 0) => Some(0), (1, 1) => Some(1), (2, 0) => Some(2), (0, 1) => Some(3), (1, 0) => None, // it would take 4 sixes, which can be replaced by 2 twelves, so we'll hit it in a later loop (if we have that many factors) (2, 1) => None, // it would take 5 sixes, but note that 12 is literally 2^2 * 3^1, so instead of applying 5 sixes, we can apply a single 12 (_, _) => unreachable!(), }; // if we can bring the FFT into range for the fast path with sixes, record so in the required_sixes array // but make sure the number of sixes we're going to apply actually fits into our available factors if let Some(sixes) = sixes { if sixes <= hypothetical_twos && sixes <= hypothetical_threes { required_sixes[sixes as usize] = Some(hypothetical_twelve_power) } } } // required_sixes[i] now contains the largest power of twelve that we can apply, given that we also apply 6^i // we want to apply as many of 12 as possible, so take the array element with the largest non-None element // note that it's possible (and very likely) that either power_twelve or power_six is zero, or both of them are zero! this will happen for a pure power of 2 or power of 3 FFT, for example let (power_twelve, mut power_six) = required_sixes .iter() .enumerate() .filter_map(|(i, maybe_twelve)| maybe_twelve.map(|twelve| (twelve, i as u32))) .fold( (0, 0), |best, current| if current.0 >= best.0 { current } else { best }, ); // special case: if we have exactly one factor of 2 and at least one factor of 3, unconditionally apply a factor of 6 to get rid of the 2 if radix_factors.get_power2() == 1 && radix_factors.get_power3() > 0 { power_six = 1; } // special case: if we have a single factor of 3 and more than one factor of 2 (and we don't have any twelves), unconditionally apply a factor of 6 to get rid of the 3 if radix_factors.get_power2() > 1 && radix_factors.get_power3() == 1 && power_twelve == 0 { power_six = 1; } (power_twelve, power_six) } fn plan_mixed_radix( &self, mut radix_factors: PartialFactors, mut plan: MixedRadixPlan, ) -> MixedRadixPlan { // if we can complete the FFT with a single radix, do it if [2, 3, 4, 5, 6, 7, 8, 9, 12, 16].contains(&radix_factors.product()) { plan.push_radix(radix_factors.product() as u8) } else { // Compute how many powers of 12 and powers of 6 we want to strip away let (power_twelve, power_six) = Self::plan_power12_power6(&radix_factors); // divide our powers of 12 and 6 out of our radix factors radix_factors = radix_factors .divide_by(&PartialFactors::compute( 6usize.pow(power_six) * 12usize.pow(power_twelve), )) .unwrap(); // now that we know the 12 and 6 factors, the plan array can be computed in descending radix size if radix_factors.get_power2() % 3 == 1 && radix_factors.get_power2() > 1 { // our factors of 2 might not quite be a power of 8 -- our plan_power12_power6 function tried its best, but if there are very few factors of 3, it can't help. // if we're 2 * 8^N, benchmarking shows that applying a 16 before our chain of 8s is very fast. 
plan.push_radix(16); radix_factors = radix_factors .divide_by(&PartialFactors::compute(16)) .unwrap(); } plan.push_radix_power(12, power_twelve); plan.push_radix_power(11, radix_factors.get_power11()); plan.push_radix_power(9, radix_factors.get_power3() / 2); plan.push_radix_power(8, radix_factors.get_power2() / 3); plan.push_radix_power(7, radix_factors.get_power7()); plan.push_radix_power(6, power_six); plan.push_radix_power(5, radix_factors.get_power5()); if radix_factors.get_power2() % 3 == 2 { // our factors of 2 might not quite be a power of 8 -- our plan_power12_power6 function tried its best, but if we are a power of 2, it can't help. // if we're 4 * 8^N, benchmarking shows that applying a 4 to the end of our chain of 8s is very fast. plan.push_radix(4); } if radix_factors.get_power3() % 2 == 1 { // our factors of 3 might not quite be a power of 9 -- our plan_power12_power6 function tried its best, but if we are a power of 3, it can't help. // if we're 3 * 9^N, our only choice is to add a 3xn step plan.push_radix(3); } if radix_factors.get_power2() % 3 == 1 { // our factors of 2 might not quite be a power of 8. We tried to correct this with a 16 radix and 4 radix, but as a last resort, apply a 2. 2 is very slow, but it's better than not computing the FFT plan.push_radix(2); } // measurement opportunity: is it faster to let the plan_power12_power6 function put a 4 on the end instead of relying on all 8's? // measurement opportunity: is it faster to slap a 16 on top of the stack? // measurement opportunity: if our plan_power12_power6 function adds both 12s and sixes, is it faster to drop combinations of 12+6 down to 8+9? }; plan } // Constructs and returns a FFT instance from a FFT plan. // If the base is a butterfly, it will call the provided `construct_butterfly_fn` to do so. // If constructing the base requires constructing an inner FFT (i.e. Bluestein's or Rader's algorithm), it will call the provided `inner_fft_fn` to construct it fn construct_plan( &mut self, plan: MixedRadixPlan, direction: FftDirection, construct_butterfly_fn: impl FnOnce(&Self, usize, FftDirection) -> Arc>, inner_fft_fn: impl FnOnce(&mut Self, usize, FftDirection) -> Arc>, ) -> Arc> { let mut fft = match plan.base { MixedRadixBase::CacheBase(len) => self.cache.get(len, direction).unwrap(), MixedRadixBase::ButterflyBase(len) => { let butterfly_instance = construct_butterfly_fn(self, len, direction); // Cache this FFT instance for future calls to `plan_fft` self.cache.insert(&butterfly_instance); butterfly_instance } MixedRadixBase::RadersBase(len) => { // Rader's Algorithm requires an inner FFT of size len - 1 let inner_fft = inner_fft_fn(self, len - 1, direction); // try to construct our AVX2 rader's algorithm. If that fails (probably because the machine we're running on doesn't have AVX2), fall back to scalar let raders_instance = if let Ok(raders_avx) = RadersAvx2::::new(Arc::clone(&inner_fft)) { wrap_fft(raders_avx) } else { wrap_fft(RadersAlgorithm::new(inner_fft)) }; // Cache this FFT instance for future calls to `plan_fft` self.cache.insert(&raders_instance); raders_instance } MixedRadixBase::BluesteinsBase(len, inner_fft_len) => { // Bluestein's has an inner FFT of arbitrary size. But we've already planned it, so just use what we planned let inner_fft = inner_fft_fn(self, inner_fft_len, direction); // try to construct our AVX2 rader's algorithm.
If that fails (probably because the machine we're running on doesn't have AVX2), fall back to scalar let bluesteins_instance = wrap_fft(BluesteinsAvx::::new(len, inner_fft).unwrap()); // Cache this FFT instance for future calls to `plan_fft` self.cache.insert(&bluesteins_instance); bluesteins_instance } }; // We have constructed our base. Now, construct the radix chain. for radix in plan.radixes { fft = match radix { 2 => wrap_fft(MixedRadix2xnAvx::::new(fft).unwrap()), 3 => wrap_fft(MixedRadix3xnAvx::::new(fft).unwrap()), 4 => wrap_fft(MixedRadix4xnAvx::::new(fft).unwrap()), 5 => wrap_fft(MixedRadix5xnAvx::::new(fft).unwrap()), 6 => wrap_fft(MixedRadix6xnAvx::::new(fft).unwrap()), 7 => wrap_fft(MixedRadix7xnAvx::::new(fft).unwrap()), 8 => wrap_fft(MixedRadix8xnAvx::::new(fft).unwrap()), 9 => wrap_fft(MixedRadix9xnAvx::::new(fft).unwrap()), 11 => wrap_fft(MixedRadix11xnAvx::::new(fft).unwrap()), 12 => wrap_fft(MixedRadix12xnAvx::::new(fft).unwrap()), 16 => wrap_fft(MixedRadix16xnAvx::::new(fft).unwrap()), _ => unreachable!(), }; // Cache this FFT instance for future calls to `plan_fft` self.cache.insert(&fft); } fft } // Plan and return the inner size to be used with Bluestein's Algorithm // Calls `filter_fn` on result candidates, giving the caller the opportunity to reject certain sizes fn plan_bluesteins( &self, len: usize, filter_fn: impl FnMut(&&(usize, u32, u32)) -> bool, ) -> usize { assert!(len > 1); // Internal consistency check: The logic in this method doesn't work for a length of 1 // Bluestein's computes a FFT of size `len` by reorganizing it as a FFT of ANY size greater than or equal to len * 2 - 1 // an obvious choice is the next power of two larger than len * 2 - 1, but if we can find a smaller FFT that will go faster, we can save a lot of time. // We can very efficiently compute almost any 2^n * 3^m, so we're going to search for all numbers of the form 2^n * 3^m that lie between len * 2 - 1 and the next power of two. let min_len = len * 2 - 1; let baseline_candidate = min_len.checked_next_power_of_two().unwrap(); // our algorithm here is to start with our next power of 2, and repeatedly divide by 2 and multiply by 3, trying to keep our value in range let mut bluesteins_candidates = Vec::new(); let mut candidate = baseline_candidate; let mut factor2 = candidate.trailing_zeros(); let mut factor3 = 0; let min_factor2 = 2; // benchmarking shows that while 3^n and 2 * 3^n are fast, they're typically slower than the next-higher candidate, so don't bother generating them while factor2 >= min_factor2 { // if this candidate length isn't too small, add it to our candidates list if candidate >= min_len { bluesteins_candidates.push((candidate, factor2, factor3)); } // if the candidate is too large, divide it by 2. if it's too small, divide it by 3 if candidate >= baseline_candidate { candidate >>= 1; factor2 -= 1; } else { candidate *= 3; factor3 += 1; } } bluesteins_candidates.sort(); // we now have a list of candidates to choosse from. 
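// The candidate walk above can be pulled out into a standalone form to see what it actually
// produces (illustrative sketch only; the function name is hypothetical, and the real
// planner also tracks the power-of-2/power-of-3 counts of each candidate for the filter
// described below). For len = 97 this yields, after sorting: 216, 256, 288, 324, 384, 432,
// 576, 648 -- values traced by hand from the loop, so treat them as illustrative.
fn bluesteins_candidate_sizes(len: usize) -> Vec<usize> {
    assert!(len > 1);
    let min_len = len * 2 - 1;
    let baseline = min_len.next_power_of_two();
    let mut candidates = Vec::new();
    let mut candidate = baseline;
    let mut factor2 = candidate.trailing_zeros();
    while factor2 >= 2 {
        if candidate >= min_len {
            candidates.push(candidate);
        }
        if candidate >= baseline {
            candidate >>= 1;
            factor2 -= 1;
        } else {
            candidate *= 3;
        }
    }
    candidates.sort();
    candidates
}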
some 2^n * 3^m FFTs are faster than others, so apply a filter, which will let us skip sizes that benchmarking has shown to be slow let (chosen_size, _, _) = bluesteins_candidates .iter() .find(filter_fn) .unwrap_or_else(|| { panic!( "Failed to find a bluestein's candidate for len={}, candidates: {:?}", len, bluesteins_candidates ) }); *chosen_size } } #[cfg(test)] mod unit_tests { use super::*; // We don't need to actually compute anything for a FFT size of zero, but we do need to verify that it doesn't explode #[test] fn test_plan_zero_avx() { let mut planner32 = FftPlannerAvx::::new().unwrap(); let fft_zero32 = planner32.plan_fft_forward(0); fft_zero32.process(&mut []); let mut planner64 = FftPlannerAvx::::new().unwrap(); let fft_zero64 = planner64.plan_fft_forward(0); fft_zero64.process(&mut []); } } rustfft-6.2.0/src/avx/avx_raders.rs000064400000000000000000000612500072674642500154530ustar 00000000000000use std::convert::TryInto; use std::sync::Arc; use std::{any::TypeId, arch::x86_64::*}; use num_complex::Complex; use num_integer::{div_ceil, Integer}; use num_traits::Zero; use primal_check::miller_rabin; use strength_reduce::StrengthReducedUsize; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::{array_utils, FftDirection}; use crate::{math_utils, twiddles}; use crate::{Direction, Fft, FftNum, Length}; use super::avx_vector; use super::{ avx_vector::{AvxArray, AvxArrayMut, AvxVector, AvxVector128, AvxVector256}, AvxNum, }; // This struct wraps the necessary data to compute (a * b) % divisor, where b and divisor are determined at runtime but rarely change, and a changes on every call. // It's written using AVX2 instructions and assumes the input a are 64-bit integers, and has a restriction that each a, b, and divisor must be 31-bit numbers or smaller. #[derive(Clone)] struct VectorizedMultiplyMod { b: __m256i, divisor: __m256i, intermediate: __m256i, } impl VectorizedMultiplyMod { #[target_feature(enable = "avx")] unsafe fn new(b: u32, divisor: u32) -> Self { assert!( divisor.leading_zeros() > 0, "divisor must be less than {}, got {}", 1 << 31, divisor ); let b = b % divisor; let intermediate = ((b as i64) << 32) / divisor as i64; Self { b: _mm256_set1_epi64x(b as i64), divisor: _mm256_set1_epi64x(divisor as i64), intermediate: _mm256_set1_epi64x(intermediate), } } // Input: 4 unsigned 64-bit numbers, each less than 2^30 // Output: (x * multiplier) % divisor for each x in input #[allow(unused)] #[inline(always)] unsafe fn mul_rem(&self, a: __m256i) -> __m256i { // Pretty hacky, but we need to prove to the compiler that each entry of the divisor is a 32-bit number, by blending the divisor vector with zeroes in the upper bits of each number. // If we don't do this manually, the compiler will do it anyways, but only for _mm256_mul_epu32, not for the _mm256_sub_epi64 correction step at the end // That inconstistency results in sub-optimal codegen where the compiler inserts extra code to handle the case where divisor is 64-bit. It also results in using one more register than necessary. // Since we know that can't happen, we can placate the compiler by explicitly zeroing the upper 32 bit of each divisor and relying on the compiler to lift it out of the loop. let masked_divisor = _mm256_blend_epi32(self.divisor, _mm256_setzero_si256(), 0xAA); // compute the integer quotient of (a * b) / divisor. 
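// A scalar model of the strength-reduction trick implemented here, for readers who want the
// arithmetic without the intrinsics (illustrative only; the function name is hypothetical).
// With intermediate = (b << 32) / divisor precomputed, the quotient of (a * b) / divisor is
// approximated by (a * intermediate) >> 32. That estimate can be low by at most 1, so the
// computed remainder lands in [0, 2 * divisor) and one conditional subtraction corrects it
// -- no runtime division needed.
fn mul_rem_scalar(a: u64, b: u64, divisor: u64) -> u64 {
    debug_assert!(a < (1 << 30) && divisor > 0 && divisor < (1 << 31));
    let b = b % divisor;
    let intermediate = (b << 32) / divisor; // precomputed once per (b, divisor) in the real code
    let quotient = (a * intermediate) >> 32;
    let mut remainder = a * b - quotient * divisor;
    if remainder >= divisor {
        remainder -= divisor;
    }
    remainder
}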
Our precomputed intermediate value lets us skip the expensive division via arithmetic strength reduction let quotient = _mm256_srli_epi64(_mm256_mul_epu32(a, self.intermediate), 32); // Now we can compute numerator - quotient * divisor to get the remanider let numerator = _mm256_mul_epu32(a, self.b); let quotient_product = _mm256_mul_epu32(quotient, masked_divisor); // Standard remainder formula: remainder = numerator - quotient * divisor let remainder = _mm256_sub_epi64(numerator, quotient_product); // it's possible for the "remainder" to end up between divisor and 2 * divisor. so we'll subtract divisor from remainder, which will make some of the result negative // We can then use the subtracted result as the input to a blendv. Sadly avx doesn't have a blendv_epi32 or blendv_epi64, so we're gonna do blendv_pd instead // this works because blendv looks at the uppermost bit to decide which variable to use, and for a two's complement i64, the upper most bit is 1 when the number is negative! // So when the subtraction result is negative, the uppermost bit is 1, which means the blend will choose the second param, which is the unsubtracted remainder let casted_remainder = _mm256_castsi256_pd(remainder); let subtracted_remainder = _mm256_castsi256_pd(_mm256_sub_epi64(remainder, masked_divisor)); let wrapped_remainder = _mm256_castpd_si256(_mm256_blendv_pd( subtracted_remainder, casted_remainder, subtracted_remainder, )); wrapped_remainder } } /// Implementation of Rader's Algorithm, using AVX2 instructions. /// /// This algorithm computes a prime-sized FFT in O(nlogn) time. It does this by converting this size n FFT into a /// size (n - 1) which is guaranteed to be composite. /// /// The worst case for this algorithm is when (n - 1) is 2 * prime, resulting in a /// [Cunningham Chain](https://en.wikipedia.org/wiki/Cunningham_chain) /// /// Rader's Algorithm is relatively expensive compared to other FFT algorithms. Benchmarking shows that it is up to /// an order of magnitude slower than similar composite sizes. pub struct RadersAvx2 { input_index_multiplier: VectorizedMultiplyMod, input_index_init: __m256i, output_index_mapping: Box<[__m128i]>, twiddles: Box<[A::VectorType]>, inner_fft: Arc>, len: usize, inplace_scratch_len: usize, outofplace_scratch_len: usize, direction: FftDirection, _phantom: std::marker::PhantomData, } impl RadersAvx2 { /// Preallocates necessary arrays and precomputes necessary data to efficiently compute the FFT /// Returns Ok(instance) if this machine has the required instruction sets ("avx", "fma", and "avx2"), Err() if some instruction sets are missing /// /// # Panics /// Panics if `inner_fft_len() + 1` is not a prime number. #[inline] pub fn new(inner_fft: Arc>) -> Result { // Internal sanity check: Make sure that A == T. // This struct has two generic parameters A and T, but they must always be the same, and are only kept separate to help work around the lack of specialization. // It would be cool if we could do this as a static_assert instead let id_a = TypeId::of::(); let id_t = TypeId::of::(); assert_eq!(id_a, id_t); let has_avx = is_x86_feature_detected!("avx"); let has_avx2 = is_x86_feature_detected!("avx2"); let has_fma = is_x86_feature_detected!("fma"); if has_avx && has_avx2 && has_fma { // Safety: new_with_avx2 requires the "avx" feature set. 
Since we know it's present, we're safe Ok(unsafe { Self::new_with_avx(inner_fft) }) } else { Err(()) } } #[target_feature(enable = "avx")] unsafe fn new_with_avx(inner_fft: Arc>) -> Self { let inner_fft_len = inner_fft.len(); let len = inner_fft_len + 1; assert!(miller_rabin(len as u64), "For raders algorithm, inner_fft.len() + 1 must be prime. Expected prime number, got {} + 1 = {}", inner_fft_len, len); let direction = inner_fft.fft_direction(); let reduced_len = StrengthReducedUsize::new(len); // compute the primitive root and its inverse for this size let primitive_root = math_utils::primitive_root(len as u64).unwrap() as usize; // compute the multiplicative inverse of primative_root mod len and vice versa. // i64::extended_gcd will compute both the inverse of left mod right, and the inverse of right mod left, but we're only goingto use one of them // the primtive root inverse might be negative, if so make it positive by wrapping let gcd_data = i64::extended_gcd(&(primitive_root as i64), &(len as i64)); let primitive_root_inverse = if gcd_data.x >= 0 { gcd_data.x } else { gcd_data.x + len as i64 } as usize; // precompute the coefficients to use inside the process method let inner_fft_scale = T::one() / T::from_usize(inner_fft_len).unwrap(); let mut inner_fft_input = vec![Complex::zero(); inner_fft_len]; let mut twiddle_input = 1; for input_cell in &mut inner_fft_input { let twiddle = twiddles::compute_twiddle(twiddle_input, len, direction); *input_cell = twiddle * inner_fft_scale; twiddle_input = (twiddle_input * primitive_root_inverse) % reduced_len; } let required_inner_scratch = inner_fft.get_inplace_scratch_len(); let extra_inner_scratch = if required_inner_scratch <= inner_fft_len { 0 } else { required_inner_scratch }; //precompute a FFT of our reordered twiddle factors let mut inner_fft_scratch = vec![Zero::zero(); required_inner_scratch]; inner_fft.process_with_scratch(&mut inner_fft_input, &mut inner_fft_scratch); // When computing the FFT, we'll want this array to be pre-conjugated, so conjugate it. at the same time, convert it to vectors for convenient use later. 
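// Small standalone illustration of why indexing by powers of the primitive root is valid
// (the concrete p = 7, g = 3 below are chosen for the example only; this code computes them
// at runtime): for a prime p with primitive root g, the sequence g^0, g^1, ..., g^(p-2)
// mod p visits every nonzero residue exactly once, so Rader's "read the input at
// x[g^i mod p]" step is a permutation of elements 1..p.
fn primitive_root_permutation() -> Vec<usize> {
    let (p, g) = (7usize, 3usize);
    let mut perm = Vec::with_capacity(p - 1);
    let mut power = 1usize;
    for _ in 0..(p - 1) {
        perm.push(power);
        power = (power * g) % p;
    }
    perm // [1, 3, 2, 6, 4, 5]: each of 1..=6 appears exactly once
}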
let conjugation_mask = AvxVector256::broadcast_complex_elements(Complex::new(A::zero(), -A::zero())); let inner_fft_multiplier: Box<[_]> = { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_inner_input: &mut [Complex] = array_utils::workaround_transmute_mut(&mut inner_fft_input); transmuted_inner_input .chunks(A::VectorType::COMPLEX_PER_VECTOR) .map(|chunk| { let chunk_vector = match chunk.len() { 1 => chunk.load_partial1_complex(0).zero_extend(), 2 => { if chunk.len() == A::VectorType::COMPLEX_PER_VECTOR { chunk.load_complex(0) } else { chunk.load_partial2_complex(0).zero_extend() } } 3 => chunk.load_partial3_complex(0), 4 => chunk.load_complex(0), _ => unreachable!(), }; AvxVector::xor(chunk_vector, conjugation_mask) // compute our conjugation by xoring our data with a precomputed mask }) .collect() }; // Set up the data for our input index remapping computation const NUM_POWERS: usize = 5; let mut root_powers = [0; NUM_POWERS]; let mut current_power = 1; for i in 0..NUM_POWERS { root_powers[i] = current_power; current_power = (current_power * primitive_root) % reduced_len; } let (input_index_multiplier, input_index_init) = if A::VectorType::COMPLEX_PER_VECTOR == 4 { ( VectorizedMultiplyMod::new(root_powers[4] as u32, len as u32), _mm256_loadu_si256(root_powers.as_ptr().add(1) as *const __m256i), ) } else { let duplicated_powers = [ root_powers[1], root_powers[1], root_powers[2], root_powers[2], ]; ( VectorizedMultiplyMod::new(root_powers[2] as u32, len as u32), _mm256_loadu_si256(duplicated_powers.as_ptr() as *const __m256i), ) }; // Set up our output index remapping. Ideally we could compute the output indexes on the fly, but the output reindexing requires scatter, which doesn't exist until avx-512 // Instead, we can invert the scatter indexes to be gather indexes. But if there's an algorithmic way to compute this, I don't know what it is -- // so we won't be able to compute it on the fly with some sort of VectorizedMultiplyMod thing. Instead, we're going to precompute the inverted mapping and gather from that mapping. // We want enough elements in our array to fill out an entire set of vectors so that we don't have to deal with any partial indexes etc. 
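// The scatter-to-gather inversion described above, written out in plain Rust as a sketch
// (hypothetical helper, assuming `scatter` is a permutation of 0..len): if scatter[i] says
// where element i should be written, then gather[scatter[i]] = i lets the code instead read
// from gather[j] when producing output element j, which is what the precomputed
// output_index_mapping provides.
fn invert_scatter(scatter: &[usize]) -> Vec<usize> {
    let mut gather = vec![0; scatter.len()];
    for (i, &destination) in scatter.iter().enumerate() {
        gather[destination] = i;
    }
    gather
}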
let mapping_size = 1 + div_ceil(len, A::VectorType::COMPLEX_PER_VECTOR) * A::VectorType::COMPLEX_PER_VECTOR; let mut output_mapping_inverse: Vec = vec![0; mapping_size]; let mut output_index = 1; for i in 1..len { output_index = (output_index * primitive_root_inverse) % reduced_len; output_mapping_inverse[output_index] = i.try_into().unwrap(); } // the actual vector of indexes depends on whether we're f32 or f64 let output_index_mapping = if A::VectorType::COMPLEX_PER_VECTOR == 4 { (&output_mapping_inverse[1..]) .chunks_exact(A::VectorType::COMPLEX_PER_VECTOR) .map(|chunk| _mm_loadu_si128(chunk.as_ptr() as *const __m128i)) .collect::>() } else { (&output_mapping_inverse[1..]) .chunks_exact(A::VectorType::COMPLEX_PER_VECTOR) .map(|chunk| { let duplicated_indexes = [chunk[0], chunk[0], chunk[1], chunk[1]]; _mm_loadu_si128(duplicated_indexes.as_ptr() as *const __m128i) }) .collect::>() }; Self { input_index_multiplier, input_index_init, output_index_mapping, inner_fft: inner_fft, twiddles: inner_fft_multiplier, len, inplace_scratch_len: len + extra_inner_scratch, outofplace_scratch_len: extra_inner_scratch, direction, _phantom: std::marker::PhantomData, } } // Do the necessary setup for rader's algorithm: Reorder the inputs into the output buffer, gather a sum of all inputs. Return the first input, and the aum of all inputs #[target_feature(enable = "avx2", enable = "avx", enable = "fma")] unsafe fn prepare_raders(&self, input: &[Complex], output: &mut [Complex]) { let mut indexes = self.input_index_init; let index_multiplier = self.input_index_multiplier.clone(); // loop over the output array and use AVX gathers to reorder data from the input let mut chunks_iter = (&mut output[1..]).chunks_exact_mut(A::VectorType::COMPLEX_PER_VECTOR); for mut chunk in chunks_iter.by_ref() { let gathered_elements = A::VectorType::gather_complex_avx2_index64(input.as_ptr(), indexes); // advance our indexes indexes = index_multiplier.mul_rem(indexes); // Store this chunk chunk.store_complex(gathered_elements, 0); } // at this point, we either have 0 or 2 remaining elements to gather. because we know our length ends in 1 or 3. so when we subtract 1 for the inner FFT, that gives us 0 or 2 let mut output_remainder = chunks_iter.into_remainder(); if output_remainder.len() == 2 { let half_data = AvxVector128::gather64_complex_avx2( input.as_ptr(), _mm256_castsi256_si128(indexes), ); // store the remainder in the last chunk output_remainder.store_partial2_complex(half_data, 0); } } // Do the necessary finalization for rader's algorithm: Reorder the inputs into the output buffer, conjugating the input as we go, and add the first input value to every output value #[target_feature(enable = "avx2", enable = "avx", enable = "fma")] unsafe fn finalize_raders(&self, input: &[Complex], output: &mut [Complex]) { // We need to conjugate elements as a part of the finalization step, and sadly we can't roll it into any other instructions. So we'll do it via an xor. 
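// Scalar picture of the xor-based conjugation used in this file (illustrative only, and it
// assumes the IEEE-754 bit layout that the AVX code also relies on): the mask built from
// Complex::new(0.0, -0.0) has only the sign bit of the imaginary lane set, so xoring with it
// flips the sign of the imaginary part and leaves the real part untouched -- exactly a
// complex conjugation, with no arithmetic.
fn conjugate_via_xor(re: f32, im: f32) -> (f32, f32) {
    let sign_bit = (-0.0f32).to_bits(); // 0x8000_0000
    (re, f32::from_bits(im.to_bits() ^ sign_bit))
}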
let conjugation_mask = AvxVector256::broadcast_complex_elements(Complex::new(A::zero(), -A::zero())); let mut chunks_iter = (&mut output[1..]).chunks_exact_mut(A::VectorType::COMPLEX_PER_VECTOR); for (i, mut chunk) in chunks_iter.by_ref().enumerate() { let index_chunk = *self.output_index_mapping.get_unchecked(i); let gathered_elements = A::VectorType::gather_complex_avx2_index32(input.as_ptr(), index_chunk); let conjugated_elements = AvxVector::xor(gathered_elements, conjugation_mask); chunk.store_complex(conjugated_elements, 0); } // at this point, we either have 0 or 2 remaining elements to gather. because we know our length ends in 1 or 3. so when we subtract 1 for the inner FFT, that gives us 0 or 2 let mut output_remainder = chunks_iter.into_remainder(); if output_remainder.len() == 2 { let index_chunk = *self .output_index_mapping .get_unchecked(self.output_index_mapping.len() - 1); let half_data = AvxVector128::gather32_complex_avx2(input.as_ptr(), index_chunk); let conjugated_elements = AvxVector::xor(half_data, conjugation_mask.lo()); output_remainder.store_partial2_complex(conjugated_elements, 0); } } fn perform_fft_out_of_place( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ) { unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_input: &mut [Complex] = array_utils::workaround_transmute_mut(input); let transmuted_output: &mut [Complex] = array_utils::workaround_transmute_mut(output); self.prepare_raders(transmuted_input, transmuted_output) } let (first_input, inner_input) = input.split_first_mut().unwrap(); let (first_output, inner_output) = output.split_first_mut().unwrap(); // perform the first of two inner FFTs let inner_scratch = if scratch.len() > 0 { &mut scratch[..] } else { &mut inner_input[..] }; self.inner_fft .process_with_scratch(inner_output, inner_scratch); // inner_output[0] now contains the sum of elements 1..n. we want the sum of all inputs, so all we need to do is add the first input *first_output = inner_output[0] + *first_input; // multiply the inner result with our cached setup data // also conjugate every entry. this sets us up to do an inverse FFT // (because an inverse FFT is equivalent to a normal FFT where you conjugate both the inputs and outputs) unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_inner_input: &mut [Complex] = array_utils::workaround_transmute_mut(inner_input); let transmuted_inner_output: &mut [Complex] = array_utils::workaround_transmute_mut(inner_output); avx_vector::pairwise_complex_mul_conjugated( transmuted_inner_output, transmuted_inner_input, &self.twiddles, ) }; // We need to add the first input value to all output values. We can accomplish this by adding it to the DC input of our inner ifft. // Of course, we have to conjugate it, just like we conjugated the complex multiplied above inner_input[0] = inner_input[0] + first_input.conj(); // execute the second FFT let inner_scratch = if scratch.len() > 0 { scratch } else { &mut inner_output[..] 
}; self.inner_fft .process_with_scratch(inner_input, inner_scratch); // copy the final values into the output, reordering as we go unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_input: &mut [Complex] = array_utils::workaround_transmute_mut(input); let transmuted_output: &mut [Complex] = array_utils::workaround_transmute_mut(output); self.finalize_raders(transmuted_input, transmuted_output); } } fn perform_fft_inplace(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { let (scratch, extra_scratch) = scratch.split_at_mut(self.len()); unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_scratch: &mut [Complex] = array_utils::workaround_transmute_mut(scratch); let transmuted_buffer: &mut [Complex] = array_utils::workaround_transmute_mut(buffer); self.prepare_raders(transmuted_buffer, transmuted_scratch) } let first_input = buffer[0]; let truncated_scratch = &mut scratch[1..]; // perform the first of two inner FFTs let inner_scratch = if extra_scratch.len() > 0 { extra_scratch } else { &mut buffer[..] }; self.inner_fft .process_with_scratch(truncated_scratch, inner_scratch); // truncated_scratch[0] now contains the sum of elements 1..n. we want the sum of all inputs, so all we need to do is add the first input let first_output = first_input + truncated_scratch[0]; // multiply the inner result with our cached setup data // also conjugate every entry. this sets us up to do an inverse FFT // (because an inverse FFT is equivalent to a normal FFT where you conjugate both the inputs and outputs) unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_scratch: &mut [Complex] = array_utils::workaround_transmute_mut(truncated_scratch); avx_vector::pairwise_complex_mul_assign_conjugated(transmuted_scratch, &self.twiddles) }; // We need to add the first input value to all output values. We can accomplish this by adding it to the DC input of our inner ifft. 
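// The "conjugate the inputs and outputs" identity used above can be checked with a tiny
// reference DFT. The sketch below is standalone (a naive O(n^2) DFT, not this crate's API)
// and assumes only the num-complex crate that rustfft already depends on; the 1/n scaling of
// a true inverse is ignored here because the cached twiddles in this file already fold that
// factor in.
fn naive_forward_dft(input: &[Complex<f64>]) -> Vec<Complex<f64>> {
    let n = input.len();
    (0..n)
        .map(|k| {
            (0..n).fold(Complex::new(0.0, 0.0), |acc, j| {
                let angle = -2.0 * std::f64::consts::PI * (j * k) as f64 / n as f64;
                acc + input[j] * Complex::new(angle.cos(), angle.sin())
            })
        })
        .collect()
}

// Unscaled inverse DFT, computed with only a forward DFT plus conjugations.
fn unscaled_inverse_dft(input: &[Complex<f64>]) -> Vec<Complex<f64>> {
    let conjugated: Vec<Complex<f64>> = input.iter().map(|c| c.conj()).collect();
    naive_forward_dft(&conjugated).iter().map(|c| c.conj()).collect()
}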
// Of course, we have to conjugate it, just like we conjugated the complex multiplied above truncated_scratch[0] = truncated_scratch[0] + first_input.conj(); // execute the second FFT self.inner_fft .process_with_scratch(truncated_scratch, inner_scratch); // copy the final values into the output, reordering as we go buffer[0] = first_output; unsafe { // Specialization workaround: See the comments in FftPlannerAvx::new() for why these calls to array_utils::workaround_transmute are necessary let transmuted_scratch: &mut [Complex] = array_utils::workaround_transmute_mut(scratch); let transmuted_buffer: &mut [Complex] = array_utils::workaround_transmute_mut(buffer); self.finalize_raders(transmuted_scratch, transmuted_buffer); } } } boilerplate_avx_fft!( RadersAvx2, |this: &RadersAvx2<_, _>| this.len, |this: &RadersAvx2<_, _>| this.inplace_scratch_len, |this: &RadersAvx2<_, _>| this.outofplace_scratch_len ); #[cfg(test)] mod unit_tests { use num_traits::Float; use rand::distributions::uniform::SampleUniform; use super::*; use crate::algorithm::Dft; use crate::test_utils::check_fft_algorithm; use std::sync::Arc; #[test] fn test_raders_avx_f32() { for len in 3..100 { if miller_rabin(len as u64) { test_raders_with_length::(len, FftDirection::Forward); test_raders_with_length::(len, FftDirection::Inverse); } } } #[test] fn test_raders_avx_f64() { for len in 3..100 { if miller_rabin(len as u64) { test_raders_with_length::(len, FftDirection::Forward); test_raders_with_length::(len, FftDirection::Inverse); } } } fn test_raders_with_length( len: usize, direction: FftDirection, ) { let inner_fft = Arc::new(Dft::new(len - 1, direction)); let fft = RadersAvx2::::new(inner_fft).unwrap(); check_fft_algorithm::(&fft, len, direction); } } rustfft-6.2.0/src/avx/avx_vector.rs000064400000000000000000003232170072674642500155010ustar 00000000000000use std::arch::x86_64::*; use std::fmt::Debug; use std::ops::{Deref, DerefMut}; use num_complex::Complex; use num_traits::Zero; use crate::{array_utils::DoubleBuf, twiddles, FftDirection}; use super::AvxNum; /// A SIMD vector of complex numbers, stored with the real values and imaginary values interleaved. /// Implemented for __m128, __m128d, __m256, __m256d, but these all require the AVX instruction set. 
/// /// The goal of this trait is to reduce code duplication by letting code be generic over the vector type pub trait AvxVector: Copy + Debug + Send + Sync { const SCALAR_PER_VECTOR: usize; const COMPLEX_PER_VECTOR: usize; // useful constants unsafe fn zero() -> Self; unsafe fn half_root2() -> Self; // an entire vector filled with 0.5.sqrt() // Basic operations that map directly to 1-2 AVX intrinsics unsafe fn add(left: Self, right: Self) -> Self; unsafe fn sub(left: Self, right: Self) -> Self; unsafe fn xor(left: Self, right: Self) -> Self; unsafe fn neg(self) -> Self; unsafe fn mul(left: Self, right: Self) -> Self; unsafe fn fmadd(left: Self, right: Self, add: Self) -> Self; unsafe fn fnmadd(left: Self, right: Self, add: Self) -> Self; unsafe fn fmaddsub(left: Self, right: Self, add: Self) -> Self; unsafe fn fmsubadd(left: Self, right: Self, add: Self) -> Self; // More basic operations that end up being implemented in 1-2 intrinsics, but unlike the ones above, these have higher-level meaning than just arithmetic /// Swap each real number with its corresponding imaginary number unsafe fn swap_complex_components(self) -> Self; /// first return is the reals duplicated into the imaginaries, second return is the imaginaries duplicated into the reals unsafe fn duplicate_complex_components(self) -> (Self, Self); /// Reverse the order of complex numbers in the vector, so that the last is the first and the first is the last unsafe fn reverse_complex_elements(self) -> Self; /// Copies the even elements of rows[1] into the corresponding odd elements of rows[0] and returns the result. unsafe fn unpacklo_complex(rows: [Self; 2]) -> Self; /// Copies the odd elements of rows[0] into the corresponding even elements of rows[1] and returns the result. unsafe fn unpackhi_complex(rows: [Self; 2]) -> Self; #[inline(always)] unsafe fn unpack_complex(rows: [Self; 2]) -> [Self; 2] { [Self::unpacklo_complex(rows), Self::unpackhi_complex(rows)] } /// Fill a vector by computing a twiddle factor and repeating it across the whole vector unsafe fn broadcast_twiddle(index: usize, len: usize, direction: FftDirection) -> Self; /// create a Rotator90 instance to rotate complex numbers either 90 or 270 degrees, based on the value of `inverse` unsafe fn make_rotation90(direction: FftDirection) -> Rotation90; /// Generates a chunk of twiddle factors starting at (X,Y) and incrementing X `COMPLEX_PER_VECTOR` times. /// The result will be [twiddle(x*y, len), twiddle((x+1)*y, len), twiddle((x+2)*y, len), ...] for as many complex numbers fit in a vector unsafe fn make_mixedradix_twiddle_chunk( x: usize, y: usize, len: usize, direction: FftDirection, ) -> Self; /// Packed transposes. Used by mixed radix. These all take a NxC array, where C is COMPLEX_PER_VECTOR, and transpose it to a CxN array. /// But they also pack the result into as few vectors as possible, with the goal of writing the transposed data out contiguously. 
unsafe fn transpose2_packed(rows: [Self; 2]) -> [Self; 2]; unsafe fn transpose3_packed(rows: [Self; 3]) -> [Self; 3]; unsafe fn transpose4_packed(rows: [Self; 4]) -> [Self; 4]; unsafe fn transpose5_packed(rows: [Self; 5]) -> [Self; 5]; unsafe fn transpose6_packed(rows: [Self; 6]) -> [Self; 6]; unsafe fn transpose7_packed(rows: [Self; 7]) -> [Self; 7]; unsafe fn transpose8_packed(rows: [Self; 8]) -> [Self; 8]; unsafe fn transpose9_packed(rows: [Self; 9]) -> [Self; 9]; unsafe fn transpose11_packed(rows: [Self; 11]) -> [Self; 11]; unsafe fn transpose12_packed(rows: [Self; 12]) -> [Self; 12]; unsafe fn transpose16_packed(rows: [Self; 16]) -> [Self; 16]; /// Pairwise multiply the complex numbers in `left` with the complex numbers in `right`. #[inline(always)] unsafe fn mul_complex(left: Self, right: Self) -> Self { // Extract the real and imaginary components from left into 2 separate registers let (left_real, left_imag) = Self::duplicate_complex_components(left); // create a shuffled version of right where the imaginary values are swapped with the reals let right_shuffled = Self::swap_complex_components(right); // multiply our duplicated imaginary left vector by our shuffled right vector. that will give us the right side of the traditional complex multiplication formula let output_right = Self::mul(left_imag, right_shuffled); // use a FMA instruction to multiply together left side of the complex multiplication formula, then alternatingly add and subtract the left side from the right Self::fmaddsub(left_real, right, output_right) } #[inline(always)] unsafe fn rotate90(self, rotation: Rotation90) -> Self { // Use the pre-computed vector stored in the Rotation90 instance to negate either the reals or imaginaries let negated = Self::xor(self, rotation.0); // Our goal is to swap the reals with the imaginaries, then negate either the reals or the imaginaries, based on whether we're an inverse or not Self::swap_complex_components(negated) } #[inline(always)] unsafe fn column_butterfly2(rows: [Self; 2]) -> [Self; 2] { [Self::add(rows[0], rows[1]), Self::sub(rows[0], rows[1])] } #[inline(always)] unsafe fn column_butterfly3(rows: [Self; 3], twiddles: Self) -> [Self; 3] { // This algorithm is derived directly from the definition of the Dft of size 3 // We'd theoretically have to do 4 complex multiplications, but all of the twiddles we'd be multiplying by are conjugates of each other // By doing some algebra to expand the complex multiplications and factor out the multiplications, we get this let [mut mid1, mid2] = Self::column_butterfly2([rows[1], rows[2]]); let output0 = Self::add(rows[0], mid1); let (twiddle_real, twiddle_imag) = Self::duplicate_complex_components(twiddles); mid1 = Self::fmadd(mid1, twiddle_real, rows[0]); let rotation = Self::make_rotation90(FftDirection::Inverse); let mid2_rotated = Self::rotate90(mid2, rotation); let output1 = Self::fmadd(mid2_rotated, twiddle_imag, mid1); let output2 = Self::fnmadd(mid2_rotated, twiddle_imag, mid1); [output0, output1, output2] } #[inline(always)] unsafe fn column_butterfly4(rows: [Self; 4], rotation: Rotation90) -> [Self; 4] { // Algorithm: 2x2 mixed radix // Perform the first set of size-2 FFTs. 
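// Scalar walk-through of the mul_complex recipe above, for reference (illustrative only;
// each complex number is written as a (re, im) pair standing in for two vector lanes).
// fmaddsub subtracts in the even lane and adds in the odd lane, which is exactly the sign
// pattern of (a + bi)(c + di) = (ac - bd) + (ad + bc)i.
fn mul_complex_scalar(left: (f32, f32), right: (f32, f32)) -> (f32, f32) {
    let (left_re, left_im) = left;              // duplicate_complex_components
    let right_shuffled = (right.1, right.0);    // swap_complex_components
    let output_right = (left_im * right_shuffled.0, left_im * right_shuffled.1); // (b*d, b*c)
    // fmaddsub(left_re, right, output_right): even lane = a*c - b*d, odd lane = a*d + b*c
    (
        left_re * right.0 - output_right.0,
        left_re * right.1 + output_right.1,
    )
}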
let [mid0, mid2] = Self::column_butterfly2([rows[0], rows[2]]); let [mid1, mid3] = Self::column_butterfly2([rows[1], rows[3]]); // Apply twiddle factors (in this case just a rotation) let mid3_rotated = mid3.rotate90(rotation); // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = Self::column_butterfly2([mid0, mid1]); let [output2, output3] = Self::column_butterfly2([mid2, mid3_rotated]); // Swap outputs 1 and 2 in the output to do a square transpose [output0, output2, output1, output3] } // A niche variant of column_butterfly4 that negates row 3 before performing the FFT. It's able to roll it into existing instructions, so the negation is free #[inline(always)] unsafe fn column_butterfly4_negaterow3( rows: [Self; 4], rotation: Rotation90, ) -> [Self; 4] { // Algorithm: 2x2 mixed radix // Perform the first set of size-2 FFTs. let [mid0, mid2] = Self::column_butterfly2([rows[0], rows[2]]); let (mid1, mid3) = (Self::sub(rows[1], rows[3]), Self::add(rows[1], rows[3])); // to negate row 3, swap add and sub in the butterfly 2 // Apply twiddle factors (in this case just a rotation) let mid3_rotated = mid3.rotate90(rotation); // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = Self::column_butterfly2([mid0, mid1]); let [output2, output3] = Self::column_butterfly2([mid2, mid3_rotated]); // Swap outputs 1 and 2 in the output to do a square transpose [output0, output2, output1, output3] } #[inline(always)] unsafe fn column_butterfly5(rows: [Self; 5], twiddles: [Self; 2]) -> [Self; 5] { // This algorithm is derived directly from the definition of the Dft of size 5 // We'd theoretically have to do 16 complex multiplications for the Dft, but many of the twiddles we'd be multiplying by are conjugates of each other // By doing some algebra to expand the complex multiplications and factor out the real multiplications, we get this faster formula where we only do the equivalent of 4 multiplications // do some prep work before we can start applying twiddle factors let [sum1, diff4] = Self::column_butterfly2([rows[1], rows[4]]); let [sum2, diff3] = Self::column_butterfly2([rows[2], rows[3]]); let rotation = Self::make_rotation90(FftDirection::Inverse); let rotated4 = Self::rotate90(diff4, rotation); let rotated3 = Self::rotate90(diff3, rotation); // to compute the first output, compute the sum of all elements. sum1 and sum2 already have the sum of 1+4 and 2+3 respectively, so if we add them, we'll get the sum of all 4 let sum1234 = Self::add(sum1, sum2); let output0 = Self::add(rows[0], sum1234); // apply twiddle factors let (twiddles0_re, twiddles0_im) = Self::duplicate_complex_components(twiddles[0]); let (twiddles1_re, twiddles1_im) = Self::duplicate_complex_components(twiddles[1]); let twiddled1_mid = Self::fmadd(twiddles0_re, sum1, rows[0]); let twiddled2_mid = Self::fmadd(twiddles1_re, sum1, rows[0]); let twiddled3_mid = Self::mul(twiddles1_im, rotated4); let twiddled4_mid = Self::mul(twiddles0_im, rotated4); let twiddled1 = Self::fmadd(twiddles1_re, sum2, twiddled1_mid); let twiddled2 = Self::fmadd(twiddles0_re, sum2, twiddled2_mid); let twiddled3 = Self::fnmadd(twiddles0_im, rotated3, twiddled3_mid); // fnmadd instead of fmadd because we're actually re-using twiddle0 here. 
remember that this algorithm is all about factoring out conjugated multiplications -- this negation of the twiddle0 imaginaries is a reflection of one of those conugations let twiddled4 = Self::fmadd(twiddles1_im, rotated3, twiddled4_mid); // Post-processing to mix the twiddle factors between the rest of the output let [output1, output4] = Self::column_butterfly2([twiddled1, twiddled4]); let [output2, output3] = Self::column_butterfly2([twiddled2, twiddled3]); [output0, output1, output2, output3, output4] } #[inline(always)] unsafe fn column_butterfly7(rows: [Self; 7], twiddles: [Self; 3]) -> [Self; 7] { // This algorithm is derived directly from the definition of the Dft of size 7 // We'd theoretically have to do 36 complex multiplications for the Dft, but many of the twiddles we'd be multiplying by are conjugates of each other // By doing some algebra to expand the complex multiplications and factor out the real multiplications, we get this faster formula where we only do the equivalent of 9 multiplications // do some prep work before we can start applying twiddle factors let [sum1, diff6] = Self::column_butterfly2([rows[1], rows[6]]); let [sum2, diff5] = Self::column_butterfly2([rows[2], rows[5]]); let [sum3, diff4] = Self::column_butterfly2([rows[3], rows[4]]); let rotation = Self::make_rotation90(FftDirection::Inverse); let rotated4 = Self::rotate90(diff4, rotation); let rotated5 = Self::rotate90(diff5, rotation); let rotated6 = Self::rotate90(diff6, rotation); // to compute the first output, compute the sum of all elements. sum1, sum2, and sum3 already have the sum of 1+6 and 2+5 and 3+4 respectively, so if we add them, we'll get the sum of all 6 let output0_left = Self::add(sum1, sum2); let output0_right = Self::add(sum3, rows[0]); let output0 = Self::add(output0_left, output0_right); // apply twiddle factors. This is probably pushing the limit of how much we should do with this technique. // We probably shouldn't do a size-11 FFT with this technique, for example, because this block of multiplies would grow quadratically let (twiddles0_re, twiddles0_im) = Self::duplicate_complex_components(twiddles[0]); let (twiddles1_re, twiddles1_im) = Self::duplicate_complex_components(twiddles[1]); let (twiddles2_re, twiddles2_im) = Self::duplicate_complex_components(twiddles[2]); // Let's do a plain 7-point Dft // | X0 | | W0 W0 W0 W0 W0 W0 W0 | | x0 | // | X1 | | W0 W1 W2 W3 W4 W5 W6 | | x1 | // | X2 | | W0 W2 W4 W6 W8 W10 W12 | | x2 | // | X3 | | W0 W3 W6 W9 W12 W15 W18 | | x3 | // | X4 | | W0 W4 W8 W12 W16 W20 W24 | | x4 | // | X5 | | W0 W5 W10 W15 W20 W25 W30 | | x5 | // | X6 | | W0 W6 W12 W18 W24 W30 W36 | | x6 | // where Wn = exp(-2*pi*n/7) for a forward transform, and exp(+2*pi*n/7) for an inverse. 
// Next, take advantage of the fact that twiddle factor indexes for a size-7 Dft are cyclical mod 7 // | X0 | | W0 W0 W0 W0 W0 W0 W0 | | x0 | // | X1 | | W0 W1 W2 W3 W4 W5 W6 | | x1 | // | X2 | | W0 W2 W4 W6 W1 W3 W5 | | x2 | // | X3 | | W0 W3 W6 W2 W5 W1 W4 | | x3 | // | X4 | | W0 W4 W1 W5 W2 W6 W3 | | x4 | // | X5 | | W0 W5 W3 W1 W6 W4 W2 | | x5 | // | X6 | | W0 W6 W5 W4 W3 W2 W1 | | x6 | // Finally, take advantage of the fact that for a size-7 Dft, // twiddles 4 through 6 are conjugates of twiddes 3 through 0 (Asterisk marks conjugates) // | X0 | | W0 W0 W0 W0 W0 W0 W0 | | x0 | // | X1 | | W0 W1 W2 W3 W3* W2* W1* | | x1 | // | X2 | | W0 W2 W3* W1* W1 W3 W2* | | x2 | // | X3 | | W0 W3 W1* W2 W2* W1 W3* | | x3 | // | X4 | | W0 W3* W1 W2* W2 W1* W3 | | x4 | // | X5 | | W0 W2* W3 W1 W1* W3* W2 | | x5 | // | X6 | | W0 W1* W2* W3* W3 W2 W1 | | x6 | let twiddled1_mid = Self::fmadd(twiddles0_re, sum1, rows[0]); let twiddled2_mid = Self::fmadd(twiddles1_re, sum1, rows[0]); let twiddled3_mid = Self::fmadd(twiddles2_re, sum1, rows[0]); let twiddled4_mid = Self::mul(twiddles2_im, rotated6); let twiddled5_mid = Self::mul(twiddles1_im, rotated6); let twiddled6_mid = Self::mul(twiddles0_im, rotated6); let twiddled1_mid2 = Self::fmadd(twiddles1_re, sum2, twiddled1_mid); let twiddled2_mid2 = Self::fmadd(twiddles2_re, sum2, twiddled2_mid); let twiddled3_mid2 = Self::fmadd(twiddles0_re, sum2, twiddled3_mid); let twiddled4_mid2 = Self::fnmadd(twiddles0_im, rotated5, twiddled4_mid); // fnmadd instead of fmadd because we're actually re-using twiddle0 here. remember that this algorithm is all about factoring out conjugated multiplications -- this negation of the twiddle0 imaginaries is a reflection of one of those conugations let twiddled5_mid2 = Self::fnmadd(twiddles2_im, rotated5, twiddled5_mid); let twiddled6_mid2 = Self::fmadd(twiddles1_im, rotated5, twiddled6_mid); let twiddled1 = Self::fmadd(twiddles2_re, sum3, twiddled1_mid2); let twiddled2 = Self::fmadd(twiddles0_re, sum3, twiddled2_mid2); let twiddled3 = Self::fmadd(twiddles1_re, sum3, twiddled3_mid2); let twiddled4 = Self::fmadd(twiddles1_im, rotated4, twiddled4_mid2); let twiddled5 = Self::fnmadd(twiddles0_im, rotated4, twiddled5_mid2); let twiddled6 = Self::fmadd(twiddles2_im, rotated4, twiddled6_mid2); // Post-processing to mix the twiddle factors between the rest of the output let [output1, output6] = Self::column_butterfly2([twiddled1, twiddled6]); let [output2, output5] = Self::column_butterfly2([twiddled2, twiddled5]); let [output3, output4] = Self::column_butterfly2([twiddled3, twiddled4]); [ output0, output1, output2, output3, output4, output5, output6, ] } #[inline(always)] unsafe fn column_butterfly8(rows: [Self; 8], rotation: Rotation90) -> [Self; 8] { // Algorithm: 4x2 mixed radix // Size-4 FFTs down the columns let mid0 = Self::column_butterfly4([rows[0], rows[2], rows[4], rows[6]], rotation); let mut mid1 = Self::column_butterfly4([rows[1], rows[3], rows[5], rows[7]], rotation); // Apply twiddle factors mid1[1] = apply_butterfly8_twiddle1(mid1[1], rotation); mid1[2] = mid1[2].rotate90(rotation); mid1[3] = apply_butterfly8_twiddle3(mid1[3], rotation); // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = Self::column_butterfly2([mid0[0], mid1[0]]); let [output2, output3] = Self::column_butterfly2([mid0[1], mid1[1]]); let [output4, output5] = Self::column_butterfly2([mid0[2], mid1[2]]); let [output6, output7] = Self::column_butterfly2([mid0[3], mid1[3]]); [ output0, output2, output4, output6, output1, 
output3, output5, output7, ] } #[inline(always)] unsafe fn column_butterfly11(rows: [Self; 11], twiddles: [Self; 5]) -> [Self; 11] { // This algorithm is derived directly from the definition of the Dft of size 11 // We'd theoretically have to do 100 complex multiplications for the Dft, but many of the twiddles we'd be multiplying by are conjugates of each other // By doing some algebra to expand the complex multiplications and factor out the real multiplications, we get this faster formula where we only do the equivalent of 9 multiplications // do some prep work before we can start applying twiddle factors let [sum1, diff10] = Self::column_butterfly2([rows[1], rows[10]]); let [sum2, diff9] = Self::column_butterfly2([rows[2], rows[9]]); let [sum3, diff8] = Self::column_butterfly2([rows[3], rows[8]]); let [sum4, diff7] = Self::column_butterfly2([rows[4], rows[7]]); let [sum5, diff6] = Self::column_butterfly2([rows[5], rows[6]]); let rotation = Self::make_rotation90(FftDirection::Inverse); let rotated10 = Self::rotate90(diff10, rotation); let rotated9 = Self::rotate90(diff9, rotation); let rotated8 = Self::rotate90(diff8, rotation); let rotated7 = Self::rotate90(diff7, rotation); let rotated6 = Self::rotate90(diff6, rotation); // to compute the first output, compute the sum of all elements. sum1, sum2, and sum3 already have the sum of 1+6 and 2+5 and 3+4 respectively, so if we add them, we'll get the sum of all 6 let sum01 = Self::add(rows[0], sum1); let sum23 = Self::add(sum2, sum3); let sum45 = Self::add(sum4, sum5); let sum0123 = Self::add(sum01, sum23); let output0 = Self::add(sum0123, sum45); // apply twiddle factors. This is probably pushing the limit of how much we should do with this technique. // We probably shouldn't do a size-11 FFT with this technique, for example, because this block of multiplies would grow quadratically let (twiddles0_re, twiddles0_im) = Self::duplicate_complex_components(twiddles[0]); let (twiddles1_re, twiddles1_im) = Self::duplicate_complex_components(twiddles[1]); let (twiddles2_re, twiddles2_im) = Self::duplicate_complex_components(twiddles[2]); let (twiddles3_re, twiddles3_im) = Self::duplicate_complex_components(twiddles[3]); let (twiddles4_re, twiddles4_im) = Self::duplicate_complex_components(twiddles[4]); let twiddled1 = Self::fmadd(twiddles0_re, sum1, rows[0]); let twiddled2 = Self::fmadd(twiddles1_re, sum1, rows[0]); let twiddled3 = Self::fmadd(twiddles2_re, sum1, rows[0]); let twiddled4 = Self::fmadd(twiddles3_re, sum1, rows[0]); let twiddled5 = Self::fmadd(twiddles4_re, sum1, rows[0]); let twiddled6 = Self::mul(twiddles4_im, rotated10); let twiddled7 = Self::mul(twiddles3_im, rotated10); let twiddled8 = Self::mul(twiddles2_im, rotated10); let twiddled9 = Self::mul(twiddles1_im, rotated10); let twiddled10 = Self::mul(twiddles0_im, rotated10); let twiddled1 = Self::fmadd(twiddles1_re, sum2, twiddled1); let twiddled2 = Self::fmadd(twiddles3_re, sum2, twiddled2); let twiddled3 = Self::fmadd(twiddles4_re, sum2, twiddled3); let twiddled4 = Self::fmadd(twiddles2_re, sum2, twiddled4); let twiddled5 = Self::fmadd(twiddles0_re, sum2, twiddled5); let twiddled6 = Self::fnmadd(twiddles0_im, rotated9, twiddled6); let twiddled7 = Self::fnmadd(twiddles2_im, rotated9, twiddled7); let twiddled8 = Self::fnmadd(twiddles4_im, rotated9, twiddled8); let twiddled9 = Self::fmadd(twiddles3_im, rotated9, twiddled9); let twiddled10 = Self::fmadd(twiddles1_im, rotated9, twiddled10); let twiddled1 = Self::fmadd(twiddles2_re, sum3, twiddled1); let twiddled2 = 
Self::fmadd(twiddles4_re, sum3, twiddled2); let twiddled3 = Self::fmadd(twiddles1_re, sum3, twiddled3); let twiddled4 = Self::fmadd(twiddles0_re, sum3, twiddled4); let twiddled5 = Self::fmadd(twiddles3_re, sum3, twiddled5); let twiddled6 = Self::fmadd(twiddles3_im, rotated8, twiddled6); let twiddled7 = Self::fmadd(twiddles0_im, rotated8, twiddled7); let twiddled8 = Self::fnmadd(twiddles1_im, rotated8, twiddled8); let twiddled9 = Self::fnmadd(twiddles4_im, rotated8, twiddled9); let twiddled10 = Self::fmadd(twiddles2_im, rotated8, twiddled10); let twiddled1 = Self::fmadd(twiddles3_re, sum4, twiddled1); let twiddled2 = Self::fmadd(twiddles2_re, sum4, twiddled2); let twiddled3 = Self::fmadd(twiddles0_re, sum4, twiddled3); let twiddled4 = Self::fmadd(twiddles4_re, sum4, twiddled4); let twiddled5 = Self::fmadd(twiddles1_re, sum4, twiddled5); let twiddled6 = Self::fnmadd(twiddles1_im, rotated7, twiddled6); let twiddled7 = Self::fmadd(twiddles4_im, rotated7, twiddled7); let twiddled8 = Self::fmadd(twiddles0_im, rotated7, twiddled8); let twiddled9 = Self::fnmadd(twiddles2_im, rotated7, twiddled9); let twiddled10 = Self::fmadd(twiddles3_im, rotated7, twiddled10); let twiddled1 = Self::fmadd(twiddles4_re, sum5, twiddled1); let twiddled2 = Self::fmadd(twiddles0_re, sum5, twiddled2); let twiddled3 = Self::fmadd(twiddles3_re, sum5, twiddled3); let twiddled4 = Self::fmadd(twiddles1_re, sum5, twiddled4); let twiddled5 = Self::fmadd(twiddles2_re, sum5, twiddled5); let twiddled6 = Self::fmadd(twiddles2_im, rotated6, twiddled6); let twiddled7 = Self::fnmadd(twiddles1_im, rotated6, twiddled7); let twiddled8 = Self::fmadd(twiddles3_im, rotated6, twiddled8); let twiddled9 = Self::fnmadd(twiddles0_im, rotated6, twiddled9); let twiddled10 = Self::fmadd(twiddles4_im, rotated6, twiddled10); // Post-processing to mix the twiddle factors between the rest of the output let [output1, output10] = Self::column_butterfly2([twiddled1, twiddled10]); let [output2, output9] = Self::column_butterfly2([twiddled2, twiddled9]); let [output3, output8] = Self::column_butterfly2([twiddled3, twiddled8]); let [output4, output7] = Self::column_butterfly2([twiddled4, twiddled7]); let [output5, output6] = Self::column_butterfly2([twiddled5, twiddled6]); [ output0, output1, output2, output3, output4, output5, output6, output7, output8, output9, output10, ] } #[inline(always)] unsafe fn column_butterfly16( rows: [Self; 16], twiddles: [Self; 2], rotation: Rotation90, ) -> [Self; 16] { // Algorithm: 4x4 mixed radix // Size-4 FFTs down the columns let mid0 = Self::column_butterfly4([rows[0], rows[4], rows[8], rows[12]], rotation); let mut mid1 = Self::column_butterfly4([rows[1], rows[5], rows[9], rows[13]], rotation); let mut mid2 = Self::column_butterfly4([rows[2], rows[6], rows[10], rows[14]], rotation); let mut mid3 = Self::column_butterfly4([rows[3], rows[7], rows[11], rows[15]], rotation); // Apply twiddle factors mid1[1] = Self::mul_complex(mid1[1], twiddles[0]); // for twiddle(2, 16), we can use the butterfly8 twiddle1 instead, which takes fewer instructions and fewer multiplies mid2[1] = apply_butterfly8_twiddle1(mid2[1], rotation); mid1[2] = apply_butterfly8_twiddle1(mid1[2], rotation); // for twiddle(3,16), we can use twiddle(1,16), sort of, but we'd need a branch, and at this point it's easier to just have another vector mid3[1] = Self::mul_complex(mid3[1], twiddles[1]); mid1[3] = Self::mul_complex(mid1[3], twiddles[1]); // twiddle(4,16) is just a rotate mid2[2] = mid2[2].rotate90(rotation); // for twiddle(6, 16), we can use the 
butterfly8 twiddle3 instead, which takes fewer instructions and fewer multiplies mid3[2] = apply_butterfly8_twiddle3(mid3[2], rotation); mid2[3] = apply_butterfly8_twiddle3(mid2[3], rotation); // twiddle(9, 16) is twiddle (1,16) negated. we're just going to use the same twiddle for now, and apply the negation as a part of our subsequent butterfly 4's mid3[3] = Self::mul_complex(mid3[3], twiddles[0]); // Up next is a transpose, but since everything is already in registers, we don't actually have to transpose anything! // "transpose" and thne apply butterfly 4's across the columns of our 4x4 array let output0 = Self::column_butterfly4([mid0[0], mid1[0], mid2[0], mid3[0]], rotation); let output1 = Self::column_butterfly4([mid0[1], mid1[1], mid2[1], mid3[1]], rotation); let output2 = Self::column_butterfly4([mid0[2], mid1[2], mid2[2], mid3[2]], rotation); let output3 = Self::column_butterfly4_negaterow3([mid0[3], mid1[3], mid2[3], mid3[3]], rotation); // finish the twiddle of the last row by negating it // finally, one more transpose [ output0[0], output1[0], output2[0], output3[0], output0[1], output1[1], output2[1], output3[1], output0[2], output1[2], output2[2], output3[2], output0[3], output1[3], output2[3], output3[3], ] } } /// A 256-bit SIMD vector of complex numbers, stored with the real values and imaginary values interleaved. /// Implemented for __m256, __m256d /// /// This trait implements things specific to 256-types, like splitting a 256 vector into 128 vectors /// For compiler-placation reasons, all interactions/awareness the scalar type go here pub trait AvxVector256: AvxVector { type HalfVector: AvxVector128; type ScalarType: AvxNum; unsafe fn lo(self) -> Self::HalfVector; unsafe fn hi(self) -> Self::HalfVector; unsafe fn split(self) -> (Self::HalfVector, Self::HalfVector) { (self.lo(), self.hi()) } unsafe fn merge(lo: Self::HalfVector, hi: Self::HalfVector) -> Self; /// Fill a vector by repeating the provided complex number as many times as possible unsafe fn broadcast_complex_elements(value: Complex) -> Self; /// Adds all complex elements from this vector horizontally unsafe fn hadd_complex(self) -> Complex; // loads/stores of complex numbers unsafe fn load_complex(ptr: *const Complex) -> Self; unsafe fn store_complex(ptr: *mut Complex, data: Self); // Gather 4 complex numbers (for f32) or 2 complex numbers (for f64) using 4 i32 indexes (for index32) or 4 i64 indexes (for index64). // For f32, there should be 1 index per complex. For f64, there should be 2 indexes, each duplicated // (So to load the complex at index 5 and 7, the index vector should contain 5,5,7,7. this api sucks but it's internal so whatever.) unsafe fn gather_complex_avx2_index32( ptr: *const Complex, indexes: __m128i, ) -> Self; unsafe fn gather_complex_avx2_index64( ptr: *const Complex, indexes: __m256i, ) -> Self; // loads/stores of partial vectors of complex numbers. 
When loading, empty elements are zeroed // unimplemented!() if Self::COMPLEX_PER_VECTOR is not greater than the partial count unsafe fn load_partial1_complex(ptr: *const Complex) -> Self::HalfVector; unsafe fn load_partial2_complex(ptr: *const Complex) -> Self::HalfVector; unsafe fn load_partial3_complex(ptr: *const Complex) -> Self; unsafe fn store_partial1_complex(ptr: *mut Complex, data: Self::HalfVector); unsafe fn store_partial2_complex(ptr: *mut Complex, data: Self::HalfVector); unsafe fn store_partial3_complex(ptr: *mut Complex, data: Self); #[inline(always)] unsafe fn column_butterfly6(rows: [Self; 6], twiddles: Self) -> [Self; 6] { // Algorithm: 3x2 good-thomas // Size-3 FFTs down the columns of our reordered array let mid0 = Self::column_butterfly3([rows[0], rows[2], rows[4]], twiddles); let mid1 = Self::column_butterfly3([rows[3], rows[5], rows[1]], twiddles); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = Self::column_butterfly2([mid0[0], mid1[0]]); let [output2, output3] = Self::column_butterfly2([mid0[1], mid1[1]]); let [output4, output5] = Self::column_butterfly2([mid0[2], mid1[2]]); // Reorder into output [output0, output3, output4, output1, output2, output5] } #[inline(always)] unsafe fn column_butterfly9( rows: [Self; 9], twiddles: [Self; 3], butterfly3_twiddles: Self, ) -> [Self; 9] { // Algorithm: 3x3 mixed radix // Size-3 FFTs down the columns let mid0 = Self::column_butterfly3([rows[0], rows[3], rows[6]], butterfly3_twiddles); let mut mid1 = Self::column_butterfly3([rows[1], rows[4], rows[7]], butterfly3_twiddles); let mut mid2 = Self::column_butterfly3([rows[2], rows[5], rows[8]], butterfly3_twiddles); // Apply twiddle factors. 
Note that we're re-using twiddles[1] mid1[1] = Self::mul_complex(twiddles[0], mid1[1]); mid1[2] = Self::mul_complex(twiddles[1], mid1[2]); mid2[1] = Self::mul_complex(twiddles[1], mid2[1]); mid2[2] = Self::mul_complex(twiddles[2], mid2[2]); let [output0, output1, output2] = Self::column_butterfly3([mid0[0], mid1[0], mid2[0]], butterfly3_twiddles); let [output3, output4, output5] = Self::column_butterfly3([mid0[1], mid1[1], mid2[1]], butterfly3_twiddles); let [output6, output7, output8] = Self::column_butterfly3([mid0[2], mid1[2], mid2[2]], butterfly3_twiddles); [ output0, output3, output6, output1, output4, output7, output2, output5, output8, ] } #[inline(always)] unsafe fn column_butterfly12( rows: [Self; 12], butterfly3_twiddles: Self, rotation: Rotation90, ) -> [Self; 12] { // Algorithm: 4x3 good-thomas // Size-4 FFTs down the columns of our reordered array let mid0 = Self::column_butterfly4([rows[0], rows[3], rows[6], rows[9]], rotation); let mid1 = Self::column_butterfly4([rows[4], rows[7], rows[10], rows[1]], rotation); let mid2 = Self::column_butterfly4([rows[8], rows[11], rows[2], rows[5]], rotation); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1, output2] = Self::column_butterfly3([mid0[0], mid1[0], mid2[0]], butterfly3_twiddles); let [output3, output4, output5] = Self::column_butterfly3([mid0[1], mid1[1], mid2[1]], butterfly3_twiddles); let [output6, output7, output8] = Self::column_butterfly3([mid0[2], mid1[2], mid2[2]], butterfly3_twiddles); let [output9, output10, output11] = Self::column_butterfly3([mid0[3], mid1[3], mid2[3]], butterfly3_twiddles); [ output0, output4, output8, output9, output1, output5, output6, output10, output2, output3, output7, output11, ] } } /// A 128-bit SIMD vector of complex numbers, stored with the real values and imaginary values interleaved. /// Implemented for __m128, __m128d, but these are all oriented around AVX, so don't call methods on these from a SSE-only context /// /// This trait implements things specific to 128-types, like merging 2 128 vectors into a 256 vector pub trait AvxVector128: AvxVector { type FullVector: AvxVector256; unsafe fn merge(lo: Self, hi: Self) -> Self::FullVector; unsafe fn zero_extend(self) -> Self::FullVector; unsafe fn lo(input: Self::FullVector) -> Self; unsafe fn hi(input: Self::FullVector) -> Self; unsafe fn split(input: Self::FullVector) -> (Self, Self) { (Self::lo(input), Self::hi(input)) } unsafe fn lo_rotation(input: Rotation90) -> Rotation90; /// Fill a vector by repeating the provided complex number as many times as possible unsafe fn broadcast_complex_elements( value: Complex<<::FullVector as AvxVector256>::ScalarType>, ) -> Self; // Gather 2 complex numbers (for f32) or 1 complex number (for f64) using 2 i32 indexes (for gather32) or 2 i64 indexes (for gather64). // For f32, there should be 1 index per complex. For f64, there should be 2 indexes, each duplicated // (So to load the complex at index 5, the index vector should contain 5,5. this api sucks but it's internal so whatever.) 
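// Standalone sketch of the Good-Thomas index maps that column_butterfly6 and
// column_butterfly12 hard-code above, written for general coprime sizes (n1, n2)
// (hypothetical helper, not part of this trait). Input element (row i, column j) of the
// n1 x n2 array is read from index (n2*i + n1*j) mod n, and output (i, j) lands at the
// unique k with k = i (mod n1) and k = j (mod n2); this CRT-based reordering is what lets
// these butterflies skip twiddle factors entirely. For (n1, n2) = (3, 2) it reproduces the
// [0, 2, 4] / [3, 5, 1] input columns and the output shuffle used in column_butterfly6.
fn good_thomas_maps(n1: usize, n2: usize) -> (Vec<usize>, Vec<usize>) {
    let n = n1 * n2;
    let mut input_map = vec![0; n];  // input_map[j * n1 + i] = which input index to read
    let mut output_map = vec![0; n]; // output_map[j * n1 + i] = where that column result goes
    for j in 0..n2 {
        for i in 0..n1 {
            input_map[j * n1 + i] = (n2 * i + n1 * j) % n;
            output_map[j * n1 + i] = (0..n).find(|&k| k % n1 == i && k % n2 == j).unwrap();
        }
    }
    (input_map, output_map)
}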
unsafe fn gather32_complex_avx2( ptr: *const Complex<::ScalarType>, indexes: __m128i, ) -> Self; unsafe fn gather64_complex_avx2( ptr: *const Complex<::ScalarType>, indexes: __m128i, ) -> Self; #[inline(always)] unsafe fn column_butterfly6(rows: [Self; 6], twiddles: Self::FullVector) -> [Self; 6] { // Algorithm: 3x2 good-thomas // if we merge some of our 128 registers into 256 registers, we can do 1 inner butterfly3 instead of 2 let rows03 = Self::merge(rows[0], rows[3]); let rows25 = Self::merge(rows[2], rows[5]); let rows41 = Self::merge(rows[4], rows[1]); // Size-3 FFTs down the columns of our reordered array let mid = Self::FullVector::column_butterfly3([rows03, rows25, rows41], twiddles); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // we can't use our merged columns anymore. so split them back into half vectors let (mid0_0, mid1_0) = Self::split(mid[0]); let (mid0_1, mid1_1) = Self::split(mid[1]); let (mid0_2, mid1_2) = Self::split(mid[2]); // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = Self::column_butterfly2([mid0_0, mid1_0]); let [output2, output3] = Self::column_butterfly2([mid0_1, mid1_1]); let [output4, output5] = Self::column_butterfly2([mid0_2, mid1_2]); // Reorder into output [output0, output3, output4, output1, output2, output5] } #[inline(always)] unsafe fn column_butterfly9( rows: [Self; 9], twiddles_merged: [Self::FullVector; 2], butterfly3_twiddles: Self::FullVector, ) -> [Self; 9] { // Algorithm: 3x3 mixed radix // if we merge some of our 128 registers into 256 registers, we can do 2 inner butterfly3's instead of 3 let rows12 = Self::merge(rows[1], rows[2]); let rows45 = Self::merge(rows[4], rows[5]); let rows78 = Self::merge(rows[7], rows[8]); let mid0 = Self::column_butterfly3([rows[0], rows[3], rows[6]], Self::lo(butterfly3_twiddles)); let mut mid12 = Self::FullVector::column_butterfly3([rows12, rows45, rows78], butterfly3_twiddles); // Apply twiddle factors. we're applying them on the merged set of vectors, so we need slightly different twiddle factors mid12[1] = Self::FullVector::mul_complex(twiddles_merged[0], mid12[1]); mid12[2] = Self::FullVector::mul_complex(twiddles_merged[1], mid12[2]); // we can't use our merged columns anymore. so split them back into half vectors let (mid1_0, mid2_0) = Self::split(mid12[0]); let (mid1_1, mid2_1) = Self::split(mid12[1]); let (mid1_2, mid2_2) = Self::split(mid12[2]); // Re-merge our half vectors into different, transposed full vectors. 
Thankfully the compiler is smart enough to combine these inserts and extracts into permutes let transposed12 = Self::merge(mid0[1], mid0[2]); let transposed45 = Self::merge(mid1_1, mid1_2); let transposed78 = Self::merge(mid2_1, mid2_2); let [output0, output1, output2] = Self::column_butterfly3([mid0[0], mid1_0, mid2_0], Self::lo(butterfly3_twiddles)); let [output36, output47, output58] = Self::FullVector::column_butterfly3( [transposed12, transposed45, transposed78], butterfly3_twiddles, ); // Finally, extract our second set of merged columns let (output3, output6) = Self::split(output36); let (output4, output7) = Self::split(output47); let (output5, output8) = Self::split(output58); [ output0, output3, output6, output1, output4, output7, output2, output5, output8, ] } #[inline(always)] unsafe fn column_butterfly12( rows: [Self; 12], butterfly3_twiddles: Self::FullVector, rotation: Rotation90, ) -> [Self; 12] { // Algorithm: 4x3 good-thomas // if we merge some of our 128 registers into 256 registers, we can do 2 inner butterfly4's instead of 3 let rows48 = Self::merge(rows[4], rows[8]); let rows711 = Self::merge(rows[7], rows[11]); let rows102 = Self::merge(rows[10], rows[2]); let rows15 = Self::merge(rows[1], rows[5]); // Size-4 FFTs down the columns of our reordered array let mid0 = Self::column_butterfly4( [rows[0], rows[3], rows[6], rows[9]], Self::lo_rotation(rotation), ); let mid12 = Self::FullVector::column_butterfly4([rows48, rows711, rows102, rows15], rotation); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // we can't use our merged columns anymore. so split them back into half vectors let (mid1_0, mid2_0) = Self::split(mid12[0]); let (mid1_1, mid2_1) = Self::split(mid12[1]); let (mid1_2, mid2_2) = Self::split(mid12[2]); let (mid1_3, mid2_3) = Self::split(mid12[3]); // Re-merge our half vectors into different, transposed full vectors. This will let us do 2 inner butterfly 3's instead of 4! 
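        // (For reference, a sketch of the 4x3 Good-Thomas index bookkeeping behind this function:
        // input index n is read as n = (3*n1 + 4*n2) mod 12, which is why the size-4 FFTs above
        // consume the strided, rotated columns [0,3,6,9], [4,7,10,1] and [8,11,2,5] (the latter two
        // packed together into full vectors), and output position k is the CRT index satisfying
        // k mod 4 == k1 and k mod 3 == k2, where k1/k2 are the size-4/size-3 output bins. That CRT
        // remap produces the shuffled return array at the bottom of this function, and is also why
        // no twiddle factors are needed between the two passes.)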
// Thankfully the compiler is smart enough to combine these inserts and extracts into permutes let transposed03 = Self::merge(mid0[0], mid0[1]); let transposed14 = Self::merge(mid1_0, mid1_1); let transposed25 = Self::merge(mid2_0, mid2_1); let transposed69 = Self::merge(mid0[2], mid0[3]); let transposed710 = Self::merge(mid1_2, mid1_3); let transposed811 = Self::merge(mid2_2, mid2_3); // Transpose the data and do size-2 FFTs down the columns let [output03, output14, output25] = Self::FullVector::column_butterfly3( [transposed03, transposed14, transposed25], butterfly3_twiddles, ); let [output69, output710, output811] = Self::FullVector::column_butterfly3( [transposed69, transposed710, transposed811], butterfly3_twiddles, ); // Finally, extract our second set of merged columns let (output0, output3) = Self::split(output03); let (output1, output4) = Self::split(output14); let (output2, output5) = Self::split(output25); let (output6, output9) = Self::split(output69); let (output7, output10) = Self::split(output710); let (output8, output11) = Self::split(output811); [ output0, output4, output8, output9, output1, output5, output6, output10, output2, output3, output7, output11, ] } } #[inline(always)] pub unsafe fn apply_butterfly8_twiddle1(input: V, rotation: Rotation90) -> V { let rotated = input.rotate90(rotation); let combined = V::add(rotated, input); V::mul(V::half_root2(), combined) } #[inline(always)] pub unsafe fn apply_butterfly8_twiddle3(input: V, rotation: Rotation90) -> V { let rotated = input.rotate90(rotation); let combined = V::sub(rotated, input); V::mul(V::half_root2(), combined) } #[repr(transparent)] #[derive(Clone, Copy, Debug)] pub struct Rotation90(V); impl Rotation90 { #[inline(always)] pub unsafe fn lo(self) -> Rotation90 { Rotation90(self.0.lo()) } } impl AvxVector for __m256 { const SCALAR_PER_VECTOR: usize = 8; const COMPLEX_PER_VECTOR: usize = 4; #[inline(always)] unsafe fn zero() -> Self { _mm256_setzero_ps() } #[inline(always)] unsafe fn half_root2() -> Self { // note: we're computing a square root here, but checking the assembly says the compiler is smart enough to turn this into a constant _mm256_broadcast_ss(&0.5f32.sqrt()) } #[inline(always)] unsafe fn xor(left: Self, right: Self) -> Self { _mm256_xor_ps(left, right) } #[inline(always)] unsafe fn neg(self) -> Self { _mm256_xor_ps(self, _mm256_broadcast_ss(&-0.0)) } #[inline(always)] unsafe fn add(left: Self, right: Self) -> Self { _mm256_add_ps(left, right) } #[inline(always)] unsafe fn sub(left: Self, right: Self) -> Self { _mm256_sub_ps(left, right) } #[inline(always)] unsafe fn mul(left: Self, right: Self) -> Self { _mm256_mul_ps(left, right) } #[inline(always)] unsafe fn fmadd(left: Self, right: Self, add: Self) -> Self { _mm256_fmadd_ps(left, right, add) } #[inline(always)] unsafe fn fnmadd(left: Self, right: Self, add: Self) -> Self { _mm256_fnmadd_ps(left, right, add) } #[inline(always)] unsafe fn fmaddsub(left: Self, right: Self, add: Self) -> Self { _mm256_fmaddsub_ps(left, right, add) } #[inline(always)] unsafe fn fmsubadd(left: Self, right: Self, add: Self) -> Self { _mm256_fmsubadd_ps(left, right, add) } #[inline(always)] unsafe fn reverse_complex_elements(self) -> Self { // swap the elements in-lane let permuted = _mm256_permute_ps(self, 0x4E); // swap the lanes _mm256_permute2f128_ps(permuted, permuted, 0x01) } #[inline(always)] unsafe fn unpacklo_complex(rows: [Self; 2]) -> Self { let row0_double = _mm256_castps_pd(rows[0]); let row1_double = _mm256_castps_pd(rows[1]); let unpacked = 
_mm256_unpacklo_pd(row0_double, row1_double); _mm256_castpd_ps(unpacked) } #[inline(always)] unsafe fn unpackhi_complex(rows: [Self; 2]) -> Self { let row0_double = _mm256_castps_pd(rows[0]); let row1_double = _mm256_castps_pd(rows[1]); let unpacked = _mm256_unpackhi_pd(row0_double, row1_double); _mm256_castpd_ps(unpacked) } #[inline(always)] unsafe fn swap_complex_components(self) -> Self { _mm256_permute_ps(self, 0xB1) } #[inline(always)] unsafe fn duplicate_complex_components(self) -> (Self, Self) { (_mm256_moveldup_ps(self), _mm256_movehdup_ps(self)) } #[inline(always)] unsafe fn make_rotation90(direction: FftDirection) -> Rotation90 { let broadcast = match direction { FftDirection::Forward => Complex::new(-0.0, 0.0), FftDirection::Inverse => Complex::new(0.0, -0.0), }; Rotation90(Self::broadcast_complex_elements(broadcast)) } #[inline(always)] unsafe fn make_mixedradix_twiddle_chunk( x: usize, y: usize, len: usize, direction: FftDirection, ) -> Self { let mut twiddle_chunk = [Complex::::zero(); Self::COMPLEX_PER_VECTOR]; for i in 0..Self::COMPLEX_PER_VECTOR { twiddle_chunk[i] = twiddles::compute_twiddle(y * (x + i), len, direction); } twiddle_chunk.as_slice().load_complex(0) } #[inline(always)] unsafe fn broadcast_twiddle(index: usize, len: usize, direction: FftDirection) -> Self { Self::broadcast_complex_elements(twiddles::compute_twiddle(index, len, direction)) } #[inline(always)] unsafe fn transpose2_packed(rows: [Self; 2]) -> [Self; 2] { let unpacked = Self::unpack_complex(rows); let output0 = _mm256_permute2f128_ps(unpacked[0], unpacked[1], 0x20); let output1 = _mm256_permute2f128_ps(unpacked[0], unpacked[1], 0x31); [output0, output1] } #[inline(always)] unsafe fn transpose3_packed(rows: [Self; 3]) -> [Self; 3] { let unpacked0 = Self::unpacklo_complex([rows[0], rows[1]]); let unpacked2 = Self::unpackhi_complex([rows[1], rows[2]]); // output0 and output2 each need to swap some elements. 
thankfully we can blend those elements into the same intermediate value, and then do a permute 128 from there let blended = _mm256_blend_ps(rows[0], rows[2], 0x33); let output1 = _mm256_permute2f128_ps(unpacked0, unpacked2, 0x12); let output0 = _mm256_permute2f128_ps(unpacked0, blended, 0x20); let output2 = _mm256_permute2f128_ps(unpacked2, blended, 0x13); [output0, output1, output2] } #[inline(always)] unsafe fn transpose4_packed(rows: [Self; 4]) -> [Self; 4] { let permute0 = _mm256_permute2f128_ps(rows[0], rows[2], 0x20); let permute1 = _mm256_permute2f128_ps(rows[1], rows[3], 0x20); let permute2 = _mm256_permute2f128_ps(rows[0], rows[2], 0x31); let permute3 = _mm256_permute2f128_ps(rows[1], rows[3], 0x31); let [unpacked0, unpacked1] = Self::unpack_complex([permute0, permute1]); let [unpacked2, unpacked3] = Self::unpack_complex([permute2, permute3]); [unpacked0, unpacked1, unpacked2, unpacked3] } #[inline(always)] unsafe fn transpose5_packed(rows: [Self; 5]) -> [Self; 5] { let unpacked0 = Self::unpacklo_complex([rows[0], rows[1]]); let unpacked1 = Self::unpackhi_complex([rows[1], rows[2]]); let unpacked2 = Self::unpacklo_complex([rows[2], rows[3]]); let unpacked3 = Self::unpackhi_complex([rows[3], rows[4]]); let blended04 = _mm256_blend_ps(rows[0], rows[4], 0x33); [ _mm256_permute2f128_ps(unpacked0, unpacked2, 0x20), _mm256_permute2f128_ps(blended04, unpacked1, 0x20), _mm256_blend_ps(unpacked0, unpacked3, 0x0f), _mm256_permute2f128_ps(unpacked2, blended04, 0x31), _mm256_permute2f128_ps(unpacked1, unpacked3, 0x31), ] } #[inline(always)] unsafe fn transpose6_packed(rows: [Self; 6]) -> [Self; 6] { let [unpacked0, unpacked1] = Self::unpack_complex([rows[0], rows[1]]); let [unpacked2, unpacked3] = Self::unpack_complex([rows[2], rows[3]]); let [unpacked4, unpacked5] = Self::unpack_complex([rows[4], rows[5]]); [ _mm256_permute2f128_ps(unpacked0, unpacked2, 0x20), _mm256_permute2f128_ps(unpacked1, unpacked4, 0x02), _mm256_permute2f128_ps(unpacked3, unpacked5, 0x20), _mm256_permute2f128_ps(unpacked0, unpacked2, 0x31), _mm256_permute2f128_ps(unpacked1, unpacked4, 0x13), _mm256_permute2f128_ps(unpacked3, unpacked5, 0x31), ] } #[inline(always)] unsafe fn transpose7_packed(rows: [Self; 7]) -> [Self; 7] { let unpacked0 = Self::unpacklo_complex([rows[0], rows[1]]); let unpacked1 = Self::unpackhi_complex([rows[1], rows[2]]); let unpacked2 = Self::unpacklo_complex([rows[2], rows[3]]); let unpacked3 = Self::unpackhi_complex([rows[3], rows[4]]); let unpacked4 = Self::unpacklo_complex([rows[4], rows[5]]); let unpacked5 = Self::unpackhi_complex([rows[5], rows[6]]); let blended06 = _mm256_blend_ps(rows[0], rows[6], 0x33); [ _mm256_permute2f128_ps(unpacked0, unpacked2, 0x20), _mm256_permute2f128_ps(unpacked4, blended06, 0x20), _mm256_permute2f128_ps(unpacked1, unpacked3, 0x20), _mm256_blend_ps(unpacked0, unpacked5, 0x0f), _mm256_permute2f128_ps(unpacked2, unpacked4, 0x31), _mm256_permute2f128_ps(blended06, unpacked1, 0x31), _mm256_permute2f128_ps(unpacked3, unpacked5, 0x31), ] } #[inline(always)] unsafe fn transpose8_packed(rows: [Self; 8]) -> [Self; 8] { let chunk0 = [rows[0], rows[1], rows[2], rows[3]]; let chunk1 = [rows[4], rows[5], rows[6], rows[7]]; let output0 = Self::transpose4_packed(chunk0); let output1 = Self::transpose4_packed(chunk1); [ output0[0], output1[0], output0[1], output1[1], output0[2], output1[2], output0[3], output1[3], ] } #[inline(always)] unsafe fn transpose9_packed(rows: [Self; 9]) -> [Self; 9] { let unpacked0 = Self::unpacklo_complex([rows[0], rows[1]]); let unpacked1 = 
Self::unpackhi_complex([rows[1], rows[2]]); let unpacked2 = Self::unpacklo_complex([rows[2], rows[3]]); let unpacked3 = Self::unpackhi_complex([rows[3], rows[4]]); let unpacked5 = Self::unpacklo_complex([rows[4], rows[5]]); let unpacked6 = Self::unpackhi_complex([rows[5], rows[6]]); let unpacked7 = Self::unpacklo_complex([rows[6], rows[7]]); let unpacked8 = Self::unpackhi_complex([rows[7], rows[8]]); let blended9 = _mm256_blend_ps(rows[0], rows[8], 0x33); [ _mm256_permute2f128_ps(unpacked0, unpacked2, 0x20), _mm256_permute2f128_ps(unpacked5, unpacked7, 0x20), _mm256_permute2f128_ps(blended9, unpacked1, 0x20), _mm256_permute2f128_ps(unpacked3, unpacked6, 0x20), _mm256_blend_ps(unpacked0, unpacked8, 0x0f), _mm256_permute2f128_ps(unpacked2, unpacked5, 0x31), _mm256_permute2f128_ps(unpacked7, blended9, 0x31), _mm256_permute2f128_ps(unpacked1, unpacked3, 0x31), _mm256_permute2f128_ps(unpacked6, unpacked8, 0x31), ] } #[inline(always)] unsafe fn transpose11_packed(rows: [Self; 11]) -> [Self; 11] { let unpacked0 = Self::unpacklo_complex([rows[0], rows[1]]); let unpacked1 = Self::unpackhi_complex([rows[1], rows[2]]); let unpacked2 = Self::unpacklo_complex([rows[2], rows[3]]); let unpacked3 = Self::unpackhi_complex([rows[3], rows[4]]); let unpacked4 = Self::unpacklo_complex([rows[4], rows[5]]); let unpacked5 = Self::unpackhi_complex([rows[5], rows[6]]); let unpacked6 = Self::unpacklo_complex([rows[6], rows[7]]); let unpacked7 = Self::unpackhi_complex([rows[7], rows[8]]); let unpacked8 = Self::unpacklo_complex([rows[8], rows[9]]); let unpacked9 = Self::unpackhi_complex([rows[9], rows[10]]); let blended10 = _mm256_blend_ps(rows[0], rows[10], 0x33); [ _mm256_permute2f128_ps(unpacked0, unpacked2, 0x20), _mm256_permute2f128_ps(unpacked4, unpacked6, 0x20), _mm256_permute2f128_ps(unpacked8, blended10, 0x20), _mm256_permute2f128_ps(unpacked1, unpacked3, 0x20), _mm256_permute2f128_ps(unpacked5, unpacked7, 0x20), _mm256_blend_ps(unpacked0, unpacked9, 0x0f), _mm256_permute2f128_ps(unpacked2, unpacked4, 0x31), _mm256_permute2f128_ps(unpacked6, unpacked8, 0x31), _mm256_permute2f128_ps(blended10, unpacked1, 0x31), _mm256_permute2f128_ps(unpacked3, unpacked5, 0x31), _mm256_permute2f128_ps(unpacked7, unpacked9, 0x31), ] } #[inline(always)] unsafe fn transpose12_packed(rows: [Self; 12]) -> [Self; 12] { let chunk0 = [rows[0], rows[1], rows[2], rows[3]]; let chunk1 = [rows[4], rows[5], rows[6], rows[7]]; let chunk2 = [rows[8], rows[9], rows[10], rows[11]]; let output0 = Self::transpose4_packed(chunk0); let output1 = Self::transpose4_packed(chunk1); let output2 = Self::transpose4_packed(chunk2); [ output0[0], output1[0], output2[0], output0[1], output1[1], output2[1], output0[2], output1[2], output2[2], output0[3], output1[3], output2[3], ] } #[inline(always)] unsafe fn transpose16_packed(rows: [Self; 16]) -> [Self; 16] { let chunk0 = [ rows[0], rows[1], rows[2], rows[3], rows[4], rows[5], rows[6], rows[7], ]; let chunk1 = [ rows[8], rows[9], rows[10], rows[11], rows[12], rows[13], rows[14], rows[15], ]; let output0 = Self::transpose8_packed(chunk0); let output1 = Self::transpose8_packed(chunk1); [ output0[0], output0[1], output1[0], output1[1], output0[2], output0[3], output1[2], output1[3], output0[4], output0[5], output1[4], output1[5], output0[6], output0[7], output1[6], output1[7], ] } } impl AvxVector256 for __m256 { type ScalarType = f32; type HalfVector = __m128; #[inline(always)] unsafe fn lo(self) -> Self::HalfVector { _mm256_castps256_ps128(self) } #[inline(always)] unsafe fn hi(self) -> Self::HalfVector { 
_mm256_extractf128_ps(self, 1) } #[inline(always)] unsafe fn merge(lo: Self::HalfVector, hi: Self::HalfVector) -> Self { _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1) } #[inline(always)] unsafe fn broadcast_complex_elements(value: Complex) -> Self { _mm256_set_ps( value.im, value.re, value.im, value.re, value.im, value.re, value.im, value.re, ) } #[inline(always)] unsafe fn hadd_complex(self) -> Complex { let lo = self.lo(); let hi = self.hi(); let sum = _mm_add_ps(lo, hi); let shuffled_sum = Self::HalfVector::unpackhi_complex([sum, sum]); let result = _mm_add_ps(sum, shuffled_sum); let mut result_storage = [Complex::zero(); 1]; result_storage .as_mut_slice() .store_partial1_complex(result, 0); result_storage[0] } #[inline(always)] unsafe fn load_complex(ptr: *const Complex) -> Self { _mm256_loadu_ps(ptr as *const Self::ScalarType) } #[inline(always)] unsafe fn store_complex(ptr: *mut Complex, data: Self) { _mm256_storeu_ps(ptr as *mut Self::ScalarType, data) } #[inline(always)] unsafe fn gather_complex_avx2_index32( ptr: *const Complex, indexes: __m128i, ) -> Self { _mm256_castpd_ps(_mm256_i32gather_pd(ptr as *const f64, indexes, 8)) } #[inline(always)] unsafe fn gather_complex_avx2_index64( ptr: *const Complex, indexes: __m256i, ) -> Self { _mm256_castpd_ps(_mm256_i64gather_pd(ptr as *const f64, indexes, 8)) } #[inline(always)] unsafe fn load_partial1_complex(ptr: *const Complex) -> Self::HalfVector { let data = _mm_load_sd(ptr as *const f64); _mm_castpd_ps(data) } #[inline(always)] unsafe fn load_partial2_complex(ptr: *const Complex) -> Self::HalfVector { _mm_loadu_ps(ptr as *const f32) } #[inline(always)] unsafe fn load_partial3_complex(ptr: *const Complex) -> Self { let lo = Self::load_partial2_complex(ptr); let hi = Self::load_partial1_complex(ptr.add(2)); Self::merge(lo, hi) } #[inline(always)] unsafe fn store_partial1_complex(ptr: *mut Complex, data: Self::HalfVector) { _mm_store_sd(ptr as *mut f64, _mm_castps_pd(data)); } #[inline(always)] unsafe fn store_partial2_complex(ptr: *mut Complex, data: Self::HalfVector) { _mm_storeu_ps(ptr as *mut f32, data); } #[inline(always)] unsafe fn store_partial3_complex(ptr: *mut Complex, data: Self) { Self::store_partial2_complex(ptr, data.lo()); Self::store_partial1_complex(ptr.add(2), data.hi()); } } impl AvxVector for __m128 { const SCALAR_PER_VECTOR: usize = 4; const COMPLEX_PER_VECTOR: usize = 2; #[inline(always)] unsafe fn zero() -> Self { _mm_setzero_ps() } #[inline(always)] unsafe fn half_root2() -> Self { // note: we're computing a square root here, but checking the assembly says the compiler is smart enough to turn this into a constant _mm_broadcast_ss(&0.5f32.sqrt()) } #[inline(always)] unsafe fn xor(left: Self, right: Self) -> Self { _mm_xor_ps(left, right) } #[inline(always)] unsafe fn neg(self) -> Self { _mm_xor_ps(self, _mm_broadcast_ss(&-0.0)) } #[inline(always)] unsafe fn add(left: Self, right: Self) -> Self { _mm_add_ps(left, right) } #[inline(always)] unsafe fn sub(left: Self, right: Self) -> Self { _mm_sub_ps(left, right) } #[inline(always)] unsafe fn mul(left: Self, right: Self) -> Self { _mm_mul_ps(left, right) } #[inline(always)] unsafe fn fmadd(left: Self, right: Self, add: Self) -> Self { _mm_fmadd_ps(left, right, add) } #[inline(always)] unsafe fn fnmadd(left: Self, right: Self, add: Self) -> Self { _mm_fnmadd_ps(left, right, add) } #[inline(always)] unsafe fn fmaddsub(left: Self, right: Self, add: Self) -> Self { _mm_fmaddsub_ps(left, right, add) } #[inline(always)] unsafe fn fmsubadd(left: Self, right: 
Self, add: Self) -> Self { _mm_fmsubadd_ps(left, right, add) } #[inline(always)] unsafe fn reverse_complex_elements(self) -> Self { // swap the elements in-lane _mm_permute_ps(self, 0x4E) } #[inline(always)] unsafe fn unpacklo_complex(rows: [Self; 2]) -> Self { let row0_double = _mm_castps_pd(rows[0]); let row1_double = _mm_castps_pd(rows[1]); let unpacked = _mm_unpacklo_pd(row0_double, row1_double); _mm_castpd_ps(unpacked) } #[inline(always)] unsafe fn unpackhi_complex(rows: [Self; 2]) -> Self { let row0_double = _mm_castps_pd(rows[0]); let row1_double = _mm_castps_pd(rows[1]); let unpacked = _mm_unpackhi_pd(row0_double, row1_double); _mm_castpd_ps(unpacked) } #[inline(always)] unsafe fn swap_complex_components(self) -> Self { _mm_permute_ps(self, 0xB1) } #[inline(always)] unsafe fn duplicate_complex_components(self) -> (Self, Self) { (_mm_moveldup_ps(self), _mm_movehdup_ps(self)) } #[inline(always)] unsafe fn make_rotation90(direction: FftDirection) -> Rotation90 { let broadcast = match direction { FftDirection::Forward => Complex::new(-0.0, 0.0), FftDirection::Inverse => Complex::new(0.0, -0.0), }; Rotation90(Self::broadcast_complex_elements(broadcast)) } #[inline(always)] unsafe fn make_mixedradix_twiddle_chunk( x: usize, y: usize, len: usize, direction: FftDirection, ) -> Self { let mut twiddle_chunk = [Complex::::zero(); Self::COMPLEX_PER_VECTOR]; for i in 0..Self::COMPLEX_PER_VECTOR { twiddle_chunk[i] = twiddles::compute_twiddle(y * (x + i), len, direction); } _mm_loadu_ps(twiddle_chunk.as_ptr() as *const f32) } #[inline(always)] unsafe fn broadcast_twiddle(index: usize, len: usize, direction: FftDirection) -> Self { Self::broadcast_complex_elements(twiddles::compute_twiddle(index, len, direction)) } #[inline(always)] unsafe fn transpose2_packed(rows: [Self; 2]) -> [Self; 2] { Self::unpack_complex(rows) } #[inline(always)] unsafe fn transpose3_packed(rows: [Self; 3]) -> [Self; 3] { let unpacked0 = Self::unpacklo_complex([rows[0], rows[1]]); let blended = _mm_blend_ps(rows[0], rows[2], 0x03); let unpacked2 = Self::unpackhi_complex([rows[1], rows[2]]); [unpacked0, blended, unpacked2] } #[inline(always)] unsafe fn transpose4_packed(rows: [Self; 4]) -> [Self; 4] { let [unpacked0, unpacked1] = Self::unpack_complex([rows[0], rows[1]]); let [unpacked2, unpacked3] = Self::unpack_complex([rows[2], rows[3]]); [unpacked0, unpacked2, unpacked1, unpacked3] } #[inline(always)] unsafe fn transpose5_packed(rows: [Self; 5]) -> [Self; 5] { [ Self::unpacklo_complex([rows[0], rows[1]]), Self::unpacklo_complex([rows[2], rows[3]]), _mm_blend_ps(rows[0], rows[4], 0x03), Self::unpackhi_complex([rows[1], rows[2]]), Self::unpackhi_complex([rows[3], rows[4]]), ] } #[inline(always)] unsafe fn transpose6_packed(rows: [Self; 6]) -> [Self; 6] { let [unpacked0, unpacked1] = Self::unpack_complex([rows[0], rows[1]]); let [unpacked2, unpacked3] = Self::unpack_complex([rows[2], rows[3]]); let [unpacked4, unpacked5] = Self::unpack_complex([rows[4], rows[5]]); [ unpacked0, unpacked2, unpacked4, unpacked1, unpacked3, unpacked5, ] } #[inline(always)] unsafe fn transpose7_packed(rows: [Self; 7]) -> [Self; 7] { [ Self::unpacklo_complex([rows[0], rows[1]]), Self::unpacklo_complex([rows[2], rows[3]]), Self::unpacklo_complex([rows[4], rows[5]]), _mm_shuffle_ps(rows[6], rows[0], 0xE4), Self::unpackhi_complex([rows[1], rows[2]]), Self::unpackhi_complex([rows[3], rows[4]]), Self::unpackhi_complex([rows[5], rows[6]]), ] } #[inline(always)] unsafe fn transpose8_packed(rows: [Self; 8]) -> [Self; 8] { let chunk0 = [rows[0], rows[1], 
rows[2], rows[3]]; let chunk1 = [rows[4], rows[5], rows[6], rows[7]]; let output0 = Self::transpose4_packed(chunk0); let output1 = Self::transpose4_packed(chunk1); [ output0[0], output0[1], output1[0], output1[1], output0[2], output0[3], output1[2], output1[3], ] } #[inline(always)] unsafe fn transpose9_packed(rows: [Self; 9]) -> [Self; 9] { [ Self::unpacklo_complex([rows[0], rows[1]]), Self::unpacklo_complex([rows[2], rows[3]]), Self::unpacklo_complex([rows[4], rows[5]]), Self::unpacklo_complex([rows[6], rows[7]]), _mm_shuffle_ps(rows[8], rows[0], 0xE4), Self::unpackhi_complex([rows[1], rows[2]]), Self::unpackhi_complex([rows[3], rows[4]]), Self::unpackhi_complex([rows[5], rows[6]]), Self::unpackhi_complex([rows[7], rows[8]]), ] } #[inline(always)] unsafe fn transpose11_packed(rows: [Self; 11]) -> [Self; 11] { [ Self::unpacklo_complex([rows[0], rows[1]]), Self::unpacklo_complex([rows[2], rows[3]]), Self::unpacklo_complex([rows[4], rows[5]]), Self::unpacklo_complex([rows[6], rows[7]]), Self::unpacklo_complex([rows[8], rows[9]]), _mm_shuffle_ps(rows[10], rows[0], 0xE4), Self::unpackhi_complex([rows[1], rows[2]]), Self::unpackhi_complex([rows[3], rows[4]]), Self::unpackhi_complex([rows[5], rows[6]]), Self::unpackhi_complex([rows[7], rows[8]]), Self::unpackhi_complex([rows[9], rows[10]]), ] } #[inline(always)] unsafe fn transpose12_packed(rows: [Self; 12]) -> [Self; 12] { let chunk0 = [rows[0], rows[1], rows[2], rows[3]]; let chunk1 = [rows[4], rows[5], rows[6], rows[7]]; let chunk2 = [rows[8], rows[9], rows[10], rows[11]]; let output0 = Self::transpose4_packed(chunk0); let output1 = Self::transpose4_packed(chunk1); let output2 = Self::transpose4_packed(chunk2); [ output0[0], output0[1], output1[0], output1[1], output2[0], output2[1], output0[2], output0[3], output1[2], output1[3], output2[2], output2[3], ] } #[inline(always)] unsafe fn transpose16_packed(rows: [Self; 16]) -> [Self; 16] { let chunk0 = [ rows[0], rows[1], rows[2], rows[3], rows[4], rows[5], rows[6], rows[7], ]; let chunk1 = [ rows[8], rows[9], rows[10], rows[11], rows[12], rows[13], rows[14], rows[15], ]; let output0 = Self::transpose8_packed(chunk0); let output1 = Self::transpose8_packed(chunk1); [ output0[0], output0[1], output0[2], output0[3], output1[0], output1[1], output1[2], output1[3], output0[4], output0[5], output0[6], output0[7], output1[4], output1[5], output1[6], output1[7], ] } } impl AvxVector128 for __m128 { type FullVector = __m256; #[inline(always)] unsafe fn lo(input: Self::FullVector) -> Self { _mm256_castps256_ps128(input) } #[inline(always)] unsafe fn hi(input: Self::FullVector) -> Self { _mm256_extractf128_ps(input, 1) } #[inline(always)] unsafe fn merge(lo: Self, hi: Self) -> Self::FullVector { _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1) } #[inline(always)] unsafe fn zero_extend(self) -> Self::FullVector { _mm256_zextps128_ps256(self) } #[inline(always)] unsafe fn lo_rotation(input: Rotation90) -> Rotation90 { input.lo() } #[inline(always)] unsafe fn broadcast_complex_elements(value: Complex) -> Self { _mm_set_ps(value.im, value.re, value.im, value.re) } #[inline(always)] unsafe fn gather32_complex_avx2(ptr: *const Complex, indexes: __m128i) -> Self { _mm_castpd_ps(_mm_i32gather_pd(ptr as *const f64, indexes, 8)) } #[inline(always)] unsafe fn gather64_complex_avx2(ptr: *const Complex, indexes: __m128i) -> Self { _mm_castpd_ps(_mm_i64gather_pd(ptr as *const f64, indexes, 8)) } } impl AvxVector for __m256d { const SCALAR_PER_VECTOR: usize = 4; const COMPLEX_PER_VECTOR: usize = 2; 
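    // Layout note: a __m256d holds 4 f64 scalars, i.e. 2 interleaved Complex<f64> values laid out
    // as [re0, im0, re1, im1], which is where SCALAR_PER_VECTOR = 4 and COMPLEX_PER_VECTOR = 2
    // come from (the f32 __m256 impl above packs 4 complexes per vector in the same interleaved way).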
#[inline(always)] unsafe fn zero() -> Self { _mm256_setzero_pd() } #[inline(always)] unsafe fn half_root2() -> Self { // note: we're computing a square root here, but checking the assembly says the compiler is smart enough to turn this into a constant _mm256_broadcast_sd(&0.5f64.sqrt()) } #[inline(always)] unsafe fn xor(left: Self, right: Self) -> Self { _mm256_xor_pd(left, right) } #[inline(always)] unsafe fn neg(self) -> Self { _mm256_xor_pd(self, _mm256_broadcast_sd(&-0.0)) } #[inline(always)] unsafe fn add(left: Self, right: Self) -> Self { _mm256_add_pd(left, right) } #[inline(always)] unsafe fn sub(left: Self, right: Self) -> Self { _mm256_sub_pd(left, right) } #[inline(always)] unsafe fn mul(left: Self, right: Self) -> Self { _mm256_mul_pd(left, right) } #[inline(always)] unsafe fn fmadd(left: Self, right: Self, add: Self) -> Self { _mm256_fmadd_pd(left, right, add) } #[inline(always)] unsafe fn fnmadd(left: Self, right: Self, add: Self) -> Self { _mm256_fnmadd_pd(left, right, add) } #[inline(always)] unsafe fn fmaddsub(left: Self, right: Self, add: Self) -> Self { _mm256_fmaddsub_pd(left, right, add) } #[inline(always)] unsafe fn fmsubadd(left: Self, right: Self, add: Self) -> Self { _mm256_fmsubadd_pd(left, right, add) } #[inline(always)] unsafe fn reverse_complex_elements(self) -> Self { _mm256_permute2f128_pd(self, self, 0x01) } #[inline(always)] unsafe fn unpacklo_complex(rows: [Self; 2]) -> Self { _mm256_permute2f128_pd(rows[0], rows[1], 0x20) } #[inline(always)] unsafe fn unpackhi_complex(rows: [Self; 2]) -> Self { _mm256_permute2f128_pd(rows[0], rows[1], 0x31) } #[inline(always)] unsafe fn swap_complex_components(self) -> Self { _mm256_permute_pd(self, 0x05) } #[inline(always)] unsafe fn duplicate_complex_components(self) -> (Self, Self) { (_mm256_movedup_pd(self), _mm256_permute_pd(self, 0x0F)) } #[inline(always)] unsafe fn make_rotation90(direction: FftDirection) -> Rotation90 { let broadcast = match direction { FftDirection::Forward => Complex::new(-0.0, 0.0), FftDirection::Inverse => Complex::new(0.0, -0.0), }; Rotation90(Self::broadcast_complex_elements(broadcast)) } #[inline(always)] unsafe fn make_mixedradix_twiddle_chunk( x: usize, y: usize, len: usize, direction: FftDirection, ) -> Self { let mut twiddle_chunk = [Complex::::zero(); Self::COMPLEX_PER_VECTOR]; for i in 0..Self::COMPLEX_PER_VECTOR { twiddle_chunk[i] = twiddles::compute_twiddle(y * (x + i), len, direction); } twiddle_chunk.as_slice().load_complex(0) } #[inline(always)] unsafe fn broadcast_twiddle(index: usize, len: usize, direction: FftDirection) -> Self { Self::broadcast_complex_elements(twiddles::compute_twiddle(index, len, direction)) } #[inline(always)] unsafe fn transpose2_packed(rows: [Self; 2]) -> [Self; 2] { Self::unpack_complex(rows) } #[inline(always)] unsafe fn transpose3_packed(rows: [Self; 3]) -> [Self; 3] { let unpacked0 = Self::unpacklo_complex([rows[0], rows[1]]); let blended = _mm256_blend_pd(rows[0], rows[2], 0x03); let unpacked2 = Self::unpackhi_complex([rows[1], rows[2]]); [unpacked0, blended, unpacked2] } #[inline(always)] unsafe fn transpose4_packed(rows: [Self; 4]) -> [Self; 4] { let [unpacked0, unpacked1] = Self::unpack_complex([rows[0], rows[1]]); let [unpacked2, unpacked3] = Self::unpack_complex([rows[2], rows[3]]); [unpacked0, unpacked2, unpacked1, unpacked3] } #[inline(always)] unsafe fn transpose5_packed(rows: [Self; 5]) -> [Self; 5] { [ Self::unpacklo_complex([rows[0], rows[1]]), Self::unpacklo_complex([rows[2], rows[3]]), _mm256_blend_pd(rows[0], rows[4], 0x03), 
Self::unpackhi_complex([rows[1], rows[2]]), Self::unpackhi_complex([rows[3], rows[4]]), ] } #[inline(always)] unsafe fn transpose6_packed(rows: [Self; 6]) -> [Self; 6] { let [unpacked0, unpacked1] = Self::unpack_complex([rows[0], rows[1]]); let [unpacked2, unpacked3] = Self::unpack_complex([rows[2], rows[3]]); let [unpacked4, unpacked5] = Self::unpack_complex([rows[4], rows[5]]); [ unpacked0, unpacked2, unpacked4, unpacked1, unpacked3, unpacked5, ] } #[inline(always)] unsafe fn transpose7_packed(rows: [Self; 7]) -> [Self; 7] { [ Self::unpacklo_complex([rows[0], rows[1]]), Self::unpacklo_complex([rows[2], rows[3]]), Self::unpacklo_complex([rows[4], rows[5]]), _mm256_blend_pd(rows[0], rows[6], 0x03), Self::unpackhi_complex([rows[1], rows[2]]), Self::unpackhi_complex([rows[3], rows[4]]), Self::unpackhi_complex([rows[5], rows[6]]), ] } #[inline(always)] unsafe fn transpose8_packed(rows: [Self; 8]) -> [Self; 8] { let chunk0 = [rows[0], rows[1], rows[2], rows[3]]; let chunk1 = [rows[4], rows[5], rows[6], rows[7]]; let output0 = Self::transpose4_packed(chunk0); let output1 = Self::transpose4_packed(chunk1); [ output0[0], output0[1], output1[0], output1[1], output0[2], output0[3], output1[2], output1[3], ] } #[inline(always)] unsafe fn transpose9_packed(rows: [Self; 9]) -> [Self; 9] { [ _mm256_permute2f128_pd(rows[0], rows[1], 0x20), _mm256_permute2f128_pd(rows[2], rows[3], 0x20), _mm256_permute2f128_pd(rows[4], rows[5], 0x20), _mm256_permute2f128_pd(rows[6], rows[7], 0x20), _mm256_permute2f128_pd(rows[8], rows[0], 0x30), _mm256_permute2f128_pd(rows[1], rows[2], 0x31), _mm256_permute2f128_pd(rows[3], rows[4], 0x31), _mm256_permute2f128_pd(rows[5], rows[6], 0x31), _mm256_permute2f128_pd(rows[7], rows[8], 0x31), ] } #[inline(always)] unsafe fn transpose11_packed(rows: [Self; 11]) -> [Self; 11] { [ _mm256_permute2f128_pd(rows[0], rows[1], 0x20), _mm256_permute2f128_pd(rows[2], rows[3], 0x20), _mm256_permute2f128_pd(rows[4], rows[5], 0x20), _mm256_permute2f128_pd(rows[6], rows[7], 0x20), _mm256_permute2f128_pd(rows[8], rows[9], 0x20), _mm256_permute2f128_pd(rows[10], rows[0], 0x30), _mm256_permute2f128_pd(rows[1], rows[2], 0x31), _mm256_permute2f128_pd(rows[3], rows[4], 0x31), _mm256_permute2f128_pd(rows[5], rows[6], 0x31), _mm256_permute2f128_pd(rows[7], rows[8], 0x31), _mm256_permute2f128_pd(rows[9], rows[10], 0x31), ] } #[inline(always)] unsafe fn transpose12_packed(rows: [Self; 12]) -> [Self; 12] { let chunk0 = [rows[0], rows[1], rows[2], rows[3]]; let chunk1 = [rows[4], rows[5], rows[6], rows[7]]; let chunk2 = [rows[8], rows[9], rows[10], rows[11]]; let output0 = Self::transpose4_packed(chunk0); let output1 = Self::transpose4_packed(chunk1); let output2 = Self::transpose4_packed(chunk2); [ output0[0], output0[1], output1[0], output1[1], output2[0], output2[1], output0[2], output0[3], output1[2], output1[3], output2[2], output2[3], ] } #[inline(always)] unsafe fn transpose16_packed(rows: [Self; 16]) -> [Self; 16] { let chunk0 = [ rows[0], rows[1], rows[2], rows[3], rows[4], rows[5], rows[6], rows[7], ]; let chunk1 = [ rows[8], rows[9], rows[10], rows[11], rows[12], rows[13], rows[14], rows[15], ]; let output0 = Self::transpose8_packed(chunk0); let output1 = Self::transpose8_packed(chunk1); [ output0[0], output0[1], output0[2], output0[3], output1[0], output1[1], output1[2], output1[3], output0[4], output0[5], output0[6], output0[7], output1[4], output1[5], output1[6], output1[7], ] } } impl AvxVector256 for __m256d { type ScalarType = f64; type HalfVector = __m128d; #[inline(always)] unsafe fn 
lo(self) -> Self::HalfVector { _mm256_castpd256_pd128(self) } #[inline(always)] unsafe fn hi(self) -> Self::HalfVector { _mm256_extractf128_pd(self, 1) } #[inline(always)] unsafe fn merge(lo: Self::HalfVector, hi: Self::HalfVector) -> Self { _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1) } #[inline(always)] unsafe fn broadcast_complex_elements(value: Complex) -> Self { _mm256_set_pd(value.im, value.re, value.im, value.re) } #[inline(always)] unsafe fn hadd_complex(self) -> Complex { let lo = self.lo(); let hi = self.hi(); let sum = _mm_add_pd(lo, hi); let mut result_storage = [Complex::zero(); 1]; result_storage.as_mut_slice().store_partial1_complex(sum, 0); result_storage[0] } #[inline(always)] unsafe fn load_complex(ptr: *const Complex) -> Self { _mm256_loadu_pd(ptr as *const Self::ScalarType) } #[inline(always)] unsafe fn store_complex(ptr: *mut Complex, data: Self) { _mm256_storeu_pd(ptr as *mut Self::ScalarType, data) } #[inline(always)] unsafe fn gather_complex_avx2_index32( ptr: *const Complex, indexes: __m128i, ) -> Self { let offsets = _mm_set_epi32(1, 0, 1, 0); let shifted = _mm_slli_epi32(indexes, 1); let modified_indexes = _mm_add_epi32(offsets, shifted); _mm256_i32gather_pd(ptr as *const f64, modified_indexes, 8) } #[inline(always)] unsafe fn gather_complex_avx2_index64( ptr: *const Complex, indexes: __m256i, ) -> Self { let offsets = _mm256_set_epi64x(1, 0, 1, 0); let shifted = _mm256_slli_epi64(indexes, 1); let modified_indexes = _mm256_add_epi64(offsets, shifted); _mm256_i64gather_pd(ptr as *const f64, modified_indexes, 8) } #[inline(always)] unsafe fn load_partial1_complex(ptr: *const Complex) -> Self::HalfVector { _mm_loadu_pd(ptr as *const f64) } #[inline(always)] unsafe fn load_partial2_complex(_ptr: *const Complex) -> Self::HalfVector { unimplemented!("Impossible to do a partial load of 2 complex f64's") } #[inline(always)] unsafe fn load_partial3_complex(_ptr: *const Complex) -> Self { unimplemented!("Impossible to do a partial load of 3 complex f64's") } #[inline(always)] unsafe fn store_partial1_complex(ptr: *mut Complex, data: Self::HalfVector) { _mm_storeu_pd(ptr as *mut f64, data); } #[inline(always)] unsafe fn store_partial2_complex( _ptr: *mut Complex, _data: Self::HalfVector, ) { unimplemented!("Impossible to do a partial store of 2 complex f64's") } #[inline(always)] unsafe fn store_partial3_complex(_ptr: *mut Complex, _data: Self) { unimplemented!("Impossible to do a partial store of 3 complex f64's") } } impl AvxVector for __m128d { const SCALAR_PER_VECTOR: usize = 2; const COMPLEX_PER_VECTOR: usize = 1; #[inline(always)] unsafe fn zero() -> Self { _mm_setzero_pd() } #[inline(always)] unsafe fn half_root2() -> Self { // note: we're computing a square root here, but checking the assembly says the compiler is smart enough to turn this into a constant _mm_load1_pd(&0.5f64.sqrt()) } #[inline(always)] unsafe fn xor(left: Self, right: Self) -> Self { _mm_xor_pd(left, right) } #[inline(always)] unsafe fn neg(self) -> Self { _mm_xor_pd(self, _mm_load1_pd(&-0.0)) } #[inline(always)] unsafe fn add(left: Self, right: Self) -> Self { _mm_add_pd(left, right) } #[inline(always)] unsafe fn sub(left: Self, right: Self) -> Self { _mm_sub_pd(left, right) } #[inline(always)] unsafe fn mul(left: Self, right: Self) -> Self { _mm_mul_pd(left, right) } #[inline(always)] unsafe fn fmadd(left: Self, right: Self, add: Self) -> Self { _mm_fmadd_pd(left, right, add) } #[inline(always)] unsafe fn fnmadd(left: Self, right: Self, add: Self) -> Self { _mm_fnmadd_pd(left, right, 
add) } #[inline(always)] unsafe fn fmaddsub(left: Self, right: Self, add: Self) -> Self { _mm_fmaddsub_pd(left, right, add) } #[inline(always)] unsafe fn fmsubadd(left: Self, right: Self, add: Self) -> Self { _mm_fmsubadd_pd(left, right, add) } #[inline(always)] unsafe fn reverse_complex_elements(self) -> Self { // nothing to reverse self } #[inline(always)] unsafe fn unpacklo_complex(_rows: [Self; 2]) -> Self { unimplemented!(); // this operation doesn't make sense with one element. TODO: I don't know if it would be more useful to error here or to just return the inputs unchanged. If returning the inputs is useful, do that. } #[inline(always)] unsafe fn unpackhi_complex(_rows: [Self; 2]) -> Self { unimplemented!(); // this operation doesn't make sense with one element. TODO: I don't know if it would be more useful to error here or to just return the inputs unchanged. If returning the inputs is useful, do that. } #[inline(always)] unsafe fn swap_complex_components(self) -> Self { _mm_permute_pd(self, 0x01) } #[inline(always)] unsafe fn duplicate_complex_components(self) -> (Self, Self) { (_mm_movedup_pd(self), _mm_permute_pd(self, 0x03)) } #[inline(always)] unsafe fn make_rotation90(direction: FftDirection) -> Rotation90 { let broadcast = match direction { FftDirection::Forward => Complex::new(-0.0, 0.0), FftDirection::Inverse => Complex::new(0.0, -0.0), }; Rotation90(Self::broadcast_complex_elements(broadcast)) } #[inline(always)] unsafe fn make_mixedradix_twiddle_chunk( x: usize, y: usize, len: usize, direction: FftDirection, ) -> Self { let mut twiddle_chunk = [Complex::::zero(); Self::COMPLEX_PER_VECTOR]; for i in 0..Self::COMPLEX_PER_VECTOR { twiddle_chunk[i] = twiddles::compute_twiddle(y * (x + i), len, direction); } _mm_loadu_pd(twiddle_chunk.as_ptr() as *const f64) } #[inline(always)] unsafe fn broadcast_twiddle(index: usize, len: usize, direction: FftDirection) -> Self { Self::broadcast_complex_elements(twiddles::compute_twiddle(index, len, direction)) } #[inline(always)] unsafe fn transpose2_packed(rows: [Self; 2]) -> [Self; 2] { rows } #[inline(always)] unsafe fn transpose3_packed(rows: [Self; 3]) -> [Self; 3] { rows } #[inline(always)] unsafe fn transpose4_packed(rows: [Self; 4]) -> [Self; 4] { rows } #[inline(always)] unsafe fn transpose5_packed(rows: [Self; 5]) -> [Self; 5] { rows } #[inline(always)] unsafe fn transpose6_packed(rows: [Self; 6]) -> [Self; 6] { rows } #[inline(always)] unsafe fn transpose7_packed(rows: [Self; 7]) -> [Self; 7] { rows } #[inline(always)] unsafe fn transpose8_packed(rows: [Self; 8]) -> [Self; 8] { rows } #[inline(always)] unsafe fn transpose9_packed(rows: [Self; 9]) -> [Self; 9] { rows } #[inline(always)] unsafe fn transpose11_packed(rows: [Self; 11]) -> [Self; 11] { rows } #[inline(always)] unsafe fn transpose12_packed(rows: [Self; 12]) -> [Self; 12] { rows } #[inline(always)] unsafe fn transpose16_packed(rows: [Self; 16]) -> [Self; 16] { rows } } impl AvxVector128 for __m128d { type FullVector = __m256d; #[inline(always)] unsafe fn lo(input: Self::FullVector) -> Self { _mm256_castpd256_pd128(input) } #[inline(always)] unsafe fn hi(input: Self::FullVector) -> Self { _mm256_extractf128_pd(input, 1) } #[inline(always)] unsafe fn merge(lo: Self, hi: Self) -> Self::FullVector { _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1) } #[inline(always)] unsafe fn zero_extend(self) -> Self::FullVector { _mm256_zextpd128_pd256(self) } #[inline(always)] unsafe fn lo_rotation(input: Rotation90) -> Rotation90 { input.lo() } #[inline(always)] unsafe fn 
broadcast_complex_elements(value: Complex) -> Self { _mm_set_pd(value.im, value.re) } #[inline(always)] unsafe fn gather32_complex_avx2(ptr: *const Complex, indexes: __m128i) -> Self { let mut index_storage: [i32; 4] = [0; 4]; _mm_storeu_si128(index_storage.as_mut_ptr() as *mut __m128i, indexes); _mm_loadu_pd(ptr.offset(index_storage[0] as isize) as *const f64) } #[inline(always)] unsafe fn gather64_complex_avx2(ptr: *const Complex, indexes: __m128i) -> Self { let mut index_storage: [i64; 4] = [0; 4]; _mm_storeu_si128(index_storage.as_mut_ptr() as *mut __m128i, indexes); _mm_loadu_pd(ptr.offset(index_storage[0] as isize) as *const f64) } } pub trait AvxArray: Deref { unsafe fn load_complex(&self, index: usize) -> T::VectorType; unsafe fn load_partial1_complex( &self, index: usize, ) -> ::HalfVector; unsafe fn load_partial2_complex( &self, index: usize, ) -> ::HalfVector; unsafe fn load_partial3_complex(&self, index: usize) -> T::VectorType; // some avx operations need bespoke one-off things that don't fit into the methods above, so we should provide an escape hatch for them fn input_ptr(&self) -> *const Complex; } pub trait AvxArrayMut: AvxArray + DerefMut { unsafe fn store_complex(&mut self, data: T::VectorType, index: usize); unsafe fn store_partial1_complex( &mut self, data: ::HalfVector, index: usize, ); unsafe fn store_partial2_complex( &mut self, data: ::HalfVector, index: usize, ); unsafe fn store_partial3_complex(&mut self, data: T::VectorType, index: usize); // some avx operations need bespoke one-off things that don't fit into the methods above, so we should provide an escape hatch for them fn output_ptr(&mut self) -> *mut Complex; } impl AvxArray for &[Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> T::VectorType { debug_assert!(self.len() >= index + T::VectorType::COMPLEX_PER_VECTOR); T::VectorType::load_complex(self.as_ptr().add(index)) } #[inline(always)] unsafe fn load_partial1_complex( &self, index: usize, ) -> ::HalfVector { debug_assert!(self.len() >= index + 1); T::VectorType::load_partial1_complex(self.as_ptr().add(index)) } #[inline(always)] unsafe fn load_partial2_complex( &self, index: usize, ) -> ::HalfVector { debug_assert!(self.len() >= index + 2); T::VectorType::load_partial2_complex(self.as_ptr().add(index)) } #[inline(always)] unsafe fn load_partial3_complex(&self, index: usize) -> T::VectorType { debug_assert!(self.len() >= index + 3); T::VectorType::load_partial3_complex(self.as_ptr().add(index)) } #[inline(always)] fn input_ptr(&self) -> *const Complex { self.as_ptr() } } impl AvxArray for &mut [Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> T::VectorType { debug_assert!(self.len() >= index + T::VectorType::COMPLEX_PER_VECTOR); T::VectorType::load_complex(self.as_ptr().add(index)) } #[inline(always)] unsafe fn load_partial1_complex( &self, index: usize, ) -> ::HalfVector { debug_assert!(self.len() >= index + 1); T::VectorType::load_partial1_complex(self.as_ptr().add(index)) } #[inline(always)] unsafe fn load_partial2_complex( &self, index: usize, ) -> ::HalfVector { debug_assert!(self.len() >= index + 2); T::VectorType::load_partial2_complex(self.as_ptr().add(index)) } #[inline(always)] unsafe fn load_partial3_complex(&self, index: usize) -> T::VectorType { debug_assert!(self.len() >= index + 3); T::VectorType::load_partial3_complex(self.as_ptr().add(index)) } #[inline(always)] fn input_ptr(&self) -> *const Complex { self.as_ptr() } } impl<'a, T: AvxNum> AvxArray for DoubleBuf<'a, T> where &'a 
[Complex]: AvxArray, { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> T::VectorType { self.input.load_complex(index) } #[inline(always)] unsafe fn load_partial1_complex( &self, index: usize, ) -> ::HalfVector { self.input.load_partial1_complex(index) } #[inline(always)] unsafe fn load_partial2_complex( &self, index: usize, ) -> ::HalfVector { self.input.load_partial2_complex(index) } #[inline(always)] unsafe fn load_partial3_complex(&self, index: usize) -> T::VectorType { self.input.load_partial3_complex(index) } #[inline(always)] fn input_ptr(&self) -> *const Complex { self.input.input_ptr() } } impl AvxArrayMut for &mut [Complex] { #[inline(always)] unsafe fn store_complex(&mut self, data: T::VectorType, index: usize) { debug_assert!(self.len() >= index + T::VectorType::COMPLEX_PER_VECTOR); T::VectorType::store_complex(self.as_mut_ptr().add(index), data); } #[inline(always)] unsafe fn store_partial1_complex( &mut self, data: ::HalfVector, index: usize, ) { debug_assert!(self.len() >= index + 1); T::VectorType::store_partial1_complex(self.as_mut_ptr().add(index), data); } #[inline(always)] unsafe fn store_partial2_complex( &mut self, data: ::HalfVector, index: usize, ) { debug_assert!(self.len() >= index + 2); T::VectorType::store_partial2_complex(self.as_mut_ptr().add(index), data); } #[inline(always)] unsafe fn store_partial3_complex(&mut self, data: T::VectorType, index: usize) { debug_assert!(self.len() >= index + 3); T::VectorType::store_partial3_complex(self.as_mut_ptr().add(index), data); } #[inline(always)] fn output_ptr(&mut self) -> *mut Complex { self.as_mut_ptr() } } impl<'a, T: AvxNum> AvxArrayMut for DoubleBuf<'a, T> where Self: AvxArray, &'a mut [Complex]: AvxArrayMut, { #[inline(always)] unsafe fn store_complex(&mut self, data: T::VectorType, index: usize) { self.output.store_complex(data, index); } #[inline(always)] unsafe fn store_partial1_complex( &mut self, data: ::HalfVector, index: usize, ) { self.output.store_partial1_complex(data, index); } #[inline(always)] unsafe fn store_partial2_complex( &mut self, data: ::HalfVector, index: usize, ) { self.output.store_partial2_complex(data, index); } #[inline(always)] unsafe fn store_partial3_complex(&mut self, data: T::VectorType, index: usize) { self.output.store_partial3_complex(data, index); } #[inline(always)] fn output_ptr(&mut self) -> *mut Complex { self.output.output_ptr() } } // A custom butterfly-16 function that calls a lambda to load/store data instead of taking an array // This is particularly useful for butterfly 16, because the whole problem doesn't fit into registers, and the compiler isn't smart enough to only load data when it's needed // So the version that takes an array ends up loading data and immediately re-storing it on the stack. By lazily loading and storing exactly when we need to, we can avoid some data reshuffling macro_rules! 
column_butterfly16_loadfn{ ($load_expr: expr, $store_expr: expr, $twiddles: expr, $rotation: expr) => ( // Size-4 FFTs down the columns let input1 = [$load_expr(1), $load_expr(5), $load_expr(9), $load_expr(13)]; let mut mid1 = AvxVector::column_butterfly4(input1, $rotation); mid1[1] = AvxVector::mul_complex(mid1[1], $twiddles[0]); mid1[2] = avx_vector::apply_butterfly8_twiddle1(mid1[2], $rotation); mid1[3] = AvxVector::mul_complex(mid1[3], $twiddles[1]); let input2 = [$load_expr(2), $load_expr(6), $load_expr(10), $load_expr(14)]; let mut mid2 = AvxVector::column_butterfly4(input2, $rotation); mid2[1] = avx_vector::apply_butterfly8_twiddle1(mid2[1], $rotation); mid2[2] = mid2[2].rotate90($rotation); mid2[3] = avx_vector::apply_butterfly8_twiddle3(mid2[3], $rotation); let input3 = [$load_expr(3), $load_expr(7), $load_expr(11), $load_expr(15)]; let mut mid3 = AvxVector::column_butterfly4(input3, $rotation); mid3[1] = AvxVector::mul_complex(mid3[1], $twiddles[1]); mid3[2] = avx_vector::apply_butterfly8_twiddle3(mid3[2], $rotation); mid3[3] = AvxVector::mul_complex(mid3[3], $twiddles[0].neg()); // do the first row last, because it doesn't need twiddles and therefore requires fewer intermediates let input0 = [$load_expr(0), $load_expr(4), $load_expr(8), $load_expr(12)]; let mid0 = AvxVector::column_butterfly4(input0, $rotation); // All of the data is now in the right format to just do a bunch of butterfly 8's. // Write the data out to the final output as we go so that the compiler can stop worrying about finding stack space for it for i in 0..4 { let output = AvxVector::column_butterfly4([mid0[i], mid1[i], mid2[i], mid3[i]], $rotation); $store_expr(output[0], i); $store_expr(output[1], i + 4); $store_expr(output[2], i + 8); $store_expr(output[3], i + 12); } ) } // A custom butterfly-32 function that calls a lambda to load/store data instead of taking an array // This is particularly useful for butterfly 32, because the whole problem doesn't fit into registers, and the compiler isn't smart enough to only load data when it's needed // So the version that takes an array ends up loading data and immediately re-storing it on the stack. By lazily loading and storing exactly when we need to, we can avoid some data reshuffling macro_rules! 
column_butterfly32_loadfn{ ($load_expr: expr, $store_expr: expr, $twiddles: expr, $rotation: expr) => ( // Size-4 FFTs down the columns let input1 = [$load_expr(1), $load_expr(9), $load_expr(17), $load_expr(25)]; let mut mid1 = AvxVector::column_butterfly4(input1, $rotation); mid1[1] = AvxVector::mul_complex(mid1[1], $twiddles[0]); mid1[2] = AvxVector::mul_complex(mid1[2], $twiddles[1]); mid1[3] = AvxVector::mul_complex(mid1[3], $twiddles[2]); let input2 = [$load_expr(2), $load_expr(10), $load_expr(18), $load_expr(26)]; let mut mid2 = AvxVector::column_butterfly4(input2, $rotation); mid2[1] = AvxVector::mul_complex(mid2[1], $twiddles[1]); mid2[2] = avx_vector::apply_butterfly8_twiddle1(mid2[2], $rotation); mid2[3] = AvxVector::mul_complex(mid2[3], $twiddles[4]); let input3 = [$load_expr(3), $load_expr(11), $load_expr(19), $load_expr(27)]; let mut mid3 = AvxVector::column_butterfly4(input3, $rotation); mid3[1] = AvxVector::mul_complex(mid3[1], $twiddles[2]); mid3[2] = AvxVector::mul_complex(mid3[2], $twiddles[4]); mid3[3] = AvxVector::mul_complex(mid3[3], $twiddles[0].rotate90($rotation)); let input4 = [$load_expr(4), $load_expr(12), $load_expr(20), $load_expr(28)]; let mut mid4 = AvxVector::column_butterfly4(input4, $rotation); mid4[1] = avx_vector::apply_butterfly8_twiddle1(mid4[1], $rotation); mid4[2] = mid4[2].rotate90($rotation); mid4[3] = avx_vector::apply_butterfly8_twiddle3(mid4[3], $rotation); let input5 = [$load_expr(5), $load_expr(13), $load_expr(21), $load_expr(29)]; let mut mid5 = AvxVector::column_butterfly4(input5, $rotation); mid5[1] = AvxVector::mul_complex(mid5[1], $twiddles[3]); mid5[2] = AvxVector::mul_complex(mid5[2], $twiddles[1].rotate90($rotation)); mid5[3] = AvxVector::mul_complex(mid5[3], $twiddles[5].rotate90($rotation)); let input6 = [$load_expr(6), $load_expr(14), $load_expr(22), $load_expr(30)]; let mut mid6 = AvxVector::column_butterfly4(input6, $rotation); mid6[1] = AvxVector::mul_complex(mid6[1], $twiddles[4]); mid6[2] = avx_vector::apply_butterfly8_twiddle3(mid6[2], $rotation); mid6[3] = AvxVector::mul_complex(mid6[3], $twiddles[1].neg()); let input7 = [$load_expr(7), $load_expr(15), $load_expr(23), $load_expr(31)]; let mut mid7 = AvxVector::column_butterfly4(input7, $rotation); mid7[1] = AvxVector::mul_complex(mid7[1], $twiddles[5]); mid7[2] = AvxVector::mul_complex(mid7[2], $twiddles[4].rotate90($rotation)); mid7[3] = AvxVector::mul_complex(mid7[3], $twiddles[3].neg()); let input0 = [$load_expr(0), $load_expr(8), $load_expr(16), $load_expr(24)]; let mid0 = AvxVector::column_butterfly4(input0, $rotation); // All of the data is now in the right format to just do a bunch of butterfly 8's in a loop. // Write the data out to the final output as we go so that the compiler can stop worrying about finding stack space for it for i in 0..4 { let output = AvxVector::column_butterfly8([mid0[i], mid1[i], mid2[i], mid3[i], mid4[i], mid5[i], mid6[i], mid7[i]], $rotation); $store_expr(output[0], i); $store_expr(output[1], i + 4); $store_expr(output[2], i + 8); $store_expr(output[3], i + 12); $store_expr(output[4], i + 16); $store_expr(output[5], i + 20); $store_expr(output[6], i + 24); $store_expr(output[7], i + 28); } ) } /// Multiply the complex numbers in `left` by the complex numbers in `right`. 
/// This is exactly the same as `mul_complex` in `AvxVector`, but this implementation also conjugates the `left` input before multiplying #[inline(always)] unsafe fn mul_complex_conjugated(left: V, right: V) -> V { // Extract the real and imaginary components from left into 2 separate registers let (left_real, left_imag) = V::duplicate_complex_components(left); // create a shuffled version of right where the imaginary values are swapped with the reals let right_shuffled = V::swap_complex_components(right); // multiply our duplicated imaginary left vector by our shuffled right vector. that will give us the right side of the traditional complex multiplication formula let output_right = V::mul(left_imag, right_shuffled); // use a FMA instruction to multiply together left side of the complex multiplication formula, then alternatingly add and subtract the left side from the right // By using subadd instead of addsub, we can conjugate the left side for free. V::fmsubadd(left_real, right, output_right) } // compute buffer[i] = buffer[i].conj() * multiplier[i] pairwise complex multiplication for each element. #[target_feature(enable = "avx", enable = "fma")] pub unsafe fn pairwise_complex_mul_assign_conjugated( mut buffer: &mut [Complex], multiplier: &[T::VectorType], ) { assert!(multiplier.len() * T::VectorType::COMPLEX_PER_VECTOR >= buffer.len()); // Assert to convince the compiler to omit bounds checks inside the loop for (i, mut buffer_chunk) in buffer .chunks_exact_mut(T::VectorType::COMPLEX_PER_VECTOR) .enumerate() { let left = buffer_chunk.load_complex(0); // Do a complex multiplication between `left` and `right` let product = mul_complex_conjugated(left, multiplier[i]); // Store the result buffer_chunk.store_complex(product, 0); } // Process the remainder, if there is one let remainder_count = buffer.len() % T::VectorType::COMPLEX_PER_VECTOR; if remainder_count > 0 { let remainder_index = buffer.len() - remainder_count; let remainder_multiplier = multiplier.last().unwrap(); match remainder_count { 1 => { let left = buffer.load_partial1_complex(remainder_index); let product = mul_complex_conjugated(left, remainder_multiplier.lo()); buffer.store_partial1_complex(product, remainder_index); } 2 => { let left = buffer.load_partial2_complex(remainder_index); let product = mul_complex_conjugated(left, remainder_multiplier.lo()); buffer.store_partial2_complex(product, remainder_index); } 3 => { let left = buffer.load_partial3_complex(remainder_index); let product = mul_complex_conjugated(left, *remainder_multiplier); buffer.store_partial3_complex(product, remainder_index); } _ => unreachable!(), } } } // compute output[i] = input[i].conj() * multiplier[i] pairwise complex multiplication for each element. 
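// Scalar reference for what the vectorized loop below computes per element (a sketch, following
// the convention of `mul_complex_conjugated` above, with in = input[i] and m = multiplier lane i):
//   out.re = in.re * m.re + in.im * m.im
//   out.im = in.re * m.im - in.im * m.re
// i.e. out = conj(in) * m. The fmsubadd (add in the even/real lanes, subtract in the odd/imaginary
// lanes) is what folds the conjugation of the left operand in for free.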
#[target_feature(enable = "avx", enable = "fma")] pub unsafe fn pairwise_complex_mul_conjugated( input: &[Complex], mut output: &mut [Complex], multiplier: &[T::VectorType], ) { assert!( multiplier.len() * T::VectorType::COMPLEX_PER_VECTOR >= input.len(), "multiplier len = {}, input len = {}", multiplier.len(), input.len() ); // Assert to convince the compiler to omit bounds checks inside the loop assert!(input.len() == output.len()); // Assert to convince the compiler to omit bounds checks inside the loop let main_loop_count = input.len() / T::VectorType::COMPLEX_PER_VECTOR; let remainder_count = input.len() % T::VectorType::COMPLEX_PER_VECTOR; for (i, m) in (&multiplier[..main_loop_count]).iter().enumerate() { let left = input.load_complex(i * T::VectorType::COMPLEX_PER_VECTOR); // Do a complex multiplication between `left` and `right` let product = mul_complex_conjugated(left, *m); // Store the result output.store_complex(product, i * T::VectorType::COMPLEX_PER_VECTOR); } // Process the remainder, if there is one if remainder_count > 0 { let remainder_index = input.len() - remainder_count; let remainder_multiplier = multiplier.last().unwrap(); match remainder_count { 1 => { let left = input.load_partial1_complex(remainder_index); let product = mul_complex_conjugated(left, remainder_multiplier.lo()); output.store_partial1_complex(product, remainder_index); } 2 => { let left = input.load_partial2_complex(remainder_index); let product = mul_complex_conjugated(left, remainder_multiplier.lo()); output.store_partial2_complex(product, remainder_index); } 3 => { let left = input.load_partial3_complex(remainder_index); let product = mul_complex_conjugated(left, *remainder_multiplier); output.store_partial3_complex(product, remainder_index); } _ => unreachable!(), } } } rustfft-6.2.0/src/avx/mod.rs000064400000000000000000000264070072674642500141010ustar 00000000000000use crate::{Fft, FftDirection, FftNum}; use std::arch::x86_64::{__m256, __m256d}; use std::sync::Arc; pub trait AvxNum: FftNum { type VectorType: AvxVector256; } impl AvxNum for f32 { type VectorType = __m256; } impl AvxNum for f64 { type VectorType = __m256d; } // Data that most (non-butterfly) SIMD FFT algorithms share // Algorithms aren't required to use this struct, but it allows for a lot of reduction in code duplication struct CommonSimdData { inner_fft: Arc>, twiddles: Box<[V]>, len: usize, inplace_scratch_len: usize, outofplace_scratch_len: usize, direction: FftDirection, } macro_rules! 
boilerplate_avx_fft { ($struct_name:ident, $len_fn:expr, $inplace_scratch_len_fn:expr, $out_of_place_scratch_len_fn:expr) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ) { let required_scratch = self.get_outofplace_scratch_len(); if scratch.len() < required_scratch || input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace( self.len(), input.len(), output.len(), self.get_outofplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { self.perform_fft_out_of_place(in_chunk, out_chunk, scratch) }, ); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace( self.len(), input.len(), output.len(), self.get_outofplace_scratch_len(), scratch.len(), ) } } fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { let required_scratch = self.get_inplace_scratch_len(); if scratch.len() < required_scratch || buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_inplace(chunk, scratch) }); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ) } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { $inplace_scratch_len_fn(self) } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { $out_of_place_scratch_len_fn(self) } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len_fn(self) } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } macro_rules! 
boilerplate_avx_fft_commondata { ($struct_name:ident) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ) { if self.len() == 0 { return; } let required_scratch = self.get_outofplace_scratch_len(); if scratch.len() < required_scratch || input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace( self.len(), input.len(), output.len(), self.get_outofplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { self.perform_fft_out_of_place(in_chunk, out_chunk, scratch) }, ); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace( self.len(), input.len(), output.len(), self.get_outofplace_scratch_len(), scratch.len(), ); } } fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { if self.len() == 0 { return; } let required_scratch = self.get_inplace_scratch_len(); if scratch.len() < required_scratch || buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_inplace(chunk, scratch) }); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { self.common_data.inplace_scratch_len } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { self.common_data.outofplace_scratch_len } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { self.common_data.len } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.common_data.direction } } }; } #[macro_use] mod avx_vector; mod avx32_butterflies; mod avx32_utils; mod avx64_butterflies; mod avx64_utils; mod avx_bluesteins; mod avx_mixed_radix; mod avx_raders; pub mod avx_planner; pub use self::avx32_butterflies::{ Butterfly11Avx, Butterfly128Avx, Butterfly12Avx, Butterfly16Avx, Butterfly24Avx, Butterfly256Avx, Butterfly27Avx, Butterfly32Avx, Butterfly36Avx, Butterfly48Avx, Butterfly512Avx, Butterfly54Avx, Butterfly5Avx, Butterfly64Avx, Butterfly72Avx, Butterfly7Avx, Butterfly8Avx, Butterfly9Avx, }; pub use self::avx64_butterflies::{ Butterfly11Avx64, Butterfly128Avx64, Butterfly12Avx64, Butterfly16Avx64, 
Butterfly18Avx64, Butterfly24Avx64, Butterfly256Avx64, Butterfly27Avx64, Butterfly32Avx64, Butterfly36Avx64, Butterfly512Avx64, Butterfly5Avx64, Butterfly64Avx64, Butterfly7Avx64, Butterfly8Avx64, Butterfly9Avx64, }; pub use self::avx_bluesteins::BluesteinsAvx; pub use self::avx_mixed_radix::{ MixedRadix11xnAvx, MixedRadix12xnAvx, MixedRadix16xnAvx, MixedRadix2xnAvx, MixedRadix3xnAvx, MixedRadix4xnAvx, MixedRadix5xnAvx, MixedRadix6xnAvx, MixedRadix7xnAvx, MixedRadix8xnAvx, MixedRadix9xnAvx, }; pub use self::avx_raders::RadersAvx2; use self::avx_vector::AvxVector256; rustfft-6.2.0/src/common.rs000064400000000000000000000264200072674642500140070ustar 00000000000000use num_traits::{FromPrimitive, Signed}; use std::fmt::Debug; /// Generic floating point number, implemented for f32 and f64 pub trait FftNum: Copy + FromPrimitive + Signed + Sync + Send + Debug + 'static {} impl FftNum for T where T: Copy + FromPrimitive + Signed + Sync + Send + Debug + 'static {} // Prints an error raised by an in-place FFT algorithm's `process_inplace` method // Marked cold and inline never to keep all formatting code out of the many monomorphized process_inplace methods #[cold] #[inline(never)] pub fn fft_error_inplace( expected_len: usize, actual_len: usize, expected_scratch: usize, actual_scratch: usize, ) { assert!( actual_len >= expected_len, "Provided FFT buffer was too small. Expected len = {}, got len = {}", expected_len, actual_len ); assert_eq!( actual_len % expected_len, 0, "Input FFT buffer must be a multiple of FFT length. Expected multiple of {}, got len = {}", expected_len, actual_len ); assert!( actual_scratch >= expected_scratch, "Not enough scratch space was provided. Expected scratch len >= {}, got scratch len = {}", expected_scratch, actual_scratch ); } // Prints an error raised by an in-place FFT algorithm's `process_inplace` method // Marked cold and inline never to keep all formatting code out of the many monomorphized process_inplace methods #[cold] #[inline(never)] pub fn fft_error_outofplace( expected_len: usize, actual_input: usize, actual_output: usize, expected_scratch: usize, actual_scratch: usize, ) { assert_eq!(actual_input, actual_output, "Provided FFT input buffer and output buffer must have the same length. Got input.len() = {}, output.len() = {}", actual_input, actual_output); assert!( actual_input >= expected_len, "Provided FFT buffer was too small. Expected len = {}, got len = {}", expected_len, actual_input ); assert_eq!( actual_input % expected_len, 0, "Input FFT buffer must be a multiple of FFT length. Expected multiple of {}, got len = {}", expected_len, actual_input ); assert!( actual_scratch >= expected_scratch, "Not enough scratch space was provided. Expected scratch len >= {}, got scratch len = {}", expected_scratch, actual_scratch ); } macro_rules! 
boilerplate_fft_oop { ($struct_name:ident, $len_fn:expr) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if self.len() == 0 { return; } if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let result = array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { self.perform_fft_out_of_place(in_chunk, out_chunk, &mut []) }, ); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { if self.len() == 0 { return; } let required_scratch = self.get_inplace_scratch_len(); if scratch.len() < required_scratch || buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_out_of_place(chunk, scratch, &mut []); chunk.copy_from_slice(scratch); }); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { self.len() } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len_fn(self) } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } macro_rules! 
boilerplate_fft { ($struct_name:ident, $len_fn:expr, $inplace_scratch_len_fn:expr, $out_of_place_scratch_len_fn:expr) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ) { if self.len() == 0 { return; } let required_scratch = self.get_outofplace_scratch_len(); if scratch.len() < required_scratch || input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace( self.len(), input.len(), output.len(), self.get_outofplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { self.perform_fft_out_of_place(in_chunk, out_chunk, scratch) }, ); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace( self.len(), input.len(), output.len(), self.get_outofplace_scratch_len(), scratch.len(), ); } } fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { if self.len() == 0 { return; } let required_scratch = self.get_inplace_scratch_len(); if scratch.len() < required_scratch || buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_inplace(chunk, scratch) }); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { $inplace_scratch_len_fn(self) } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { $out_of_place_scratch_len_fn(self) } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len_fn(self) } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } rustfft-6.2.0/src/fft_cache.rs000064400000000000000000000024740072674642500144240ustar 00000000000000use std::{collections::HashMap, sync::Arc}; use crate::{Fft, FftDirection}; pub(crate) struct FftCache { forward_cache: HashMap>>, inverse_cache: HashMap>>, } impl FftCache { pub fn new() -> Self { Self { forward_cache: HashMap::new(), inverse_cache: HashMap::new(), } } #[allow(unused)] pub fn contains_fft(&self, len: usize, direction: FftDirection) -> bool { match direction { FftDirection::Forward => self.forward_cache.contains_key(&len), FftDirection::Inverse => 
self.inverse_cache.contains_key(&len), } } pub fn get(&self, len: usize, direction: FftDirection) -> Option>> { match direction { FftDirection::Forward => self.forward_cache.get(&len), FftDirection::Inverse => self.inverse_cache.get(&len), } .map(Arc::clone) } pub fn insert(&mut self, fft: &Arc>) { let cloned = Arc::clone(fft); let len = cloned.len(); match cloned.fft_direction() { FftDirection::Forward => self.forward_cache.insert(len, cloned), FftDirection::Inverse => self.inverse_cache.insert(len, cloned), }; } } rustfft-6.2.0/src/lib.rs000064400000000000000000000744620072674642500132760ustar 00000000000000#![cfg_attr(all(feature = "bench", test), feature(test))] //! RustFFT is a high-performance FFT library written in pure Rust. //! //! On X86_64, RustFFT supports the AVX instruction set for increased performance. No special code is needed to activate AVX: //! Simply plan a FFT using the FftPlanner on a machine that supports the `avx` and `fma` CPU features, and RustFFT //! will automatically switch to faster AVX-accelerated algorithms. //! //! For machines that do not have AVX, RustFFT also supports the SSE4.1 instruction set. //! As for AVX, this is enabled automatically when using the FftPlanner. //! //! Additionally, there is automatic support for the Neon instruction set on AArch64, //! and support for WASM SIMD when compiling for WASM targets. //! //! ### Usage //! //! The recommended way to use RustFFT is to create a [`FftPlanner`](crate::FftPlanner) instance and then call its //! [`plan_fft`](crate::FftPlanner::plan_fft) method. This method will automatically choose which FFT algorithms are best //! for a given size and initialize the required buffers and precomputed data. //! //! ``` //! // Perform a forward FFT of size 1234 //! use rustfft::{FftPlanner, num_complex::Complex}; //! //! let mut planner = FftPlanner::new(); //! let fft = planner.plan_fft_forward(1234); //! //! let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234]; //! fft.process(&mut buffer); //! ``` //! The planner returns trait objects of the [`Fft`](crate::Fft) trait, allowing for FFT sizes that aren't known //! until runtime. //! //! RustFFT also exposes individual FFT algorithms. For example, if you know beforehand that you need a power-of-two FFT, you can //! avoid the overhead of the planner and trait object by directly creating instances of the [`Radix4`](crate::algorithm::Radix4) algorithm: //! //! ``` //! // Computes a forward FFT of size 4096 //! use rustfft::{Fft, FftDirection, num_complex::Complex, algorithm::Radix4}; //! //! let fft = Radix4::new(4096, FftDirection::Forward); //! //! let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 4096]; //! fft.process(&mut buffer); //! ``` //! //! For the vast majority of situations, simply using the [`FftPlanner`](crate::FftPlanner) will be enough, but //! advanced users may have better insight than the planner into which algorithms are best for a specific size. See the //! [`algorithm`](crate::algorithm) module for a complete list of scalar algorithms implemented by RustFFT. //! //! Users should beware, however, that bypassing the planner will disable all AVX, SSE, Neon, and WASM SIMD optimizations. //! //! ### Feature Flags //! //! * `avx` (Enabled by default) //! //! On x86_64, the `avx` feature enables compilation of AVX-accelerated code. Enabling it greatly improves performance if the //! client CPU supports AVX and FMA, while disabling it reduces compile time and binary size. //! //! 
On every platform besides x86_64, this feature does nothing, and RustFFT will behave like it's not set. //! * `sse` (Enabled by default) //! //! On x86_64, the `sse` feature enables compilation of SSE4.1-accelerated code. Enabling it improves performance //! if the client CPU supports SSE4.1, while disabling it reduces compile time and binary size. If AVX is also //! supported and its feature flag is enabled, RustFFT will use AVX instead of SSE4.1. //! //! On every platform besides x86_64, this feature does nothing, and RustFFT will behave like it's not set. //! * `neon` (Enabled by default) //! //! On AArch64 (64-bit ARM) the `neon` feature enables compilation of Neon-accelerated code. Enabling it improves //! performance, while disabling it reduces compile time and binary size. //! //! On every platform besides AArch64, this feature does nothing, and RustFFT will behave like it's not set. //! * `wasm_simd` (Disabled by default) //! //! On the WASM platform, this feature enables compilation of WASM SIMD accelerated code. //! //! To execute binaries compiled with `wasm_simd`, you need a [target browser or runtime which supports `fixed-width SIMD`](https://webassembly.org/roadmap/). //! If you run your SIMD accelerated code on an unsupported platform, WebAssembly will specify a [trap](https://webassembly.github.io/spec/core/intro/overview.html#trap) leading to immediate execution cancelation. //! //! On every platform besides WASM, this feature does nothing and RustFFT will behave like it is not set. //! //! ### Normalization //! //! RustFFT does not normalize outputs. Callers must manually normalize the results by scaling each element by //! `1/len().sqrt()`. Multiple normalization steps can be merged into one via pairwise multiplication, so when //! doing a forward FFT followed by an inverse callers can normalize once by scaling each element by `1/len()` //! //! ### Output Order //! //! Elements in the output are ordered by ascending frequency, with the first element corresponding to frequency 0. //! //! ### AVX Performance Tips //! //! In any FFT computation, the time required to compute a FFT of size N relies heavily on the [prime factorization](https://en.wikipedia.org/wiki/Integer_factorization) of N. //! If N's prime factors are all very small, computing a FFT of size N will be fast, and it'll be slow if N has large prime //! factors, or if N is a prime number. //! //! In most FFT libraries (Including RustFFT when using non-AVX code), power-of-two FFT sizes are the fastest, and users see a steep //! falloff in performance when using non-power-of-two sizes. Thankfully, RustFFT using AVX acceleration is not quite as restrictive: //! //! - Any FFT whose size is of the form `2^n * 3^m` can be considered the "fastest" in RustFFT. //! - Any FFT whose prime factors are all 11 or smaller will also be very fast, but the fewer the factors of 2 and 3 the slower it will be. //! For example, computing a FFT of size 13552 `(2^4*7*11*11)` is takes 12% longer to compute than 13824 `(2^9 * 3^3)`, //! and computing a FFT of size 2541 `(3*7*11*11)` takes 65% longer to compute than 2592 `(2^5 * 3^4)` //! - Any other FFT size will be noticeably slower. A considerable amount of effort has been put into making these FFT sizes as fast as //! they can be, but some FFT sizes just take more work than others. For example, computing a FFT of size 5183 `(71 * 73)` takes about //! 5x longer than computing a FFT of size 5184 `(2^6 * 3^4)`. //! //! 
In most cases, even prime-sized FFTs will be fast enough for your application. In the example of 5183 above, even that "slow" FFT //! only takes a few tens of microseconds to compute. //! //! Some applications of the FFT allow for choosing an arbitrary FFT size (In many applications the size is pre-determined by whatever you're computing). //! If your application supports choosing your own size, our advice is still to start by trying the size that's most convenient to your application. //! If that's too slow, see if you can find a nearby size whose prime factors are all 11 or smaller, and you can expect a 2x-5x speedup. //! If that's still too slow, find a nearby size whose prime factors are all 2 or 3, and you can expect a 1.1x-1.5x speedup. use std::fmt::Display; pub use num_complex; pub use num_traits; #[macro_use] mod common; /// Individual FFT algorithms pub mod algorithm; mod array_utils; mod fft_cache; mod math_utils; mod plan; mod twiddles; use num_complex::Complex; use num_traits::Zero; pub use crate::common::FftNum; pub use crate::plan::{FftPlanner, FftPlannerScalar}; /// A trait that allows FFT algorithms to report their expected input/output size pub trait Length { /// The FFT size that this algorithm can process fn len(&self) -> usize; } /// Represents a FFT direction, IE a forward FFT or an inverse FFT #[derive(Copy, Clone, PartialEq, Eq, Debug)] pub enum FftDirection { Forward, Inverse, } impl FftDirection { /// Returns the opposite direction of `self`. /// /// - If `self` is `FftDirection::Forward`, returns `FftDirection::Inverse` /// - If `self` is `FftDirection::Inverse`, returns `FftDirection::Forward` #[inline] pub fn opposite_direction(&self) -> FftDirection { match self { Self::Forward => Self::Inverse, Self::Inverse => Self::Forward, } } } impl Display for FftDirection { fn fmt(&self, f: &mut ::std::fmt::Formatter) -> Result<(), ::std::fmt::Error> { match self { Self::Forward => f.write_str("Forward"), Self::Inverse => f.write_str("Inverse"), } } } /// A trait that allows FFT algorithms to report whether they compute forward FFTs or inverse FFTs pub trait Direction { /// Returns FftDirection::Forward if this instance computes forward FFTs, or FftDirection::Inverse for inverse FFTs fn fft_direction(&self) -> FftDirection; } /// Trait for algorithms that compute FFTs. /// /// This trait has a few methods for computing FFTs. Its most conveinent method is [`process(slice)`](crate::Fft::process). /// It takes in a slice of `Complex` and computes a FFT on that slice, in-place. It may copy the data over to internal scratch buffers /// if that speeds up the computation, but the output will always end up in the same slice as the input. pub trait Fft: Length + Direction + Sync + Send { /// Computes a FFT in-place. /// /// Convenience method that allocates a `Vec` with the required scratch space and calls `self.process_with_scratch`. /// If you want to re-use that allocation across multiple FFT computations, consider calling `process_with_scratch` instead. /// /// # Panics /// /// This method panics if: /// - `buffer.len() % self.len() > 0` /// - `buffer.len() < self.len()` fn process(&self, buffer: &mut [Complex]) { let mut scratch = vec![Complex::zero(); self.get_inplace_scratch_len()]; self.process_with_scratch(buffer, &mut scratch); } /// Divides `buffer` into chunks of size `self.len()`, and computes a FFT on each chunk. /// /// Uses the `scratch` buffer as scratch space, so the contents of `scratch` should be considered garbage /// after calling. 
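///
/// For example, one possible way to reuse a single scratch allocation while processing many
/// signals of the same length (an illustrative sketch; the sizes below are arbitrary):
///
/// ~~~
/// use rustfft::{FftPlanner, num_complex::Complex};
///
/// let mut planner = FftPlanner::new();
/// let fft = planner.plan_fft_forward(64);
///
/// // Ten back-to-back signals of length 64, transformed chunk by chunk in one call
/// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 64 * 10];
/// let mut scratch = vec![Complex{ re: 0.0f32, im: 0.0f32 }; fft.get_inplace_scratch_len()];
/// fft.process_with_scratch(&mut buffer, &mut scratch);
/// ~~~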
/// /// # Panics /// /// This method panics if: /// - `buffer.len() % self.len() > 0` /// - `buffer.len() < self.len()` /// - `scratch.len() < self.get_inplace_scratch_len()` fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]); /// Divides `input` and `output` into chunks of size `self.len()`, and computes a FFT on each chunk. /// /// This method uses both the `input` buffer and `scratch` buffer as scratch space, so the contents of both should be /// considered garbage after calling. /// /// This is a more niche way of computing a FFT. It's useful to avoid a `copy_from_slice()` if you need the output /// in a different buffer than the input for some reason. This happens frequently in RustFFT internals, but is probably /// less common among RustFFT users. /// /// For many FFT sizes, `self.get_outofplace_scratch_len()` returns 0 /// /// # Panics /// /// This method panics if: /// - `output.len() != input.len()` /// - `input.len() % self.len() > 0` /// - `input.len() < self.len()` /// - `scratch.len() < self.get_outofplace_scratch_len()` fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ); /// Returns the size of the scratch buffer required by `process_with_scratch` /// /// For most FFT sizes, this method will return `self.len()`. For a few small sizes it will return 0, and for some special FFT sizes /// (Sizes that require the use of Bluestein's Algorithm), this may return a scratch size larger than `self.len()`. /// The returned value may change from one version of RustFFT to the next. fn get_inplace_scratch_len(&self) -> usize; /// Returns the size of the scratch buffer required by `process_outofplace_with_scratch` /// /// For most FFT sizes, this method will return 0. For some special FFT sizes /// (Sizes that require the use of Bluestein's Algorithm), this may return a scratch size larger than `self.len()`. /// The returned value may change from one version of RustFFT to the next. fn get_outofplace_scratch_len(&self) -> usize; } // Algorithms implemented to use AVX instructions. Only compiled on x86_64, and only compiled if the "avx" feature flag is set. #[cfg(all(target_arch = "x86_64", feature = "avx"))] mod avx; // If we're not on x86_64, or if the "avx" feature was disabled, keep a stub implementation around that has the same API, but does nothing // That way, users can write code using the AVX planner and compile it on any platform #[cfg(not(all(target_arch = "x86_64", feature = "avx")))] mod avx { pub mod avx_planner { use crate::{Fft, FftDirection, FftNum}; use std::sync::Arc; /// The AVX FFT planner creates new FFT algorithm instances which take advantage of the AVX instruction set. /// /// Creating an instance of `FftPlannerAvx` requires the `avx` and `fma` instructions to be available on the current machine, and it requires RustFFT's /// `avx` feature flag to be set. A few algorithms will use `avx2` if it's available, but it isn't required. /// /// For the time being, AVX acceleration is black box, and AVX accelerated algorithms are not available without a planner. This may change in the future. 
/// /// ~~~ /// // Perform a forward Fft of size 1234, accelerated by AVX /// use std::sync::Arc; /// use rustfft::{FftPlannerAvx, num_complex::Complex}; /// /// // If FftPlannerAvx::new() returns Ok(), we'll know AVX algorithms are available /// // on this machine, and that RustFFT was compiled with the `avx` feature flag /// if let Ok(mut planner) = FftPlannerAvx::new() { /// let fft = planner.plan_fft_forward(1234); /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234]; /// fft.process(&mut buffer); /// /// // The FFT instance returned by the planner has the type `Arc>`, /// // where T is the numeric type, ie f32 or f64, so it's cheap to clone /// let fft_clone = Arc::clone(&fft); /// } /// ~~~ /// /// If you plan on creating multiple FFT instances, it is recommended to reuse the same planner for all of them. This /// is because the planner re-uses internal data across FFT instances wherever possible, saving memory and reducing /// setup time. (FFT instances created with one planner will never re-use data and buffers with FFT instances created /// by a different planner) /// /// Each FFT instance owns [`Arc`s](std::sync::Arc) to its internal data, rather than borrowing it from the planner, so it's perfectly /// safe to drop the planner after creating Fft instances. pub struct FftPlannerAvx { _phantom: std::marker::PhantomData, } impl FftPlannerAvx { /// Constructs a new `FftPlannerAvx` instance. /// /// Returns `Ok(planner_instance)` if this machine has the required instruction sets and the `avx` feature flag is set. /// Returns `Err(())` if some instruction sets are missing, or if the `avx` feature flag is not set. pub fn new() -> Result { Err(()) } /// Returns a `Fft` instance which uses AVX instructions to compute FFTs of size `len`. /// /// If the provided `direction` is `FftDirection::Forward`, the returned instance will compute forward FFTs. If it's `FftDirection::Inverse`, it will compute inverse FFTs. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft(&mut self, _len: usize, _direction: FftDirection) -> Arc> { unreachable!() } /// Returns a `Fft` instance which uses AVX instructions to compute forward FFTs of size `len`. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_forward(&mut self, _len: usize) -> Arc> { unreachable!() } /// Returns a `Fft` instance which uses AVX instructions to compute inverse FFTs of size `len. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_inverse(&mut self, _len: usize) -> Arc> { unreachable!() } } } } pub use self::avx::avx_planner::FftPlannerAvx; // Algorithms implemented to use SSE4.1 instructions. Only compiled on x86_64, and only compiled if the "sse" feature flag is set. 
#[cfg(all(target_arch = "x86_64", feature = "sse"))] mod sse; // If we're not on x86_64, or if the "sse" feature was disabled, keep a stub implementation around that has the same API, but does nothing // That way, users can write code using the SSE planner and compile it on any platform #[cfg(not(all(target_arch = "x86_64", feature = "sse")))] mod sse { pub mod sse_planner { use crate::{Fft, FftDirection, FftNum}; use std::sync::Arc; /// The SSE FFT planner creates new FFT algorithm instances using a mix of scalar and SSE accelerated algorithms. /// It requires at least SSE4.1, which is available on all reasonably recent x86_64 cpus. /// /// RustFFT has several FFT algorithms available. For a given FFT size, the `FftPlannerSse` decides which of the /// available FFT algorithms to use and then initializes them. /// /// ~~~ /// // Perform a forward Fft of size 1234 /// use std::sync::Arc; /// use rustfft::{FftPlannerSse, num_complex::Complex}; /// /// if let Ok(mut planner) = FftPlannerSse::new() { /// let fft = planner.plan_fft_forward(1234); /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234]; /// fft.process(&mut buffer); /// /// // The FFT instance returned by the planner has the type `Arc>`, /// // where T is the numeric type, ie f32 or f64, so it's cheap to clone /// let fft_clone = Arc::clone(&fft); /// } /// ~~~ /// /// If you plan on creating multiple FFT instances, it is recommended to reuse the same planner for all of them. This /// is because the planner re-uses internal data across FFT instances wherever possible, saving memory and reducing /// setup time. (FFT instances created with one planner will never re-use data and buffers with FFT instances created /// by a different planner) /// /// Each FFT instance owns [`Arc`s](std::sync::Arc) to its internal data, rather than borrowing it from the planner, so it's perfectly /// safe to drop the planner after creating Fft instances. pub struct FftPlannerSse { _phantom: std::marker::PhantomData, } impl FftPlannerSse { /// Creates a new `FftPlannerSse` instance. /// /// Returns `Ok(planner_instance)` if this machine has the required instruction sets. /// Returns `Err(())` if some instruction sets are missing. pub fn new() -> Result { Err(()) } /// Returns a `Fft` instance which uses SSE4.1 instructions to compute FFTs of size `len`. /// /// If the provided `direction` is `FftDirection::Forward`, the returned instance will compute forward FFTs. If it's `FftDirection::Inverse`, it will compute inverse FFTs. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft(&mut self, _len: usize, _direction: FftDirection) -> Arc> { unreachable!() } /// Returns a `Fft` instance which uses SSE4.1 instructions to compute forward FFTs of size `len`. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_forward(&mut self, _len: usize) -> Arc> { unreachable!() } /// Returns a `Fft` instance which uses SSE4.1 instructions to compute inverse FFTs of size `len. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_inverse(&mut self, _len: usize) -> Arc> { unreachable!() } } } } pub use self::sse::sse_planner::FftPlannerSse; // Algorithms implemented to use Neon instructions. 
Only compiled on AArch64, and only compiled if the "neon" feature flag is set. #[cfg(all(target_arch = "aarch64", feature = "neon"))] mod neon; // If we're not on AArch64, or if the "neon" feature was disabled, keep a stub implementation around that has the same API, but does nothing // That way, users can write code using the Neon planner and compile it on any platform #[cfg(not(all(target_arch = "aarch64", feature = "neon")))] mod neon { pub mod neon_planner { use crate::{Fft, FftDirection, FftNum}; use std::sync::Arc; /// The Neon FFT planner creates new FFT algorithm instances using a mix of scalar and Neon accelerated algorithms. /// It is supported when using the 64-bit AArch64 instruction set. /// /// RustFFT has several FFT algorithms available. For a given FFT size, the `FftPlannerNeon` decides which of the /// available FFT algorithms to use and then initializes them. /// /// ~~~ /// // Perform a forward Fft of size 1234 /// use std::sync::Arc; /// use rustfft::{FftPlannerNeon, num_complex::Complex}; /// /// if let Ok(mut planner) = FftPlannerNeon::new() { /// let fft = planner.plan_fft_forward(1234); /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234]; /// fft.process(&mut buffer); /// /// // The FFT instance returned by the planner has the type `Arc>`, /// // where T is the numeric type, ie f32 or f64, so it's cheap to clone /// let fft_clone = Arc::clone(&fft); /// } /// ~~~ /// /// If you plan on creating multiple FFT instances, it is recommended to reuse the same planner for all of them. This /// is because the planner re-uses internal data across FFT instances wherever possible, saving memory and reducing /// setup time. (FFT instances created with one planner will never re-use data and buffers with FFT instances created /// by a different planner) /// /// Each FFT instance owns [`Arc`s](std::sync::Arc) to its internal data, rather than borrowing it from the planner, so it's perfectly /// safe to drop the planner after creating Fft instances. pub struct FftPlannerNeon { _phantom: std::marker::PhantomData, } impl FftPlannerNeon { /// Creates a new `FftPlannerNeon` instance. /// /// Returns `Ok(planner_instance)` if this machine has the required instruction sets. /// Returns `Err(())` if some instruction sets are missing. pub fn new() -> Result { Err(()) } /// Returns a `Fft` instance which uses Neon instructions to compute FFTs of size `len`. /// /// If the provided `direction` is `FftDirection::Forward`, the returned instance will compute forward FFTs. If it's `FftDirection::Inverse`, it will compute inverse FFTs. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft(&mut self, _len: usize, _direction: FftDirection) -> Arc> { unreachable!() } /// Returns a `Fft` instance which uses Neon instructions to compute forward FFTs of size `len`. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_forward(&mut self, _len: usize) -> Arc> { unreachable!() } /// Returns a `Fft` instance which uses Neon instructions to compute inverse FFTs of size `len. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. 
pub fn plan_fft_inverse(&mut self, _len: usize) -> Arc> { unreachable!() } } } } pub use self::neon::neon_planner::FftPlannerNeon; #[cfg(all(target_arch = "wasm32", feature = "wasm_simd"))] mod wasm_simd; // If we're not compiling to WebAssembly, or if the "wasm_simd" feature was disabled, keep a stub implementation around that has the same API, but does nothing // That way, users can write code using the WASM planner and compile it on any platform #[cfg(not(all(target_arch = "wasm32", feature = "wasm_simd")))] mod wasm_simd { pub mod wasm_simd_planner { use crate::{Fft, FftDirection, FftNum}; use std::sync::Arc; /// The WASM FFT planner creates new FFT algorithm instances using a mix of scalar and WASM SIMD accelerated algorithms. /// It is supported when using fairly recent browser versions as outlined in [the WebAssembly roadmap](https://webassembly.org/roadmap/). /// /// RustFFT has several FFT algorithms available. For a given FFT size, `FftPlannerWasmSimd` decides which of the /// available FFT algorithms to use and then initializes them. /// /// ~~~ /// // Perform a forward Fft of size 1234 /// use std::sync::Arc; /// use rustfft::{FftPlannerWasmSimd, num_complex::Complex}; /// /// if let Ok(mut planner) = FftPlannerWasmSimd::new() { /// let fft = planner.plan_fft_forward(1234); /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234]; /// fft.process(&mut buffer); /// /// // The FFT instance returned by the planner has the type `Arc>`, /// // where T is the numeric type, ie f32 or f64, so it's cheap to clone /// let fft_clone = Arc::clone(&fft); /// } /// ~~~ /// /// If you plan on creating multiple FFT instances, it is recommended to reuse the same planner for all of them. This /// is because the planner re-uses internal data across FFT instances wherever possible, saving memory and reducing /// setup time. (FFT instances created with one planner will never re-use data and buffers with FFT instances created /// by a different planner) /// /// Each FFT instance owns [`Arc`s](std::sync::Arc) to its internal data, rather than borrowing it from the planner, so it's perfectly /// safe to drop the planner after creating Fft instances. pub struct FftPlannerWasmSimd { _phantom: std::marker::PhantomData, } impl FftPlannerWasmSimd { /// Creates a new `FftPlannerWasmSimd` instance. /// /// Returns `Ok(planner_instance)` if this machine has the required instruction sets. /// Returns `Err(())` if some instruction sets are missing. pub fn new() -> Result { Err(()) } /// Returns a `Fft` instance which uses WebAssembly SIMD instructions to compute FFTs of size `len`. /// /// If the provided `direction` is `FftDirection::Forward`, the returned instance will compute forward FFTs. If it's `FftDirection::Inverse`, it will compute inverse FFTs. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft(&mut self, _len: usize, _direction: FftDirection) -> Arc> { unreachable!() } /// Returns a `Fft` instance which uses WebAssembly SIMD instructions to compute forward FFTs of size `len`. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_forward(&mut self, _len: usize) -> Arc> { unreachable!() } /// Returns a `Fft` instance which uses WebAssembly SIMD instructions to compute inverse FFTs of size `len. 
/// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_inverse(&mut self, _len: usize) -> Arc> { unreachable!() } } } } pub use self::wasm_simd::wasm_simd_planner::FftPlannerWasmSimd; #[cfg(test)] mod test_utils; rustfft-6.2.0/src/math_utils.rs000064400000000000000000000753410072674642500146760ustar 00000000000000use num_traits::{One, PrimInt, Zero}; pub fn primitive_root(prime: u64) -> Option { let test_exponents: Vec = distinct_prime_factors(prime - 1) .iter() .map(|factor| (prime - 1) / factor) .collect(); 'next: for potential_root in 2..prime { // for each distinct factor, if potential_root^(p-1)/factor mod p is 1, reject it for exp in &test_exponents { if modular_exponent(potential_root, *exp, prime) == 1 { continue 'next; } } // if we reach this point, it means this root was not rejected, so return it return Some(potential_root); } None } /// computes base^exponent % modulo using the standard exponentiation by squaring algorithm pub fn modular_exponent(mut base: T, mut exponent: T, modulo: T) -> T { let one = T::one(); let mut result = one; while exponent > Zero::zero() { if exponent & one == one { result = result * base % modulo; } exponent = exponent >> One::one(); base = (base * base) % modulo; } result } /// return all of the prime factors of n, but omit duplicate prime factors pub fn distinct_prime_factors(mut n: u64) -> Vec { let mut result = Vec::new(); // handle 2 separately so we dont have to worry about adding 2 vs 1 if n % 2 == 0 { while n % 2 == 0 { n /= 2; } result.push(2); } if n > 1 { let mut divisor = 3; let mut limit = (n as f32).sqrt() as u64 + 1; while divisor < limit { if n % divisor == 0 { // remove as many factors as possible from n while n % divisor == 0 { n /= divisor; } result.push(divisor); // recalculate the limit to reduce the amount of work we need to do limit = (n as f32).sqrt() as u64 + 1; } divisor += 2; } if n > 1 { result.push(n); } } result } #[derive(Debug, PartialEq, Eq, Copy, Clone)] pub struct PrimeFactor { pub value: usize, pub count: u32, } #[derive(Clone, Debug)] pub struct PrimeFactors { other_factors: Vec, n: usize, power_two: u32, power_three: u32, total_factor_count: u32, distinct_factor_count: u32, } impl PrimeFactors { pub fn compute(mut n: usize) -> Self { let mut result = Self { other_factors: Vec::new(), n, power_two: 0, power_three: 0, total_factor_count: 0, distinct_factor_count: 0, }; // compute powers of two separately result.power_two = n.trailing_zeros(); result.total_factor_count += result.power_two; n >>= result.power_two; if result.power_two > 0 { result.distinct_factor_count += 1; } // also compute powers of three separately while n % 3 == 0 { result.power_three += 1; n /= 3; } result.total_factor_count += result.power_three; if result.power_three > 0 { result.distinct_factor_count += 1; } // if we have any other factors, gather them in the "other factors" vec if n > 1 { let mut divisor = 5; // compute divisor limit. if our divisor goes above this limit, we know we won't find any more factors. we'll revise it downwards as we discover factors. 
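// For example (illustrative walk-through of the loop below): for n = 175 = 5 * 5 * 7, the limit
// starts at 14; after dividing out both fives, n becomes 7 and the limit drops to 3, so the loop
// exits and the leftover 7 is picked up by the "one factor left" check below.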
let mut limit = (n as f32).sqrt() as usize + 1; while divisor < limit { // Count how many times this divisor divesthe remaining input let mut count = 0; while n % divisor == 0 { n /= divisor; count += 1; } // If this entry is actually a divisor of the given number, add it to the array if count > 0 { result.other_factors.push(PrimeFactor { value: divisor, count, }); result.total_factor_count += count; result.distinct_factor_count += 1; // recalculate the limit to reduce the amount of other factors we need to check limit = (n as f32).sqrt() as usize + 1; } divisor += 2; } // because of our limit logic, there might be one factor left if n > 1 { result .other_factors .push(PrimeFactor { value: n, count: 1 }); result.total_factor_count += 1; result.distinct_factor_count += 1; } } result } pub fn is_prime(&self) -> bool { self.total_factor_count == 1 } pub fn get_product(&self) -> usize { self.n } #[allow(unused)] pub fn get_total_factor_count(&self) -> u32 { self.total_factor_count } #[allow(unused)] pub fn get_distinct_factor_count(&self) -> u32 { self.distinct_factor_count } #[allow(unused)] pub fn get_power_of_two(&self) -> u32 { self.power_two } #[allow(unused)] pub fn get_power_of_three(&self) -> u32 { self.power_three } #[allow(unused)] pub fn get_other_factors(&self) -> &[PrimeFactor] { &self.other_factors } pub fn is_power_of_three(&self) -> bool { self.power_three > 0 && self.power_two == 0 && self.other_factors.len() == 0 } // Divides the number by the given prime factor. Returns None if the resulting number is one. pub fn remove_factors(mut self, factor: PrimeFactor) -> Option { if factor.count == 0 { return Some(self); } if factor.value == 2 { self.power_two = self.power_two.checked_sub(factor.count).unwrap(); self.n >>= factor.count; self.total_factor_count -= factor.count; if self.power_two == 0 { self.distinct_factor_count -= 1; } if self.n > 1 { return Some(self); } } else if factor.value == 3 { self.power_three = self.power_three.checked_sub(factor.count).unwrap(); self.n /= 3.pow(factor.count); self.total_factor_count -= factor.count; if self.power_two == 0 { self.distinct_factor_count -= 1; } if self.n > 1 { return Some(self); } } else { let found_factor = self .other_factors .iter_mut() .find(|item| item.value == factor.value) .unwrap(); found_factor.count = found_factor.count.checked_sub(factor.count).unwrap(); self.n /= factor.value.pow(factor.count); self.total_factor_count -= factor.count; if found_factor.count == 0 { self.distinct_factor_count -= 1; self.other_factors.retain(|item| item.value != factor.value); } if self.n > 1 { return Some(self); } } None } // Splits this set of prime factors into two different sets so that the products of the two sets are as close as possible pub fn partition_factors(mut self) -> (Self, Self) { // Make sure this isn't a prime number assert!(!self.is_prime()); // If the given length is a perfect square, put the square root into both returned arays if self.power_two % 2 == 0 && self.power_three % 2 == 0 && self .other_factors .iter() .all(|factor| factor.count % 2 == 0) { let mut new_product = 1; // cut our power of two in half self.power_two /= 2; new_product <<= self.power_two; // cout our power of three in half self.power_three /= 2; new_product *= 3.pow(self.power_three); // cut all our other factors in half for factor in self.other_factors.iter_mut() { factor.count /= 2; new_product *= factor.value.pow(factor.count); } // update our cached properties and return 2 copies of ourself self.total_factor_count /= 2; self.n = new_product; 
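// For example (illustrative): n = 36 = 2^2 * 3^2 takes this branch, and both returned halves
// describe 6 = 2 * 3.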
(self.clone(), self) } else if self.distinct_factor_count == 1 { // If there's only one factor, just split it as evenly as possible let mut half = Self { other_factors: Vec::new(), n: self.n, power_two: self.power_two / 2, power_three: self.power_three / 2, total_factor_count: self.total_factor_count / 2, distinct_factor_count: 1, }; // We computed one half via integer division -- compute the other half by subtracting the divided values fro mthe original self.power_two -= half.power_two; self.power_three -= half.power_three; self.total_factor_count -= half.total_factor_count; // Update the product values for each half, with different logic depending on what kind of single factor we have if let Some(first_factor) = self.other_factors.first_mut() { // we actualyl skipped updating the "other factor" earlier, so cut it in half and do the subtraction now assert!(first_factor.count > 1); // If this is only one, then we're prime. we passed the "is_prime" assert earlier, so that would be a contradiction let half_factor = PrimeFactor { value: first_factor.value, count: first_factor.count / 2, }; first_factor.count -= half_factor.count; half.other_factors.push(half_factor); self.n = first_factor.value.pow(first_factor.count); half.n = half_factor.value.pow(half_factor.count); } else if half.power_two > 0 { half.n = 1 << half.power_two; self.n = 1 << self.power_two; } else if half.power_three > 0 { half.n = 3.pow(half.power_three); self.n = 3.pow(self.power_three); } (self, half) } else { // we have a mixed bag of products. we're going to greedily try to evenly distribute entire groups of factors in one direction or the other let mut left_product = 1; let mut right_product = 1; // for each factor, put it in whichever cumulative half is smaller for factor in self.other_factors { let factor_product = factor.value.pow(factor.count as u32); if left_product <= right_product { left_product *= factor_product; } else { right_product *= factor_product; } } if left_product <= right_product { left_product <<= self.power_two; } else { right_product <<= self.power_two; } if self.power_three > 0 && left_product <= right_product { left_product *= 3.pow(self.power_three); } else { right_product *= 3.pow(self.power_three); } // now that we have our two products, compute a prime factorization for them // we could maintain factor lists internally to save some computation and an allocation, but it led to a lot of code and this is so much simpler (Self::compute(left_product), Self::compute(right_product)) } } } #[derive(Copy, Clone, Debug)] pub struct PartialFactors { power2: u32, power3: u32, power5: u32, power7: u32, power11: u32, other_factors: usize, } impl PartialFactors { #[allow(unused)] pub fn compute(len: usize) -> Self { let power2 = len.trailing_zeros(); let mut other_factors = len >> power2; let mut power3 = 0; while other_factors % 3 == 0 { power3 += 1; other_factors /= 3; } let mut power5 = 0; while other_factors % 5 == 0 { power5 += 1; other_factors /= 5; } let mut power7 = 0; while other_factors % 7 == 0 { power7 += 1; other_factors /= 7; } let mut power11 = 0; while other_factors % 11 == 0 { power11 += 1; other_factors /= 11; } Self { power2, power3, power5, power7, power11, other_factors, } } #[allow(unused)] pub fn get_power2(&self) -> u32 { self.power2 } #[allow(unused)] pub fn get_power3(&self) -> u32 { self.power3 } #[allow(unused)] pub fn get_power5(&self) -> u32 { self.power5 } #[allow(unused)] pub fn get_power7(&self) -> u32 { self.power7 } #[allow(unused)] pub fn get_power11(&self) -> u32 { 
self.power11 } #[allow(unused)] pub fn get_other_factors(&self) -> usize { self.other_factors } #[allow(unused)] pub fn product(&self) -> usize { (self.other_factors * 3.pow(self.power3) * 5.pow(self.power5) * 7.pow(self.power7) * 11.pow(self.power11)) << self.power2 } #[allow(unused)] pub fn product_power2power3(&self) -> usize { 3.pow(self.power3) << self.power2 } #[allow(unused)] pub fn divide_by(&self, divisor: &PartialFactors) -> Option { let two_divides = self.power2 >= divisor.power2; let three_divides = self.power3 >= divisor.power3; let five_divides = self.power5 >= divisor.power5; let seven_divides = self.power7 >= divisor.power7; let eleven_divides = self.power11 >= divisor.power11; let other_divides = self.other_factors % divisor.other_factors == 0; if two_divides && three_divides && five_divides && seven_divides && eleven_divides && other_divides { Some(Self { power2: self.power2 - divisor.power2, power3: self.power3 - divisor.power3, power5: self.power5 - divisor.power5, power7: self.power7 - divisor.power7, power11: self.power11 - divisor.power11, other_factors: if self.other_factors == divisor.other_factors { 1 } else { self.other_factors / divisor.other_factors }, }) } else { None } } } #[cfg(test)] mod unit_tests { use super::*; #[test] fn test_modular_exponent() { // make sure to test something that would overflow under ordinary circumstances // ie 3 ^ 416788 mod 47 let test_list = vec![ ((2, 8, 300), 256), ((2, 9, 300), 212), ((1, 9, 300), 1), ((3, 416788, 47), 8), ]; for (input, expected) in test_list { let (base, exponent, modulo) = input; let result = modular_exponent(base, exponent, modulo); assert_eq!(result, expected); } } #[test] fn test_primitive_root() { let test_list = vec![(3, 2), (7, 3), (11, 2), (13, 2), (47, 5), (7919, 7)]; for (input, expected) in test_list { let root = primitive_root(input).unwrap(); assert_eq!(root, expected); } } #[test] fn test_distinct_prime_factors() { let test_list = vec![ (46, vec![2, 23]), (2, vec![2]), (3, vec![3]), (162, vec![2, 3]), ]; for (input, expected) in test_list { let factors = distinct_prime_factors(input); assert_eq!(factors, expected); } } use std::collections::HashMap; macro_rules! 
map{ { $($key:expr => $value:expr),+ } => { { let mut m = HashMap::new(); $( m.insert($key, $value); )+ m } }; } fn assert_internally_consistent(prime_factors: &PrimeFactors) { let mut cumulative_product = 1; let mut discovered_distinct_factors = 0; let mut discovered_total_factors = 0; if prime_factors.get_power_of_two() > 0 { cumulative_product <<= prime_factors.get_power_of_two(); discovered_distinct_factors += 1; discovered_total_factors += prime_factors.get_power_of_two(); } if prime_factors.get_power_of_three() > 0 { cumulative_product *= 3.pow(prime_factors.get_power_of_three()); discovered_distinct_factors += 1; discovered_total_factors += prime_factors.get_power_of_three(); } for factor in prime_factors.get_other_factors() { assert!(factor.count > 0); cumulative_product *= factor.value.pow(factor.count); discovered_distinct_factors += 1; discovered_total_factors += factor.count; } assert_eq!(prime_factors.get_product(), cumulative_product); assert_eq!( prime_factors.get_distinct_factor_count(), discovered_distinct_factors ); assert_eq!( prime_factors.get_total_factor_count(), discovered_total_factors ); assert_eq!(prime_factors.is_prime(), discovered_total_factors == 1); } #[test] fn test_prime_factors() { #[derive(Debug)] struct ExpectedData { len: usize, factors: HashMap, total_factors: u32, distinct_factors: u32, is_prime: bool, } impl ExpectedData { fn new( len: usize, factors: HashMap, total_factors: u32, distinct_factors: u32, is_prime: bool, ) -> Self { Self { len, factors, total_factors, distinct_factors, is_prime, } } } let test_list = vec![ ExpectedData::new(2, map! { 2 => 1 }, 1, 1, true), ExpectedData::new(128, map! { 2 => 7 }, 7, 1, false), ExpectedData::new(3, map! { 3 => 1 }, 1, 1, true), ExpectedData::new(81, map! { 3 => 4 }, 4, 1, false), ExpectedData::new(5, map! { 5 => 1 }, 1, 1, true), ExpectedData::new(125, map! { 5 => 3 }, 3, 1, false), ExpectedData::new(97, map! { 97 => 1 }, 1, 1, true), ExpectedData::new(6, map! { 2 => 1, 3 => 1 }, 2, 2, false), ExpectedData::new(12, map! { 2 => 2, 3 => 1 }, 3, 2, false), ExpectedData::new(36, map! { 2 => 2, 3 => 2 }, 4, 2, false), ExpectedData::new(10, map! { 2 => 1, 5 => 1 }, 2, 2, false), ExpectedData::new(100, map! { 2 => 2, 5 => 2 }, 4, 2, false), ExpectedData::new(44100, map! 
{ 2 => 2, 3 => 2, 5 => 2, 7 => 2 }, 8, 4, false), ]; for expected in test_list { let factors = PrimeFactors::compute(expected.len); assert_eq!(factors.get_product(), expected.len); assert_eq!(factors.is_prime(), expected.is_prime); assert_eq!( factors.get_distinct_factor_count(), expected.distinct_factors ); assert_eq!(factors.get_total_factor_count(), expected.total_factors); assert_eq!( factors.get_power_of_two(), expected.factors.get(&2).map_or(0, |i| *i) ); assert_eq!( factors.get_power_of_three(), expected.factors.get(&3).map_or(0, |i| *i) ); // verify that every factor in the "other factors" array matches our expected map for factor in factors.get_other_factors() { assert_eq!(factor.count, *expected.factors.get(&factor.value).unwrap()); } // finally, verify that every factor in the "other factors" array was present in the "other factors" array let mut found_factors: Vec = factors .get_other_factors() .iter() .map(|factor| factor.value) .collect(); if factors.get_power_of_two() > 0 { found_factors.push(2); } if factors.get_power_of_three() > 0 { found_factors.push(3); } for key in expected.factors.keys() { assert!(found_factors.contains(key as &usize)); } } // in addition to our precomputed list, go through a bunch of ofther factors and just make sure they're internally consistent for n in 1..200 { let factors = PrimeFactors::compute(n); assert_eq!(factors.get_product(), n); assert_internally_consistent(&factors); } } #[test] fn test_partition_factors() { // We aren't going to verify the actual return value of "partition_factors", we're justgoing to make sure each half is internally consistent for n in 4..200 { let factors = PrimeFactors::compute(n); if !factors.is_prime() { let (left_factors, right_factors) = factors.partition_factors(); assert!(left_factors.get_product() > 1); assert!(right_factors.get_product() > 1); assert_eq!(left_factors.get_product() * right_factors.get_product(), n); assert_internally_consistent(&left_factors); assert_internally_consistent(&right_factors); } } } #[test] fn test_remove_factors() { // For every possible factor of a bunch of factors, they removing each and making sure the result is internally consistent for n in 2..200 { let factors = PrimeFactors::compute(n); for i in 0..=factors.get_power_of_two() { if let Some(removed_factors) = factors .clone() .remove_factors(PrimeFactor { value: 2, count: i }) { assert_eq!(removed_factors.get_product(), factors.get_product() >> i); assert_internally_consistent(&removed_factors); } else { // If the method returned None, this must be a power of two and i must be equal to the product assert!(n.is_power_of_two()); assert!(i == factors.get_power_of_two()); } } } } #[test] fn test_partial_factors() { #[derive(Debug)] struct ExpectedData { len: usize, power2: u32, power3: u32, power5: u32, power7: u32, power11: u32, other: usize, } let test_list = vec![ ExpectedData { len: 2, power2: 1, power3: 0, power5: 0, power7: 0, power11: 0, other: 1, }, ExpectedData { len: 128, power2: 7, power3: 0, power5: 0, power7: 0, power11: 0, other: 1, }, ExpectedData { len: 3, power2: 0, power3: 1, power5: 0, power7: 0, power11: 0, other: 1, }, ExpectedData { len: 81, power2: 0, power3: 4, power5: 0, power7: 0, power11: 0, other: 1, }, ExpectedData { len: 5, power2: 0, power3: 0, power5: 1, power7: 0, power11: 0, other: 1, }, ExpectedData { len: 125, power2: 0, power3: 0, power5: 3, power7: 0, power11: 0, other: 1, }, ExpectedData { len: 97, power2: 0, power3: 0, power5: 0, power7: 0, power11: 0, other: 97, }, ExpectedData { len: 6, 
power2: 1, power3: 1, power5: 0, power7: 0, power11: 0, other: 1, }, ExpectedData { len: 12, power2: 2, power3: 1, power5: 0, power7: 0, power11: 0, other: 1, }, ExpectedData { len: 36, power2: 2, power3: 2, power5: 0, power7: 0, power11: 0, other: 1, }, ExpectedData { len: 10, power2: 1, power3: 0, power5: 1, power7: 0, power11: 0, other: 1, }, ExpectedData { len: 100, power2: 2, power3: 0, power5: 2, power7: 0, power11: 0, other: 1, }, ExpectedData { len: 44100, power2: 2, power3: 2, power5: 2, power7: 2, power11: 0, other: 1, }, ExpectedData { len: 2310, power2: 1, power3: 1, power5: 1, power7: 1, power11: 1, other: 1, }, ]; for expected in test_list { let factors = PartialFactors::compute(expected.len); assert_eq!(factors.get_power2(), expected.power2); assert_eq!(factors.get_power3(), expected.power3); assert_eq!(factors.get_power5(), expected.power5); assert_eq!(factors.get_power7(), expected.power7); assert_eq!(factors.get_power11(), expected.power11); assert_eq!(factors.get_other_factors(), expected.other); assert_eq!( expected.len, (1 << factors.get_power2()) * 3.pow(factors.get_power3()) * 5.pow(factors.get_power5()) * 7.pow(factors.get_power7()) * 11.pow(factors.get_power11()) * factors.get_other_factors() ); assert_eq!(expected.len, factors.product()); assert_eq!( (1 << factors.get_power2()) * 3.pow(factors.get_power3()), factors.product_power2power3() ); assert_eq!(factors.get_other_factors().trailing_zeros(), 0); assert!(factors.get_other_factors() % 3 > 0); } // in addition to our precomputed list, go through a bunch of ofther factors and just make sure they're internally consistent for n in 1..200 { let factors = PartialFactors::compute(n); assert_eq!( n, (1 << factors.get_power2()) * 3.pow(factors.get_power3()) * 5.pow(factors.get_power5()) * 7.pow(factors.get_power7()) * 11.pow(factors.get_power11()) * factors.get_other_factors() ); assert_eq!(n, factors.product()); assert_eq!( (1 << factors.get_power2()) * 3.pow(factors.get_power3()), factors.product_power2power3() ); assert_eq!(factors.get_other_factors().trailing_zeros(), 0); assert!(factors.get_other_factors() % 3 > 0); } } #[test] fn test_partial_factors_divide_by() { for n in 2..200 { let factors = PartialFactors::compute(n); for power2 in 0..5 { for power3 in 0..4 { for power5 in 0..3 { for power7 in 0..3 { for power11 in 0..2 { for power13 in 0..2 { let divisor_product = (3.pow(power3) * 5.pow(power5) * 7.pow(power7) * 11.pow(power11) * 13.pow(power13)) << power2; let divisor = PartialFactors::compute(divisor_product); if let Some(quotient) = factors.divide_by(&divisor) { assert_eq!(quotient.product(), n / divisor_product); } else { assert!(n % divisor_product > 0); } } } } } } } } } } rustfft-6.2.0/src/neon/mod.rs000064400000000000000000000004750072674642500142370ustar 00000000000000#[macro_use] mod neon_common; #[macro_use] mod neon_vector; #[macro_use] pub mod neon_butterflies; pub mod neon_prime_butterflies; pub mod neon_radix4; mod neon_utils; pub mod neon_planner; pub use self::neon_butterflies::*; pub use self::neon_prime_butterflies::*; pub use self::neon_radix4::*; rustfft-6.2.0/src/neon/neon_butterflies.rs000064400000000000000000003755340072674642500170420ustar 00000000000000use core::arch::aarch64::*; use num_complex::Complex; use crate::{common::FftNum, FftDirection}; use crate::array_utils; use crate::array_utils::workaround_transmute_mut; use crate::array_utils::DoubleBuf; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::twiddles; use crate::{Direction, Fft, Length}; use 
super::neon_common::{assert_f32, assert_f64}; use super::neon_utils::*; use super::neon_vector::NeonArrayMut; #[allow(unused)] macro_rules! boilerplate_fft_neon_f32_butterfly { ($struct_name:ident, $len:expr, $direction_fn:expr) => { impl $struct_name { pub(crate) unsafe fn perform_fft_butterfly(&self, buffer: &mut [Complex]) { self.perform_fft_contiguous(workaround_transmute_mut::<_, Complex>(buffer)); } pub(crate) unsafe fn perform_parallel_fft_butterfly(&self, buffer: &mut [Complex]) { self.perform_parallel_fft_contiguous(workaround_transmute_mut::<_, Complex>( buffer, )); } // Do multiple ffts over a longer vector inplace, called from "process_with_scratch" of Fft trait pub(crate) unsafe fn perform_fft_butterfly_multi( &self, buffer: &mut [Complex], ) -> Result<(), ()> { let len = buffer.len(); let alldone = array_utils::iter_chunks(buffer, 2 * self.len(), |chunk| { self.perform_parallel_fft_butterfly(chunk) }); if alldone.is_err() && buffer.len() >= self.len() { self.perform_fft_butterfly(&mut buffer[len - self.len()..]); } Ok(()) } // Do multiple ffts over a longer vector outofplace, called from "process_outofplace_with_scratch" of Fft trait pub(crate) unsafe fn perform_oop_fft_butterfly_multi( &self, input: &mut [Complex], output: &mut [Complex], ) -> Result<(), ()> { let len = input.len(); let alldone = array_utils::iter_chunks_zipped( input, output, 2 * self.len(), |in_chunk, out_chunk| { let input_slice = workaround_transmute_mut(in_chunk); let output_slice = workaround_transmute_mut(out_chunk); self.perform_parallel_fft_contiguous(DoubleBuf { input: input_slice, output: output_slice, }) }, ); if alldone.is_err() && input.len() >= self.len() { let input_slice = workaround_transmute_mut(input); let output_slice = workaround_transmute_mut(output); self.perform_fft_contiguous(DoubleBuf { input: &mut input_slice[len - self.len()..], output: &mut output_slice[len - self.len()..], }) } Ok(()) } } }; } macro_rules! boilerplate_fft_neon_f64_butterfly { ($struct_name:ident, $len:expr, $direction_fn:expr) => { impl $struct_name { // Do a single fft //#[target_feature(enable = "neon")] pub(crate) unsafe fn perform_fft_butterfly(&self, buffer: &mut [Complex]) { self.perform_fft_contiguous(workaround_transmute_mut::<_, Complex>(buffer)); } // Do multiple ffts over a longer vector inplace, called from "process_with_scratch" of Fft trait //#[target_feature(enable = "neon")] pub(crate) unsafe fn perform_fft_butterfly_multi( &self, buffer: &mut [Complex], ) -> Result<(), ()> { array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_butterfly(chunk) }) } // Do multiple ffts over a longer vector outofplace, called from "process_outofplace_with_scratch" of Fft trait //#[target_feature(enable = "neon")] pub(crate) unsafe fn perform_oop_fft_butterfly_multi( &self, input: &mut [Complex], output: &mut [Complex], ) -> Result<(), ()> { array_utils::iter_chunks_zipped(input, output, self.len(), |in_chunk, out_chunk| { let input_slice = workaround_transmute_mut(in_chunk); let output_slice = workaround_transmute_mut(out_chunk); self.perform_fft_contiguous(DoubleBuf { input: input_slice, output: output_slice, }) }) } } }; } #[allow(unused)] macro_rules! 
boilerplate_fft_neon_common_butterfly { ($struct_name:ident, $len:expr, $direction_fn:expr) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let result = unsafe { self.perform_oop_fft_butterfly_multi(input, output) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], _scratch: &mut [Complex]) { if buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let result = unsafe { self.perform_fft_butterfly_multi(buffer) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { 0 } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { $direction_fn(self) } } }; } // _ _________ _ _ _ // / | |___ /___ \| |__ (_) |_ // | | _____ |_ \ __) | '_ \| | __| // | | |_____| ___) / __/| |_) | | |_ // |_| |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly1 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly1, 1, |this: &NeonF32Butterfly1<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly1, 1, |this: &NeonF32Butterfly1<_>| this .direction); impl NeonF32Butterfly1 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let value = buffer.load_partial1_complex(0); buffer.store_partial_lo_complex(value, 0); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let value = buffer.load_complex(0); buffer.store_complex(value, 0); } } // _ __ _ _ _ _ _ // / | / /_ | || | | |__ (_) |_ // | | _____ | '_ \| || |_| '_ \| | __| // | | |_____| | (_) |__ _| |_) | | |_ // |_| \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly1 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly1, 1, |this: 
&NeonF64Butterfly1<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly1, 1, |this: &NeonF64Butterfly1<_>| this .direction); impl NeonF64Butterfly1 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let value = buffer.load_complex(0); buffer.store_complex(value, 0); } } // ____ _________ _ _ _ // |___ \ |___ /___ \| |__ (_) |_ // __) | _____ |_ \ __) | '_ \| | __| // / __/ |_____| ___) / __/| |_) | | |_ // |_____| |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly2 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly2, 2, |this: &NeonF32Butterfly2<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly2, 2, |this: &NeonF32Butterfly2<_>| this .direction); impl NeonF32Butterfly2 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = buffer.load_complex(0); let temp = self.perform_fft_direct(values); buffer.store_complex(temp, 0); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let values_a = buffer.load_complex(0); let values_b = buffer.load_complex(2); let out = self.perform_parallel_fft_direct(values_a, values_b); let [out02, out13] = transpose_complex_2x2_f32(out[0], out[1]); buffer.store_complex(out02, 0); buffer.store_complex(out13, 2); } // length 2 fft of x, given as [x0, x1] // result is [X0, X1] #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: float32x4_t) -> float32x4_t { solo_fft2_f32(values) } // dual length 2 fft of x and y, given as [x0, x1], [y0, y1] // result is [X0, Y0], [X1, Y1] #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, values_x: float32x4_t, values_y: float32x4_t, ) -> [float32x4_t; 2] { parallel_fft2_contiguous_f32(values_x, values_y) } } // double lenth 2 fft of a and b, given as [x0, y0], [x1, y1] // result is [X0, Y0], [X1, Y1] #[inline(always)] pub(crate) unsafe fn parallel_fft2_interleaved_f32( val02: float32x4_t, val13: float32x4_t, ) -> [float32x4_t; 2] { let temp0 = vaddq_f32(val02, val13); let temp1 = vsubq_f32(val02, val13); [temp0, temp1] } // double lenth 2 fft of a and b, given as [x0, x1], [y0, y1] // result is [X0, Y0], [X1, Y1] #[inline(always)] unsafe fn parallel_fft2_contiguous_f32(left: float32x4_t, right: float32x4_t) -> [float32x4_t; 2] { let [temp02, temp13] = transpose_complex_2x2_f32(left, right); parallel_fft2_interleaved_f32(temp02, temp13) } // length 2 fft of x, given as [x0, x1] // result is [X0, X1] #[inline(always)] unsafe fn solo_fft2_f32(values: float32x4_t) -> float32x4_t { let high = vget_high_f32(values); let low = vget_low_f32(values); vcombine_f32(vadd_f32(low, high), vsub_f32(low, high)) } // ____ __ _ _ _ _ _ // |___ \ / /_ | || | | |__ (_) |_ // __) | _____ | '_ \| || |_| '_ \| | __| // / __/ |_____| | (_) |__ _| |_) | | |_ // |_____| \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly2 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly2, 2, |this: &NeonF64Butterfly2<_>| this .direction); 
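// The solo/parallel fft2 helpers in this file all evaluate the two-point DFT: X0 = x0 + x1 and X1 = x0 - x1.
// The test module below is an illustrative reference sketch, not part of the upstream implementation; the
// module and helper names are hypothetical, and it relies only on num_complex, which this crate already
// depends on. It checks the size-2 butterfly against the fact that applying it twice scales the input by the length.
#[cfg(test)]
mod size2_butterfly_reference_sketch {
    use num_complex::Complex;

    // Scalar two-point DFT; the Neon helpers in this file vectorize this same arithmetic
    // across one or two complex values per 128-bit register.
    fn dft2(x0: Complex<f64>, x1: Complex<f64>) -> [Complex<f64>; 2] {
        [x0 + x1, x0 - x1]
    }

    #[test]
    fn applying_twice_scales_by_len() {
        let x0 = Complex::new(1.0, 2.0);
        let x1 = Complex::new(-0.5, 0.25);
        let [y0, y1] = dft2(x0, x1);
        let [z0, z1] = dft2(y0, y1);
        // A forward size-2 DFT applied twice is the identity scaled by the length (2),
        // and these particular values are exact in binary floating point.
        assert_eq!(z0, x0 * 2.0);
        assert_eq!(z1, x1 * 2.0);
    }
}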
boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly2, 2, |this: &NeonF64Butterfly2<_>| this .direction); impl NeonF64Butterfly2 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let out = self.perform_fft_direct(value0, value1); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: float64x2_t, value1: float64x2_t, ) -> [float64x2_t; 2] { solo_fft2_f64(value0, value1) } } #[inline(always)] pub(crate) unsafe fn solo_fft2_f64(left: float64x2_t, right: float64x2_t) -> [float64x2_t; 2] { let temp0 = vaddq_f64(left, right); let temp1 = vsubq_f64(left, right); [temp0, temp1] } // _____ _________ _ _ _ // |___ / |___ /___ \| |__ (_) |_ // |_ \ _____ |_ \ __) | '_ \| | __| // ___) | |_____| ___) / __/| |_) | | |_ // |____/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly3 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle: float32x4_t, twiddle1re: float32x4_t, twiddle1im: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly3, 3, |this: &NeonF32Butterfly3<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly3, 3, |this: &NeonF32Butterfly3<_>| this .direction); impl NeonF32Butterfly3 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 3, direction); let twiddle = unsafe { vld1q_f32([tw1.re, tw1.re, -tw1.im, -tw1.im].as_ptr()) }; let twiddle1re = unsafe { vmovq_n_f32(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f32(tw1.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle, twiddle1re, twiddle1im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let value0x = buffer.load_partial1_complex(0); let value12 = buffer.load_complex(1); let out = self.perform_fft_direct(value0x, value12); buffer.store_partial_lo_complex(out[0], 0); buffer.store_complex(out[1], 1); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let valuea0a1 = buffer.load_complex(0); let valuea2b0 = buffer.load_complex(2); let valueb1b2 = buffer.load_complex(4); let value0 = extract_lo_hi_f32(valuea0a1, valuea2b0); let value1 = extract_hi_lo_f32(valuea0a1, valueb1b2); let value2 = extract_lo_hi_f32(valuea2b0, valueb1b2); let out = self.perform_parallel_fft_direct(value0, value1, value2); let out0 = extract_lo_lo_f32(out[0], out[1]); let out1 = extract_lo_hi_f32(out[2], out[0]); let out2 = extract_hi_hi_f32(out[1], out[2]); buffer.store_complex(out0, 0); buffer.store_complex(out1, 2); buffer.store_complex(out2, 4); } // length 3 fft of a, given as [x0, 0.0], [x1, x2] // result is [X0, Z], [X1, X2] // The value Z should be discarded. 
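// In scalar form, with w = exp(+/-2*pi*i/3) chosen by the FFT direction, the size-3 DFT
// computed by perform_fft_direct below is:
//   X0 = x0 + (x1 + x2)
//   X1 = x0 + Re(w)*(x1 + x2) + i*Im(w)*(x1 - x2)
//   X2 = x0 + Re(w)*(x1 + x2) - i*Im(w)*(x1 - x2)
// which follows from w^2 = conj(w). Because x1/x2 and X1/X2 are packed into a single register
// here, the sum and difference terms are formed with the register-reversal and 90-degree-rotation
// helpers instead of separate scalar adds and subtracts.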
#[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0x: float32x4_t, value12: float32x4_t, ) -> [float32x4_t; 2] { // This is a Neon translation of the scalar 3-point butterfly let rev12 = reverse_complex_and_negate_hi_f32(value12); let temp12pn = self.rotate.rotate_hi(vaddq_f32(value12, rev12)); let twiddled = vmulq_f32(temp12pn, self.twiddle); let temp = vaddq_f32(value0x, twiddled); let out12 = solo_fft2_f32(temp); let out0x = vaddq_f32(value0x, temp12pn); [out0x, out12] } // length 3 dual fft of a, given as (x0, y0), (x1, y1), (x2, y2). // result is [(X0, Y0), (X1, Y1), (X2, Y2)] #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, value0: float32x4_t, value1: float32x4_t, value2: float32x4_t, ) -> [float32x4_t; 3] { // This is a Neon translation of the scalar 3-point butterfly let x12p = vaddq_f32(value1, value2); let x12n = vsubq_f32(value1, value2); let sum = vaddq_f32(value0, x12p); let temp_a = vmulq_f32(self.twiddle1re, x12p); let temp_a = vaddq_f32(temp_a, value0); let n_rot = self.rotate.rotate_both(x12n); let temp_b = vmulq_f32(self.twiddle1im, n_rot); let x1 = vaddq_f32(temp_a, temp_b); let x2 = vsubq_f32(temp_a, temp_b); [sum, x1, x2] } } // _____ __ _ _ _ _ _ // |___ / / /_ | || | | |__ (_) |_ // |_ \ _____ | '_ \| || |_| '_ \| | __| // ___) | |_____| | (_) |__ _| |_) | | |_ // |____/ \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly3 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: float64x2_t, twiddle1im: float64x2_t, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly3, 3, |this: &NeonF64Butterfly3<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly3, 3, |this: &NeonF64Butterfly3<_>| this .direction); impl NeonF64Butterfly3 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 3, direction); let twiddle1re = unsafe { vmovq_n_f64(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f64(tw1.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let out = self.perform_fft_direct(value0, value1, value2); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); } // length 3 fft of x, given as x0, x1, x2. 
// result is [X0, X1, X2] #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: float64x2_t, value1: float64x2_t, value2: float64x2_t, ) -> [float64x2_t; 3] { // This is a Neon translation of the scalar 3-point butterfly let x12p = vaddq_f64(value1, value2); let x12n = vsubq_f64(value1, value2); let sum = vaddq_f64(value0, x12p); let temp_a = vfmaq_f64(value0, self.twiddle1re, x12p); let n_rot = self.rotate.rotate(x12n); let temp_b = vmulq_f64(self.twiddle1im, n_rot); let x1 = vaddq_f64(temp_a, temp_b); let x2 = vsubq_f64(temp_a, temp_b); [sum, x1, x2] } } // _ _ _________ _ _ _ // | || | |___ /___ \| |__ (_) |_ // | || |_ _____ |_ \ __) | '_ \| | __| // |__ _| |_____| ___) / __/| |_) | | |_ // |_| |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly4 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly4, 4, |this: &NeonF32Butterfly4<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly4, 4, |this: &NeonF32Butterfly4<_>| this .direction); impl NeonF32Butterfly4 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; Self { direction, _phantom: std::marker::PhantomData, rotate, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let value01 = buffer.load_complex(0); let value23 = buffer.load_complex(2); let out = self.perform_fft_direct(value01, value23); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 2); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let value01a = buffer.load_complex(0); let value23a = buffer.load_complex(2); let value01b = buffer.load_complex(4); let value23b = buffer.load_complex(6); let [value0ab, value1ab] = transpose_complex_2x2_f32(value01a, value01b); let [value2ab, value3ab] = transpose_complex_2x2_f32(value23a, value23b); let out = self.perform_parallel_fft_direct(value0ab, value1ab, value2ab, value3ab); let [out0, out1] = transpose_complex_2x2_f32(out[0], out[1]); let [out2, out3] = transpose_complex_2x2_f32(out[2], out[3]); buffer.store_complex(out0, 0); buffer.store_complex(out1, 4); buffer.store_complex(out2, 2); buffer.store_complex(out3, 6); } // length 4 fft of a, given as [x0, x1], [x2, x3] // result is [[X0, X1], [X2, X3]] #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value01: float32x4_t, value23: float32x4_t, ) -> [float32x4_t; 2] { //we're going to hardcode a step of mixed radix //aka we're going to do the six step algorithm // step 1: transpose // and // step 2: column FFTs let mut temp = parallel_fft2_interleaved_f32(value01, value23); // step 3: apply twiddle factors (only one in this case, and it's either 0 + i or 0 - i) temp[1] = self.rotate.rotate_hi(temp[1]); // step 4: transpose, which we're skipping because we're the previous FFTs were non-contiguous // step 5: row FFTs // and // step 6: transpose by swapping index 1 and 2 parallel_fft2_contiguous_f32(temp[0], temp[1]) } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, values0: float32x4_t, values1: float32x4_t, values2: float32x4_t, values3: float32x4_t, ) -> [float32x4_t; 4] { //we're going to hardcode a step of mixed radix //aka we're going to do the six step algorithm // step 1: transpose // and // step 2: column FFTs let temp0 = 
parallel_fft2_interleaved_f32(values0, values2); let mut temp1 = parallel_fft2_interleaved_f32(values1, values3); // step 3: apply twiddle factors (only one in this case, and it's either 0 + i or 0 - i) temp1[1] = self.rotate.rotate_both(temp1[1]); // step 4: transpose, which we're skipping because we're the previous FFTs were non-contiguous // step 5: row FFTs let out0 = parallel_fft2_interleaved_f32(temp0[0], temp1[0]); let out2 = parallel_fft2_interleaved_f32(temp0[1], temp1[1]); // step 6: transpose by swapping index 1 and 2 [out0[0], out2[0], out0[1], out2[1]] } } // _ _ __ _ _ _ _ _ // | || | / /_ | || | | |__ (_) |_ // | || |_ _____ | '_ \| || |_| '_ \| | __| // |__ _| |_____| | (_) |__ _| |_) | | |_ // |_| \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly4 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly4, 4, |this: &NeonF64Butterfly4<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly4, 4, |this: &NeonF64Butterfly4<_>| this .direction); impl NeonF64Butterfly4 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; Self { direction, _phantom: std::marker::PhantomData, rotate, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let value3 = buffer.load_complex(3); let out = self.perform_fft_direct(value0, value1, value2, value3); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); buffer.store_complex(out[3], 3); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: float64x2_t, value1: float64x2_t, value2: float64x2_t, value3: float64x2_t, ) -> [float64x2_t; 4] { //we're going to hardcode a step of mixed radix //aka we're going to do the six step algorithm // step 1: transpose // and // step 2: column FFTs let temp0 = solo_fft2_f64(value0, value2); let mut temp1 = solo_fft2_f64(value1, value3); // step 3: apply twiddle factors (only one in this case, and it's either 0 + i or 0 - i) temp1[1] = self.rotate.rotate(temp1[1]); // step 4: transpose, which we're skipping because we're the previous FFTs were non-contiguous // step 5: row FFTs let out0 = solo_fft2_f64(temp0[0], temp1[0]); let out2 = solo_fft2_f64(temp0[1], temp1[1]); // step 6: transpose by swapping index 1 and 2 [out0[0], out2[0], out0[1], out2[1]] } } // ____ _________ _ _ _ // | ___| |___ /___ \| |__ (_) |_ // |___ \ _____ |_ \ __) | '_ \| | __| // ___) | |_____| ___) / __/| |_) | | |_ // |____/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly5 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle12re: float32x4_t, twiddle21re: float32x4_t, twiddle12im: float32x4_t, twiddle21im: float32x4_t, twiddle1re: float32x4_t, twiddle1im: float32x4_t, twiddle2re: float32x4_t, twiddle2im: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly5, 5, |this: &NeonF32Butterfly5<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly5, 5, |this: &NeonF32Butterfly5<_>| this .direction); impl NeonF32Butterfly5 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 5, 
direction); let tw2: Complex = twiddles::compute_twiddle(2, 5, direction); let twiddle12re = unsafe { vld1q_f32([tw1.re, tw1.re, tw2.re, tw2.re].as_ptr()) }; let twiddle21re = unsafe { vld1q_f32([tw2.re, tw2.re, tw1.re, tw1.re].as_ptr()) }; let twiddle12im = unsafe { vld1q_f32([tw1.im, tw1.im, tw2.im, tw2.im].as_ptr()) }; let twiddle21im = unsafe { vld1q_f32([tw2.im, tw2.im, -tw1.im, -tw1.im].as_ptr()) }; let twiddle1re = unsafe { vmovq_n_f32(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f32(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f32(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f32(tw2.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle12re, twiddle21re, twiddle12im, twiddle21im, twiddle1re, twiddle1im, twiddle2re, twiddle2im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let value00 = buffer.load1_complex(0); let value12 = buffer.load_complex(1); let value34 = buffer.load_complex(3); let out = self.perform_fft_direct(value00, value12, value34); buffer.store_partial_lo_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 3); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4 ,6, 8}); let value0 = extract_lo_hi_f32(input_packed[0], input_packed[2]); let value1 = extract_hi_lo_f32(input_packed[0], input_packed[3]); let value2 = extract_lo_hi_f32(input_packed[1], input_packed[3]); let value3 = extract_hi_lo_f32(input_packed[1], input_packed[4]); let value4 = extract_lo_hi_f32(input_packed[2], input_packed[4]); let out = self.perform_parallel_fft_direct(value0, value1, value2, value3, value4); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_hi_f32(out[4], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4}); } // length 5 fft of a, given as [x0, x0], [x1, x2], [x3, x4]. // result is [[X0, Z], [X1, X2], [X3, X4]] // Note that Z should not be used. #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value00: float32x4_t, value12: float32x4_t, value34: float32x4_t, ) -> [float32x4_t; 3] { // This is a Neon translation of the scalar 5-point butterfly let temp43 = reverse_complex_elements_f32(value34); let x1423p = vaddq_f32(value12, temp43); let x1423n = vsubq_f32(value12, temp43); let x1414p = duplicate_lo_f32(x1423p); let x2323p = duplicate_hi_f32(x1423p); let x1414n = duplicate_lo_f32(x1423n); let x2323n = duplicate_hi_f32(x1423n); let temp_a1 = vmulq_f32(self.twiddle12re, x1414p); let temp_b1 = vmulq_f32(self.twiddle12im, x1414n); let temp_a = vfmaq_f32(temp_a1, self.twiddle21re, x2323p); let temp_a = vaddq_f32(value00, temp_a); let temp_b = vfmaq_f32(temp_b1, self.twiddle21im, x2323n); let b_rot = self.rotate.rotate_both(temp_b); let x00 = vaddq_f32(value00, vaddq_f32(x1414p, x2323p)); let x12 = vaddq_f32(temp_a, b_rot); let x34 = reverse_complex_elements_f32(vsubq_f32(temp_a, b_rot)); [x00, x12, x34] } // length 5 dual fft of x and y, given as (x0, y0), (x1, y1) ... (x4, y4). // result is [(X0, Y0), (X1, Y1) ... 
(X4, Y4)] #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, value0: float32x4_t, value1: float32x4_t, value2: float32x4_t, value3: float32x4_t, value4: float32x4_t, ) -> [float32x4_t; 5] { // This is a Neon translation of the scalar 5-point butterfly let x14p = vaddq_f32(value1, value4); let x14n = vsubq_f32(value1, value4); let x23p = vaddq_f32(value2, value3); let x23n = vsubq_f32(value2, value3); let temp_a1_1 = vmulq_f32(self.twiddle1re, x14p); let temp_a1_2 = vmulq_f32(self.twiddle2re, x23p); let temp_b1_1 = vmulq_f32(self.twiddle1im, x14n); let temp_b1_2 = vmulq_f32(self.twiddle2im, x23n); let temp_a2_1 = vmulq_f32(self.twiddle1re, x23p); let temp_a2_2 = vmulq_f32(self.twiddle2re, x14p); let temp_b2_1 = vmulq_f32(self.twiddle2im, x14n); let temp_b2_2 = vmulq_f32(self.twiddle1im, x23n); let temp_a1 = vaddq_f32(value0, vaddq_f32(temp_a1_1, temp_a1_2)); let temp_b1 = vaddq_f32(temp_b1_1, temp_b1_2); let temp_a2 = vaddq_f32(value0, vaddq_f32(temp_a2_1, temp_a2_2)); let temp_b2 = vsubq_f32(temp_b2_1, temp_b2_2); [ vaddq_f32(value0, vaddq_f32(x14p, x23p)), vaddq_f32(temp_a1, self.rotate.rotate_both(temp_b1)), vaddq_f32(temp_a2, self.rotate.rotate_both(temp_b2)), vsubq_f32(temp_a2, self.rotate.rotate_both(temp_b2)), vsubq_f32(temp_a1, self.rotate.rotate_both(temp_b1)), ] } } // ____ __ _ _ _ _ _ // | ___| / /_ | || | | |__ (_) |_ // |___ \ _____ | '_ \| || |_| '_ \| | __| // ___) | |_____| | (_) |__ _| |_) | | |_ // |____/ \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly5 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: float64x2_t, twiddle1im: float64x2_t, twiddle2re: float64x2_t, twiddle2im: float64x2_t, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly5, 5, |this: &NeonF64Butterfly5<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly5, 5, |this: &NeonF64Butterfly5<_>| this .direction); impl NeonF64Butterfly5 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 5, direction); let tw2: Complex = twiddles::compute_twiddle(2, 5, direction); let twiddle1re = unsafe { vmovq_n_f64(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f64(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f64(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f64(tw2.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let value3 = buffer.load_complex(3); let value4 = buffer.load_complex(4); let out = self.perform_fft_direct(value0, value1, value2, value3, value4); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); buffer.store_complex(out[3], 3); buffer.store_complex(out[4], 4); } // length 5 fft of x, given as x0, x1, x2, x3, x4.
// result is [X0, X1, X2, X3, X4] #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: float64x2_t, value1: float64x2_t, value2: float64x2_t, value3: float64x2_t, value4: float64x2_t, ) -> [float64x2_t; 5] { // This is a Neon translation of the scalar 5-point butterfly let x14p = vaddq_f64(value1, value4); let x14n = vsubq_f64(value1, value4); let x23p = vaddq_f64(value2, value3); let x23n = vsubq_f64(value2, value3); let temp_a1_1 = vmulq_f64(self.twiddle1re, x14p); let temp_a1_2 = vmulq_f64(self.twiddle2re, x23p); let temp_a2_1 = vmulq_f64(self.twiddle2re, x14p); let temp_a2_2 = vmulq_f64(self.twiddle1re, x23p); let temp_b1_1 = vmulq_f64(self.twiddle1im, x14n); let temp_b1_2 = vmulq_f64(self.twiddle2im, x23n); let temp_b2_1 = vmulq_f64(self.twiddle2im, x14n); let temp_b2_2 = vmulq_f64(self.twiddle1im, x23n); let temp_a1 = vaddq_f64(value0, vaddq_f64(temp_a1_1, temp_a1_2)); let temp_a2 = vaddq_f64(value0, vaddq_f64(temp_a2_1, temp_a2_2)); let temp_b1 = vaddq_f64(temp_b1_1, temp_b1_2); let temp_b2 = vsubq_f64(temp_b2_1, temp_b2_2); let temp_b1_rot = self.rotate.rotate(temp_b1); let temp_b2_rot = self.rotate.rotate(temp_b2); [ vaddq_f64(value0, vaddq_f64(x14p, x23p)), vaddq_f64(temp_a1, temp_b1_rot), vaddq_f64(temp_a2, temp_b2_rot), vsubq_f64(temp_a2, temp_b2_rot), vsubq_f64(temp_a1, temp_b1_rot), ] } } // __ _________ _ _ _ // / /_ |___ /___ \| |__ (_) |_ // | '_ \ _____ |_ \ __) | '_ \| | __| // | (_) | |_____| ___) / __/| |_) | | |_ // \___/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly6 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: NeonF32Butterfly3, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly6, 6, |this: &NeonF32Butterfly6<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly6, 6, |this: &NeonF32Butterfly6<_>| this .direction); impl NeonF32Butterfly6 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = NeonF32Butterfly3::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let value01 = buffer.load_complex(0); let value23 = buffer.load_complex(2); let value45 = buffer.load_complex(4); let out = self.perform_fft_direct(value01, value23, value45); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 2); buffer.store_complex(out[2], 4); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10}); let values = interleave_complex_f32!(input_packed, 3, {0, 1, 2}); let out = self.perform_parallel_fft_direct( values[0], values[1], values[2], values[3], values[4], values[5], ); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0, 1, 2, 3, 4, 5}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value01: float32x4_t, value23: float32x4_t, value45: float32x4_t, ) -> [float32x4_t; 3] { // Algorithm: 3x2 good-thomas // Size-3 FFTs down the columns of our reordered array let reord0 = extract_lo_hi_f32(value01, value23); let reord1 = extract_lo_hi_f32(value23, value45); let reord2 = extract_lo_hi_f32(value45, value01); let mid = self.bf3.perform_parallel_fft_direct(reord0, reord1, reord2); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // Transpose 
the data and do size-2 FFTs down the columns let [output0, output1] = parallel_fft2_contiguous_f32(mid[0], mid[1]); let output2 = solo_fft2_f32(mid[2]); // Reorder into output [ extract_lo_hi_f32(output0, output1), extract_lo_lo_f32(output2, output1), extract_hi_hi_f32(output0, output2), ] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, value0: float32x4_t, value1: float32x4_t, value2: float32x4_t, value3: float32x4_t, value4: float32x4_t, value5: float32x4_t, ) -> [float32x4_t; 6] { // Algorithm: 3x2 good-thomas // Size-3 FFTs down the columns of our reordered array let mid0 = self.bf3.perform_parallel_fft_direct(value0, value2, value4); let mid1 = self.bf3.perform_parallel_fft_direct(value3, value5, value1); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = parallel_fft2_interleaved_f32(mid0[0], mid1[0]); let [output2, output3] = parallel_fft2_interleaved_f32(mid0[1], mid1[1]); let [output4, output5] = parallel_fft2_interleaved_f32(mid0[2], mid1[2]); // Reorder into output [output0, output3, output4, output1, output2, output5] } } // __ __ _ _ _ _ _ // / /_ / /_ | || | | |__ (_) |_ // | '_ \ _____ | '_ \| || |_| '_ \| | __| // | (_) | |_____| | (_) |__ _| |_) | | |_ // \___/ \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly6 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: NeonF64Butterfly3, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly6, 6, |this: &NeonF64Butterfly6<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly6, 6, |this: &NeonF64Butterfly6<_>| this .direction); impl NeonF64Butterfly6 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = NeonF64Butterfly3::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let value3 = buffer.load_complex(3); let value4 = buffer.load_complex(4); let value5 = buffer.load_complex(5); let out = self.perform_fft_direct(value0, value1, value2, value3, value4, value5); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); buffer.store_complex(out[3], 3); buffer.store_complex(out[4], 4); buffer.store_complex(out[5], 5); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: float64x2_t, value1: float64x2_t, value2: float64x2_t, value3: float64x2_t, value4: float64x2_t, value5: float64x2_t, ) -> [float64x2_t; 6] { // Algorithm: 3x2 good-thomas // Size-3 FFTs down the columns of our reordered array let mid0 = self.bf3.perform_fft_direct(value0, value2, value4); let mid1 = self.bf3.perform_fft_direct(value3, value5, value1); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = solo_fft2_f64(mid0[0], mid1[0]); let [output2, output3] = solo_fft2_f64(mid0[1], mid1[1]); let [output4, output5] = solo_fft2_f64(mid0[2], mid1[2]); // Reorder into output [output0, output3, output4, output1, output2, output5] } } // ___ _________ _ _ _ // ( _ ) |___ /___ \| |__ (_) |_ // / _ \ _____ |_ \ __) | '_ \| | __| // | (_) | |_____| ___) / __/| 
|_) | | |_ // \___/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly8 { root2: float32x4_t, root2_dual: float32x4_t, direction: FftDirection, bf4: NeonF32Butterfly4, rotate90: Rotate90F32, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly8, 8, |this: &NeonF32Butterfly8<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly8, 8, |this: &NeonF32Butterfly8<_>| this .direction); impl NeonF32Butterfly8 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf4 = NeonF32Butterfly4::new(direction); let root2 = unsafe { vld1q_f32([1.0, 1.0, 0.5f32.sqrt(), 0.5f32.sqrt(), 1.0, 1.0].as_ptr()) }; let root2_dual = unsafe { vmovq_n_f32(0.5f32.sqrt()) }; let rotate90 = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; Self { root2, root2_dual, direction, bf4, rotate90, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6}); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14}); let values = interleave_complex_f32!(input_packed, 4, {0, 1, 2, 3}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7}); } #[inline(always)] unsafe fn perform_fft_direct(&self, values: [float32x4_t; 4]) -> [float32x4_t; 4] { // we're going to hardcode a step of mixed radix // step 1: copy and reorder the input into the scratch let [in02, in13] = transpose_complex_2x2_f32(values[0], values[1]); let [in46, in57] = transpose_complex_2x2_f32(values[2], values[3]); // step 2: column FFTs let val0 = self.bf4.perform_fft_direct(in02, in46); let mut val2 = self.bf4.perform_fft_direct(in13, in57); // step 3: apply twiddle factors let val2b = self.rotate90.rotate_hi(val2[0]); let val2c = vaddq_f32(val2b, val2[0]); let val2d = vmulq_f32(val2c, self.root2); val2[0] = extract_lo_hi_f32(val2[0], val2d); let val3b = self.rotate90.rotate_both(val2[1]); let val3c = vsubq_f32(val3b, val2[1]); let val3d = vmulq_f32(val3c, self.root2); val2[1] = extract_lo_hi_f32(val3b, val3d); // step 4: transpose -- skipped because we're going to do the next FFTs non-contiguously // step 5: row FFTs let out0 = parallel_fft2_interleaved_f32(val0[0], val2[0]); let out1 = parallel_fft2_interleaved_f32(val0[1], val2[1]); // step 6: rearrange and copy to buffer [out0[0], out1[0], out0[1], out1[1]] } #[inline(always)] unsafe fn perform_parallel_fft_direct(&self, values: [float32x4_t; 8]) -> [float32x4_t; 8] { // we're going to hardcode a step of mixed radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let val03 = self .bf4 .perform_parallel_fft_direct(values[0], values[2], values[4], values[6]); let mut val47 = self .bf4 .perform_parallel_fft_direct(values[1], values[3], values[5], values[7]); // step 3: apply twiddle factors let val5b = self.rotate90.rotate_both(val47[1]); let val5c = vaddq_f32(val5b, val47[1]); val47[1] = vmulq_f32(val5c, self.root2_dual); val47[2] = self.rotate90.rotate_both(val47[2]); let val7b = self.rotate90.rotate_both(val47[3]); let val7c = vsubq_f32(val7b, val47[3]); val47[3] = vmulq_f32(val7c, 
self.root2_dual); // step 4: transpose -- skipped because we're going to do the next FFTs non-contiguously // step 5: row FFTs let out0 = parallel_fft2_interleaved_f32(val03[0], val47[0]); let out1 = parallel_fft2_interleaved_f32(val03[1], val47[1]); let out2 = parallel_fft2_interleaved_f32(val03[2], val47[2]); let out3 = parallel_fft2_interleaved_f32(val03[3], val47[3]); // step 6: rearrange and copy to buffer [ out0[0], out1[0], out2[0], out3[0], out0[1], out1[1], out2[1], out3[1], ] } } // ___ __ _ _ _ _ _ // ( _ ) / /_ | || | | |__ (_) |_ // / _ \ _____ | '_ \| || |_| '_ \| | __| // | (_) | |_____| | (_) |__ _| |_) | | |_ // \___/ \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly8 { root2: float64x2_t, direction: FftDirection, bf4: NeonF64Butterfly4, rotate90: Rotate90F64, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly8, 8, |this: &NeonF64Butterfly8<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly8, 8, |this: &NeonF64Butterfly8<_>| this .direction); impl NeonF64Butterfly8 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf4 = NeonF64Butterfly4::new(direction); let root2 = unsafe { vmovq_n_f64(0.5f64.sqrt()) }; let rotate90 = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; Self { root2, direction, bf4, rotate90, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7}); } #[inline(always)] unsafe fn perform_fft_direct(&self, values: [float64x2_t; 8]) -> [float64x2_t; 8] { // we're going to hardcode a step of mixed radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let val03 = self .bf4 .perform_fft_direct(values[0], values[2], values[4], values[6]); let mut val47 = self .bf4 .perform_fft_direct(values[1], values[3], values[5], values[7]); // step 3: apply twiddle factors let val5b = self.rotate90.rotate(val47[1]); let val5c = vaddq_f64(val5b, val47[1]); val47[1] = vmulq_f64(val5c, self.root2); val47[2] = self.rotate90.rotate(val47[2]); let val7b = self.rotate90.rotate(val47[3]); let val7c = vsubq_f64(val7b, val47[3]); val47[3] = vmulq_f64(val7c, self.root2); // step 4: transpose -- skipped because we're going to do the next FFTs non-contiguously // step 5: row FFTs let out0 = solo_fft2_f64(val03[0], val47[0]); let out1 = solo_fft2_f64(val03[1], val47[1]); let out2 = solo_fft2_f64(val03[2], val47[2]); let out3 = solo_fft2_f64(val03[3], val47[3]); // step 6: rearrange and copy to buffer [ out0[0], out1[0], out2[0], out3[0], out0[1], out1[1], out2[1], out3[1], ] } } // ___ _________ _ _ _ // / _ \ |___ /___ \| |__ (_) |_ // | (_) | _____ |_ \ __) | '_ \| | __| // \__, | |_____| ___) / __/| |_) | | |_ // /_/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly9 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: NeonF32Butterfly3, twiddle1: float32x4_t, twiddle2: float32x4_t, twiddle4: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly9, 9, |this: &NeonF32Butterfly9<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly9, 9, |this: &NeonF32Butterfly9<_>| this .direction); impl NeonF32Butterfly9 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = NeonF32Butterfly3::new(direction); let tw1: Complex = 
twiddles::compute_twiddle(1, 9, direction); let tw2: Complex = twiddles::compute_twiddle(2, 9, direction); let tw4: Complex = twiddles::compute_twiddle(4, 9, direction); let twiddle1 = unsafe { vld1q_f32([tw1.re, tw1.im, tw1.re, tw1.im].as_ptr()) }; let twiddle2 = unsafe { vld1q_f32([tw2.re, tw2.im, tw2.re, tw2.im].as_ptr()) }; let twiddle4 = unsafe { vld1q_f32([tw4.re, tw4.im, tw4.re, tw4.im].as_ptr()) }; Self { direction, _phantom: std::marker::PhantomData, bf3, twiddle1, twiddle2, twiddle4, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { // A single Neon 9-point will need a lot of shuffling, let's just reuse the dual one let values = read_partial1_complex_to_array!(buffer, {0,1,2,3,4,5,6,7,8}); let out = self.perform_parallel_fft_direct(values); for n in 0..9 { buffer.store_partial_lo_complex(out[n], n); } } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[4]), extract_hi_lo_f32(input_packed[0], input_packed[5]), extract_lo_hi_f32(input_packed[1], input_packed[5]), extract_hi_lo_f32(input_packed[1], input_packed[6]), extract_lo_hi_f32(input_packed[2], input_packed[6]), extract_hi_lo_f32(input_packed[2], input_packed[7]), extract_lo_hi_f32(input_packed[3], input_packed[7]), extract_hi_lo_f32(input_packed[3], input_packed[8]), extract_lo_hi_f32(input_packed[4], input_packed[8]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_hi_f32(out[8], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0,1,2,3,4,5,6,7,8}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, values: [float32x4_t; 9], ) -> [float32x4_t; 9] { // Algorithm: 3x3 mixed radix // Size-3 FFTs down the columns let mid0 = self .bf3 .perform_parallel_fft_direct(values[0], values[3], values[6]); let mut mid1 = self .bf3 .perform_parallel_fft_direct(values[1], values[4], values[7]); let mut mid2 = self .bf3 .perform_parallel_fft_direct(values[2], values[5], values[8]); // Apply twiddle factors. 
Note that we're re-using twiddle2 mid1[1] = mul_complex_f32(self.twiddle1, mid1[1]); mid1[2] = mul_complex_f32(self.twiddle2, mid1[2]); mid2[1] = mul_complex_f32(self.twiddle2, mid2[1]); mid2[2] = mul_complex_f32(self.twiddle4, mid2[2]); let [output0, output1, output2] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self .bf3 .perform_parallel_fft_direct(mid0[2], mid1[2], mid2[2]); [ output0, output3, output6, output1, output4, output7, output2, output5, output8, ] } } // ___ __ _ _ _ _ _ // / _ \ / /_ | || | | |__ (_) |_ // | (_) | _____ | '_ \| || |_| '_ \| | __| // \__, | |_____| | (_) |__ _| |_) | | |_ // /_/ \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly9 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: NeonF64Butterfly3, twiddle1: float64x2_t, twiddle2: float64x2_t, twiddle4: float64x2_t, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly9, 9, |this: &NeonF64Butterfly9<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly9, 9, |this: &NeonF64Butterfly9<_>| this .direction); impl NeonF64Butterfly9 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = NeonF64Butterfly3::new(direction); let tw1: Complex = twiddles::compute_twiddle(1, 9, direction); let tw2: Complex = twiddles::compute_twiddle(2, 9, direction); let tw4: Complex = twiddles::compute_twiddle(4, 9, direction); let twiddle1 = unsafe { vld1q_f64([tw1.re, tw1.im].as_ptr()) }; let twiddle2 = unsafe { vld1q_f64([tw2.re, tw2.im].as_ptr()) }; let twiddle4 = unsafe { vld1q_f64([tw4.re, tw4.im].as_ptr()) }; Self { direction, _phantom: std::marker::PhantomData, bf3, twiddle1, twiddle2, twiddle4, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 9]) -> [float64x2_t; 9] { // Algorithm: 3x3 mixed radix // Size-3 FFTs down the columns let mid0 = self.bf3.perform_fft_direct(values[0], values[3], values[6]); let mut mid1 = self.bf3.perform_fft_direct(values[1], values[4], values[7]); let mut mid2 = self.bf3.perform_fft_direct(values[2], values[5], values[8]); // Apply twiddle factors. 
Note that we're re-using twiddle2 mid1[1] = mul_complex_f64(self.twiddle1, mid1[1]); mid1[2] = mul_complex_f64(self.twiddle2, mid1[2]); mid2[1] = mul_complex_f64(self.twiddle2, mid2[1]); mid2[2] = mul_complex_f64(self.twiddle4, mid2[2]); let [output0, output1, output2] = self.bf3.perform_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self.bf3.perform_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self.bf3.perform_fft_direct(mid0[2], mid1[2], mid2[2]); [ output0, output3, output6, output1, output4, output7, output2, output5, output8, ] } } // _ ___ _________ _ _ _ // / |/ _ \ |___ /___ \| |__ (_) |_ // | | | | | _____ |_ \ __) | '_ \| | __| // | | |_| | |_____| ___) / __/| |_) | | |_ // |_|\___/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly10 { direction: FftDirection, _phantom: std::marker::PhantomData, bf5: NeonF32Butterfly5, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly10, 10, |this: &NeonF32Butterfly10<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly10, 10, |this: &NeonF32Butterfly10<_>| this .direction); impl NeonF32Butterfly10 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf5 = NeonF32Butterfly5::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf5, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8}); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18}); let values = interleave_complex_f32!(input_packed, 5, {0, 1, 2, 3, 4}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float32x4_t; 5]) -> [float32x4_t; 5] { // Algorithm: 5x2 good-thomas // Reorder and pack let reord0 = extract_lo_hi_f32(values[0], values[2]); let reord1 = extract_lo_hi_f32(values[1], values[3]); let reord2 = extract_lo_hi_f32(values[2], values[4]); let reord3 = extract_lo_hi_f32(values[3], values[0]); let reord4 = extract_lo_hi_f32(values[4], values[1]); // Size-5 FFTs down the columns of our reordered array let mids = self .bf5 .perform_parallel_fft_direct(reord0, reord1, reord2, reord3, reord4); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [temp01, temp23] = parallel_fft2_contiguous_f32(mids[0], mids[1]); let [temp45, temp67] = parallel_fft2_contiguous_f32(mids[2], mids[3]); let temp89 = solo_fft2_f32(mids[4]); // Reorder let out01 = extract_lo_hi_f32(temp01, temp23); let out23 = extract_lo_hi_f32(temp45, temp67); let out45 = extract_lo_lo_f32(temp89, temp23); let out67 = extract_hi_lo_f32(temp01, temp67); let out89 = extract_hi_hi_f32(temp45, temp89); [out01, out23, out45, out67, out89] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, values: [float32x4_t; 10], ) -> [float32x4_t; 10] { // Algorithm: 5x2 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_parallel_fft_direct(values[0], values[2], 
values[4], values[6], values[8]); let mid1 = self .bf5 .perform_parallel_fft_direct(values[5], values[7], values[9], values[1], values[3]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = parallel_fft2_interleaved_f32(mid0[0], mid1[0]); let [output2, output3] = parallel_fft2_interleaved_f32(mid0[1], mid1[1]); let [output4, output5] = parallel_fft2_interleaved_f32(mid0[2], mid1[2]); let [output6, output7] = parallel_fft2_interleaved_f32(mid0[3], mid1[3]); let [output8, output9] = parallel_fft2_interleaved_f32(mid0[4], mid1[4]); // Reorder and return [ output0, output3, output4, output7, output8, output1, output2, output5, output6, output9, ] } } // _ ___ __ _ _ _ _ _ // / |/ _ \ / /_ | || | | |__ (_) |_ // | | | | | _____ | '_ \| || |_| '_ \| | __| // | | |_| | |_____| | (_) |__ _| |_) | | |_ // |_|\___/ \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly10 { direction: FftDirection, _phantom: std::marker::PhantomData, bf2: NeonF64Butterfly2, bf5: NeonF64Butterfly5, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly10, 10, |this: &NeonF64Butterfly10<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly10, 10, |this: &NeonF64Butterfly10<_>| this .direction); impl NeonF64Butterfly10 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf2 = NeonF64Butterfly2::new(direction); let bf5 = NeonF64Butterfly5::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf2, bf5, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 10]) -> [float64x2_t; 10] { // Algorithm: 5x2 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_fft_direct(values[0], values[2], values[4], values[6], values[8]); let mid1 = self .bf5 .perform_fft_direct(values[5], values[7], values[9], values[1], values[3]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = self.bf2.perform_fft_direct(mid0[0], mid1[0]); let [output2, output3] = self.bf2.perform_fft_direct(mid0[1], mid1[1]); let [output4, output5] = self.bf2.perform_fft_direct(mid0[2], mid1[2]); let [output6, output7] = self.bf2.perform_fft_direct(mid0[3], mid1[3]); let [output8, output9] = self.bf2.perform_fft_direct(mid0[4], mid1[4]); // Reorder and return [ output0, output3, output4, output7, output8, output1, output2, output5, output6, output9, ] } } // _ ____ _________ _ _ _ // / |___ \ |___ /___ \| |__ (_) |_ // | | __) | _____ |_ \ __) | '_ \| | __| // | |/ __/ |_____| ___) / __/| |_) | | |_ // |_|_____| |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly12 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: NeonF32Butterfly3, bf4: NeonF32Butterfly4, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly12, 12, |this: &NeonF32Butterfly12<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly12, 12, |this: &NeonF32Butterfly12<_>| this .direction); impl NeonF32Butterfly12 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = 
NeonF32Butterfly3::new(direction); let bf4 = NeonF32Butterfly4::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, bf4, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10 }); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4,5}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22}); let values = interleave_complex_f32!(input_packed, 6, {0, 1, 2, 3, 4, 5}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8, 10}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float32x4_t; 6]) -> [float32x4_t; 6] { // Algorithm: 4x3 good-thomas // Reorder and pack let packed03 = extract_lo_hi_f32(values[0], values[1]); let packed47 = extract_lo_hi_f32(values[2], values[3]); let packed69 = extract_lo_hi_f32(values[3], values[4]); let packed101 = extract_lo_hi_f32(values[5], values[0]); let packed811 = extract_lo_hi_f32(values[4], values[5]); let packed25 = extract_lo_hi_f32(values[1], values[2]); // Size-4 FFTs down the columns of our reordered array let mid0 = self.bf4.perform_fft_direct(packed03, packed69); let mid1 = self.bf4.perform_fft_direct(packed47, packed101); let mid2 = self.bf4.perform_fft_direct(packed811, packed25); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [temp03, temp14, temp25] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [temp69, temp710, temp811] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); // Reorder and return [ extract_lo_hi_f32(temp03, temp14), extract_lo_hi_f32(temp811, temp69), extract_lo_hi_f32(temp14, temp25), extract_lo_hi_f32(temp69, temp710), extract_lo_hi_f32(temp25, temp03), extract_lo_hi_f32(temp710, temp811), ] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, values: [float32x4_t; 12], ) -> [float32x4_t; 12] { // Algorithm: 4x3 good-thomas // Size-4 FFTs down the columns of our reordered array let mid0 = self .bf4 .perform_parallel_fft_direct(values[0], values[3], values[6], values[9]); let mid1 = self .bf4 .perform_parallel_fft_direct(values[4], values[7], values[10], values[1]); let mid2 = self .bf4 .perform_parallel_fft_direct(values[8], values[11], values[2], values[5]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self .bf3 .perform_parallel_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self .bf3 .perform_parallel_fft_direct(mid0[3], mid1[3], mid2[3]); // Reorder and return [ output0, output4, output8, output9, output1, output5, output6, output10, output2, output3, output7, output11, ] } } // _ ____ __ _ _ _ _ _ // / |___ \ / /_ | || | | |__ (_) |_ // | | __) | _____ | '_ \| || |_| '_ \| | __| // | |/ __/ |_____| | (_) |__ _| |_) | 
| |_ // |_|_____| \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly12 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: NeonF64Butterfly3, bf4: NeonF64Butterfly4, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly12, 12, |this: &NeonF64Butterfly12<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly12, 12, |this: &NeonF64Butterfly12<_>| this .direction); impl NeonF64Butterfly12 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = NeonF64Butterfly3::new(direction); let bf4 = NeonF64Butterfly4::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, bf4, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 12]) -> [float64x2_t; 12] { // Algorithm: 4x3 good-thomas // Size-4 FFTs down the columns of our reordered array let mid0 = self .bf4 .perform_fft_direct(values[0], values[3], values[6], values[9]); let mid1 = self .bf4 .perform_fft_direct(values[4], values[7], values[10], values[1]); let mid2 = self .bf4 .perform_fft_direct(values[8], values[11], values[2], values[5]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self.bf3.perform_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self.bf3.perform_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self.bf3.perform_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self.bf3.perform_fft_direct(mid0[3], mid1[3], mid2[3]); [ output0, output4, output8, output9, output1, output5, output6, output10, output2, output3, output7, output11, ] } } // _ ____ _________ _ _ _ // / | ___| |___ /___ \| |__ (_) |_ // | |___ \ _____ |_ \ __) | '_ \| | __| // | |___) | |_____| ___) / __/| |_) | | |_ // |_|____/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly15 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: NeonF32Butterfly3, bf5: NeonF32Butterfly5, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly15, 15, |this: &NeonF32Butterfly15<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly15, 15, |this: &NeonF32Butterfly15<_>| this .direction); impl NeonF32Butterfly15 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = NeonF32Butterfly3::new(direction); let bf5 = NeonF32Butterfly5::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, bf5, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { // A single Neon 15-point will need a lot of shuffling, let's just reuse the dual one let values = read_partial1_complex_to_array!(buffer, {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14}); let out = self.perform_parallel_fft_direct(values); for n in 0..15 { buffer.store_partial_lo_complex(out[n], n); } } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[7]), 
extract_hi_lo_f32(input_packed[0], input_packed[8]), extract_lo_hi_f32(input_packed[1], input_packed[8]), extract_hi_lo_f32(input_packed[1], input_packed[9]), extract_lo_hi_f32(input_packed[2], input_packed[9]), extract_hi_lo_f32(input_packed[2], input_packed[10]), extract_lo_hi_f32(input_packed[3], input_packed[10]), extract_hi_lo_f32(input_packed[3], input_packed[11]), extract_lo_hi_f32(input_packed[4], input_packed[11]), extract_hi_lo_f32(input_packed[4], input_packed[12]), extract_lo_hi_f32(input_packed[5], input_packed[12]), extract_hi_lo_f32(input_packed[5], input_packed[13]), extract_lo_hi_f32(input_packed[6], input_packed[13]), extract_hi_lo_f32(input_packed[6], input_packed[14]), extract_lo_hi_f32(input_packed[7], input_packed[14]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_hi_f32(out[14], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11, 12, 13, 14}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, values: [float32x4_t; 15], ) -> [float32x4_t; 15] { // Algorithm: 5x3 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_parallel_fft_direct(values[0], values[3], values[6], values[9], values[12]); let mid1 = self .bf5 .perform_parallel_fft_direct(values[5], values[8], values[11], values[14], values[2]); let mid2 = self .bf5 .perform_parallel_fft_direct(values[10], values[13], values[1], values[4], values[7]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self .bf3 .perform_parallel_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self .bf3 .perform_parallel_fft_direct(mid0[3], mid1[3], mid2[3]); let [output12, output13, output14] = self .bf3 .perform_parallel_fft_direct(mid0[4], mid1[4], mid2[4]); [ output0, output4, output8, output9, output13, output2, output3, output7, output11, output12, output1, output5, output6, output10, output14, ] } } // _ ____ __ _ _ _ _ _ // / | ___| / /_ | || | | |__ (_) |_ // | |___ \ _____ | '_ \| || |_| '_ \| | __| // | |___) | |_____| | (_) |__ _| |_) | | |_ // |_|____/ \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly15 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: NeonF64Butterfly3, bf5: NeonF64Butterfly5, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly15, 15, |this: &NeonF64Butterfly15<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly15, 15, |this: &NeonF64Butterfly15<_>| this .direction); impl NeonF64Butterfly15 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = NeonF64Butterfly3::new(direction); let bf5 = NeonF64Butterfly5::new(direction); Self { direction, 
_phantom: std::marker::PhantomData, bf3, bf5, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 15]) -> [float64x2_t; 15] { // Algorithm: 5x3 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_fft_direct(values[0], values[3], values[6], values[9], values[12]); let mid1 = self .bf5 .perform_fft_direct(values[5], values[8], values[11], values[14], values[2]); let mid2 = self .bf5 .perform_fft_direct(values[10], values[13], values[1], values[4], values[7]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self.bf3.perform_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self.bf3.perform_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self.bf3.perform_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self.bf3.perform_fft_direct(mid0[3], mid1[3], mid2[3]); let [output12, output13, output14] = self.bf3.perform_fft_direct(mid0[4], mid1[4], mid2[4]); [ output0, output4, output8, output9, output13, output2, output3, output7, output11, output12, output1, output5, output6, output10, output14, ] } } // _ __ _________ _ _ _ // / |/ /_ |___ /___ \| |__ (_) |_ // | | '_ \ _____ |_ \ __) | '_ \| | __| // | | (_) | |_____| ___) / __/| |_) | | |_ // |_|\___/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly16 { direction: FftDirection, bf4: NeonF32Butterfly4, bf8: NeonF32Butterfly8, rotate90: Rotate90F32, twiddle01: float32x4_t, twiddle23: float32x4_t, twiddle01conj: float32x4_t, twiddle23conj: float32x4_t, twiddle1: float32x4_t, twiddle2: float32x4_t, twiddle3: float32x4_t, twiddle1c: float32x4_t, twiddle2c: float32x4_t, twiddle3c: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly16, 16, |this: &NeonF32Butterfly16<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly16, 16, |this: &NeonF32Butterfly16<_>| this .direction); impl NeonF32Butterfly16 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf8 = NeonF32Butterfly8::new(direction); let bf4 = NeonF32Butterfly4::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; let tw1: Complex = twiddles::compute_twiddle(1, 16, direction); let tw2: Complex = twiddles::compute_twiddle(2, 16, direction); let tw3: Complex = twiddles::compute_twiddle(3, 16, direction); let twiddle01 = unsafe { vld1q_f32([1.0, 0.0, tw1.re, tw1.im].as_ptr()) }; let twiddle23 = unsafe { vld1q_f32([tw2.re, tw2.im, tw3.re, tw3.im].as_ptr()) }; let twiddle01conj = unsafe { vld1q_f32([1.0, 0.0, tw1.re, -tw1.im].as_ptr()) }; let twiddle23conj = unsafe { vld1q_f32([tw2.re, -tw2.im, tw3.re, -tw3.im].as_ptr()) }; let twiddle1 = unsafe { vld1q_f32([tw1.re, tw1.im, tw1.re, tw1.im].as_ptr()) }; let twiddle2 = unsafe { vld1q_f32([tw2.re, tw2.im, tw2.re, tw2.im].as_ptr()) }; let twiddle3 = unsafe { vld1q_f32([tw3.re, tw3.im, tw3.re, tw3.im].as_ptr()) }; let twiddle1c = unsafe { vld1q_f32([tw1.re, -tw1.im, tw1.re, -tw1.im].as_ptr()) }; let twiddle2c = 
unsafe { vld1q_f32([tw2.re, -tw2.im, tw2.re, -tw2.im].as_ptr()) }; let twiddle3c = unsafe { vld1q_f32([tw3.re, -tw3.im, tw3.re, -tw3.im].as_ptr()) }; Self { direction, bf4, bf8, rotate90, twiddle01, twiddle23, twiddle01conj, twiddle23conj, twiddle1, twiddle2, twiddle3, twiddle1c, twiddle2c, twiddle3c, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14 }); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4,5,6,7}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30}); let values = interleave_complex_f32!(input_packed, 8, {0, 1, 2, 3 ,4 ,5 ,6 ,7}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8, 10, 12, 14}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11,12,13,14, 15}); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: [float32x4_t; 8]) -> [float32x4_t; 8] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch let in0002 = extract_lo_lo_f32(input[0], input[1]); let in0406 = extract_lo_lo_f32(input[2], input[3]); let in0810 = extract_lo_lo_f32(input[4], input[5]); let in1214 = extract_lo_lo_f32(input[6], input[7]); let in0105 = extract_hi_hi_f32(input[0], input[2]); let in0913 = extract_hi_hi_f32(input[4], input[6]); let in1503 = extract_hi_hi_f32(input[7], input[1]); let in0711 = extract_hi_hi_f32(input[3], input[5]); let in_evens = [in0002, in0406, in0810, in1214]; // step 2: column FFTs let evens = self.bf8.perform_fft_direct(in_evens); let mut odds1 = self.bf4.perform_fft_direct(in0105, in0913); let mut odds3 = self.bf4.perform_fft_direct(in1503, in0711); // step 3: apply twiddle factors odds1[0] = mul_complex_f32(odds1[0], self.twiddle01); odds3[0] = mul_complex_f32(odds3[0], self.twiddle01conj); odds1[1] = mul_complex_f32(odds1[1], self.twiddle23); odds3[1] = mul_complex_f32(odds3[1], self.twiddle23conj); // step 4: cross FFTs let mut temp0 = parallel_fft2_interleaved_f32(odds1[0], odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); //step 5: copy/add/subtract data back to buffer [ vaddq_f32(evens[0], temp0[0]), vaddq_f32(evens[1], temp1[0]), vaddq_f32(evens[2], temp0[1]), vaddq_f32(evens[3], temp1[1]), vsubq_f32(evens[0], temp0[0]), vsubq_f32(evens[1], temp1[0]), vsubq_f32(evens[2], temp0[1]), vsubq_f32(evens[3], temp1[1]), ] } #[inline(always)] unsafe fn perform_parallel_fft_direct(&self, input: [float32x4_t; 16]) -> [float32x4_t; 16] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf8.perform_parallel_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], ]); let mut odds1 = self .bf4 .perform_parallel_fft_direct(input[1], input[5], input[9], input[13]); let mut odds3 = self .bf4 .perform_parallel_fft_direct(input[15], input[3], input[7], input[11]); // step 3: apply twiddle factors odds1[1] = mul_complex_f32(odds1[1], self.twiddle1); 
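// Descriptive note (added for clarity): odds1[0] and odds3[0] are left untouched in this step because their twiddle factor is 1 + 0i. The odds3 branch is multiplied by the conjugated twiddles (twiddle1c..twiddle3c) because its inputs -- input[15], input[3], input[7], input[11] -- are effectively the mirrored, conjugate-pair half of the split-radix decomposition, so its twiddles are the complex conjugates of the ones applied to odds1.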
odds3[1] = mul_complex_f32(odds3[1], self.twiddle1c); odds1[2] = mul_complex_f32(odds1[2], self.twiddle2); odds3[2] = mul_complex_f32(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f32(odds1[3], self.twiddle3); odds3[3] = mul_complex_f32(odds3[3], self.twiddle3c); // step 4: cross FFTs let mut temp0 = parallel_fft2_interleaved_f32(odds1[0], odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); let mut temp2 = parallel_fft2_interleaved_f32(odds1[2], odds3[2]); let mut temp3 = parallel_fft2_interleaved_f32(odds1[3], odds3[3]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); temp2[1] = self.rotate90.rotate_both(temp2[1]); temp3[1] = self.rotate90.rotate_both(temp3[1]); //step 5: copy/add/subtract data back to buffer [ vaddq_f32(evens[0], temp0[0]), vaddq_f32(evens[1], temp1[0]), vaddq_f32(evens[2], temp2[0]), vaddq_f32(evens[3], temp3[0]), vaddq_f32(evens[4], temp0[1]), vaddq_f32(evens[5], temp1[1]), vaddq_f32(evens[6], temp2[1]), vaddq_f32(evens[7], temp3[1]), vsubq_f32(evens[0], temp0[0]), vsubq_f32(evens[1], temp1[0]), vsubq_f32(evens[2], temp2[0]), vsubq_f32(evens[3], temp3[0]), vsubq_f32(evens[4], temp0[1]), vsubq_f32(evens[5], temp1[1]), vsubq_f32(evens[6], temp2[1]), vsubq_f32(evens[7], temp3[1]), ] } } // _ __ __ _ _ _ _ _ // / |/ /_ / /_ | || | | |__ (_) |_ // | | '_ \ _____ | '_ \| || |_| '_ \| | __| // | | (_) | |_____| | (_) |__ _| |_) | | |_ // |_|\___/ \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly16 { direction: FftDirection, bf4: NeonF64Butterfly4, bf8: NeonF64Butterfly8, rotate90: Rotate90F64, twiddle1: float64x2_t, twiddle2: float64x2_t, twiddle3: float64x2_t, twiddle1c: float64x2_t, twiddle2c: float64x2_t, twiddle3c: float64x2_t, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly16, 16, |this: &NeonF64Butterfly16<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly16, 16, |this: &NeonF64Butterfly16<_>| this .direction); impl NeonF64Butterfly16 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf8 = NeonF64Butterfly8::new(direction); let bf4 = NeonF64Butterfly4::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; let twiddle1 = unsafe { vld1q_f64(&twiddles::compute_twiddle::(1, 16, direction) as *const _ as *const f64) }; let twiddle2 = unsafe { vld1q_f64(&twiddles::compute_twiddle::(2, 16, direction) as *const _ as *const f64) }; let twiddle3 = unsafe { vld1q_f64(&twiddles::compute_twiddle::(3, 16, direction) as *const _ as *const f64) }; let twiddle1c = unsafe { vld1q_f64( &twiddles::compute_twiddle::(1, 16, direction).conj() as *const _ as *const f64, ) }; let twiddle2c = unsafe { vld1q_f64( &twiddles::compute_twiddle::(2, 16, direction).conj() as *const _ as *const f64, ) }; let twiddle3c = unsafe { vld1q_f64( &twiddles::compute_twiddle::(3, 16, direction).conj() as *const _ as *const f64, ) }; Self { direction, bf4, bf8, rotate90, twiddle1, twiddle2, twiddle3, twiddle1c, twiddle2c, twiddle3c, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: 
[float64x2_t; 16]) -> [float64x2_t; 16] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf8.perform_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], ]); let mut odds1 = self .bf4 .perform_fft_direct(input[1], input[5], input[9], input[13]); let mut odds3 = self .bf4 .perform_fft_direct(input[15], input[3], input[7], input[11]); // step 3: apply twiddle factors odds1[1] = mul_complex_f64(odds1[1], self.twiddle1); odds3[1] = mul_complex_f64(odds3[1], self.twiddle1c); odds1[2] = mul_complex_f64(odds1[2], self.twiddle2); odds3[2] = mul_complex_f64(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f64(odds1[3], self.twiddle3); odds3[3] = mul_complex_f64(odds3[3], self.twiddle3c); // step 4: cross FFTs let mut temp0 = solo_fft2_f64(odds1[0], odds3[0]); let mut temp1 = solo_fft2_f64(odds1[1], odds3[1]); let mut temp2 = solo_fft2_f64(odds1[2], odds3[2]); let mut temp3 = solo_fft2_f64(odds1[3], odds3[3]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate(temp0[1]); temp1[1] = self.rotate90.rotate(temp1[1]); temp2[1] = self.rotate90.rotate(temp2[1]); temp3[1] = self.rotate90.rotate(temp3[1]); //step 5: copy/add/subtract data back to buffer [ vaddq_f64(evens[0], temp0[0]), vaddq_f64(evens[1], temp1[0]), vaddq_f64(evens[2], temp2[0]), vaddq_f64(evens[3], temp3[0]), vaddq_f64(evens[4], temp0[1]), vaddq_f64(evens[5], temp1[1]), vaddq_f64(evens[6], temp2[1]), vaddq_f64(evens[7], temp3[1]), vsubq_f64(evens[0], temp0[0]), vsubq_f64(evens[1], temp1[0]), vsubq_f64(evens[2], temp2[0]), vsubq_f64(evens[3], temp3[0]), vsubq_f64(evens[4], temp0[1]), vsubq_f64(evens[5], temp1[1]), vsubq_f64(evens[6], temp2[1]), vsubq_f64(evens[7], temp3[1]), ] } } // _________ _________ _ _ _ // |___ /___ \ |___ /___ \| |__ (_) |_ // |_ \ __) | _____ |_ \ __) | '_ \| | __| // ___) / __/ |_____| ___) / __/| |_) | | |_ // |____/_____| |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly32 { direction: FftDirection, bf8: NeonF32Butterfly8, bf16: NeonF32Butterfly16, rotate90: Rotate90F32, twiddle01: float32x4_t, twiddle23: float32x4_t, twiddle45: float32x4_t, twiddle67: float32x4_t, twiddle01conj: float32x4_t, twiddle23conj: float32x4_t, twiddle45conj: float32x4_t, twiddle67conj: float32x4_t, twiddle1: float32x4_t, twiddle2: float32x4_t, twiddle3: float32x4_t, twiddle4: float32x4_t, twiddle5: float32x4_t, twiddle6: float32x4_t, twiddle7: float32x4_t, twiddle1c: float32x4_t, twiddle2c: float32x4_t, twiddle3c: float32x4_t, twiddle4c: float32x4_t, twiddle5c: float32x4_t, twiddle6c: float32x4_t, twiddle7c: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly32, 32, |this: &NeonF32Butterfly32<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly32, 32, |this: &NeonF32Butterfly32<_>| this .direction); impl NeonF32Butterfly32 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf8 = NeonF32Butterfly8::new(direction); let bf16 = NeonF32Butterfly16::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; let tw1: Complex = twiddles::compute_twiddle(1, 32, direction); let tw2: Complex = twiddles::compute_twiddle(2, 32, direction); let tw3: Complex = twiddles::compute_twiddle(3, 32, direction); let tw4: Complex = twiddles::compute_twiddle(4, 32, direction); let tw5: Complex = 
twiddles::compute_twiddle(5, 32, direction); let tw6: Complex = twiddles::compute_twiddle(6, 32, direction); let tw7: Complex = twiddles::compute_twiddle(7, 32, direction); let twiddle01 = unsafe { vld1q_f32([1.0, 0.0, tw1.re, tw1.im].as_ptr()) }; let twiddle23 = unsafe { vld1q_f32([tw2.re, tw2.im, tw3.re, tw3.im].as_ptr()) }; let twiddle45 = unsafe { vld1q_f32([tw4.re, tw4.im, tw5.re, tw5.im].as_ptr()) }; let twiddle67 = unsafe { vld1q_f32([tw6.re, tw6.im, tw7.re, tw7.im].as_ptr()) }; let twiddle01conj = unsafe { vld1q_f32([1.0, 0.0, tw1.re, -tw1.im].as_ptr()) }; let twiddle23conj = unsafe { vld1q_f32([tw2.re, -tw2.im, tw3.re, -tw3.im].as_ptr()) }; let twiddle45conj = unsafe { vld1q_f32([tw4.re, -tw4.im, tw5.re, -tw5.im].as_ptr()) }; let twiddle67conj = unsafe { vld1q_f32([tw6.re, -tw6.im, tw7.re, -tw7.im].as_ptr()) }; let twiddle1 = unsafe { vld1q_f32([tw1.re, tw1.im, tw1.re, tw1.im].as_ptr()) }; let twiddle2 = unsafe { vld1q_f32([tw2.re, tw2.im, tw2.re, tw2.im].as_ptr()) }; let twiddle3 = unsafe { vld1q_f32([tw3.re, tw3.im, tw3.re, tw3.im].as_ptr()) }; let twiddle4 = unsafe { vld1q_f32([tw4.re, tw4.im, tw4.re, tw4.im].as_ptr()) }; let twiddle5 = unsafe { vld1q_f32([tw5.re, tw5.im, tw5.re, tw5.im].as_ptr()) }; let twiddle6 = unsafe { vld1q_f32([tw6.re, tw6.im, tw6.re, tw6.im].as_ptr()) }; let twiddle7 = unsafe { vld1q_f32([tw7.re, tw7.im, tw7.re, tw7.im].as_ptr()) }; let twiddle1c = unsafe { vld1q_f32([tw1.re, -tw1.im, tw1.re, -tw1.im].as_ptr()) }; let twiddle2c = unsafe { vld1q_f32([tw2.re, -tw2.im, tw2.re, -tw2.im].as_ptr()) }; let twiddle3c = unsafe { vld1q_f32([tw3.re, -tw3.im, tw3.re, -tw3.im].as_ptr()) }; let twiddle4c = unsafe { vld1q_f32([tw4.re, -tw4.im, tw4.re, -tw4.im].as_ptr()) }; let twiddle5c = unsafe { vld1q_f32([tw5.re, -tw5.im, tw5.re, -tw5.im].as_ptr()) }; let twiddle6c = unsafe { vld1q_f32([tw6.re, -tw6.im, tw6.re, -tw6.im].as_ptr()) }; let twiddle7c = unsafe { vld1q_f32([tw7.re, -tw7.im, tw7.re, -tw7.im].as_ptr()) }; Self { direction, bf8, bf16, rotate90, twiddle01, twiddle23, twiddle45, twiddle67, twiddle01conj, twiddle23conj, twiddle45conj, twiddle67conj, twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, twiddle6, twiddle7, twiddle1c, twiddle2c, twiddle3c, twiddle4c, twiddle5c, twiddle6c, twiddle7c, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 }); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl NeonArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62}); let values = interleave_complex_f32!(input_packed, 16, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: [float32x4_t; 16]) -> [float32x4_t; 16] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the 
scratch let in0002 = extract_lo_lo_f32(input[0], input[1]); let in0406 = extract_lo_lo_f32(input[2], input[3]); let in0810 = extract_lo_lo_f32(input[4], input[5]); let in1214 = extract_lo_lo_f32(input[6], input[7]); let in1618 = extract_lo_lo_f32(input[8], input[9]); let in2022 = extract_lo_lo_f32(input[10], input[11]); let in2426 = extract_lo_lo_f32(input[12], input[13]); let in2830 = extract_lo_lo_f32(input[14], input[15]); let in0105 = extract_hi_hi_f32(input[0], input[2]); let in0913 = extract_hi_hi_f32(input[4], input[6]); let in1721 = extract_hi_hi_f32(input[8], input[10]); let in2529 = extract_hi_hi_f32(input[12], input[14]); let in3103 = extract_hi_hi_f32(input[15], input[1]); let in0711 = extract_hi_hi_f32(input[3], input[5]); let in1519 = extract_hi_hi_f32(input[7], input[9]); let in2327 = extract_hi_hi_f32(input[11], input[13]); let in_evens = [ in0002, in0406, in0810, in1214, in1618, in2022, in2426, in2830, ]; // step 2: column FFTs let evens = self.bf16.perform_fft_direct(in_evens); let mut odds1 = self .bf8 .perform_fft_direct([in0105, in0913, in1721, in2529]); let mut odds3 = self .bf8 .perform_fft_direct([in3103, in0711, in1519, in2327]); // step 3: apply twiddle factors odds1[0] = mul_complex_f32(odds1[0], self.twiddle01); odds3[0] = mul_complex_f32(odds3[0], self.twiddle01conj); odds1[1] = mul_complex_f32(odds1[1], self.twiddle23); odds3[1] = mul_complex_f32(odds3[1], self.twiddle23conj); odds1[2] = mul_complex_f32(odds1[2], self.twiddle45); odds3[2] = mul_complex_f32(odds3[2], self.twiddle45conj); odds1[3] = mul_complex_f32(odds1[3], self.twiddle67); odds3[3] = mul_complex_f32(odds3[3], self.twiddle67conj); // step 4: cross FFTs let mut temp0 = parallel_fft2_interleaved_f32(odds1[0], odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); let mut temp2 = parallel_fft2_interleaved_f32(odds1[2], odds3[2]); let mut temp3 = parallel_fft2_interleaved_f32(odds1[3], odds3[3]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); temp2[1] = self.rotate90.rotate_both(temp2[1]); temp3[1] = self.rotate90.rotate_both(temp3[1]); //step 5: copy/add/subtract data back to buffer [ vaddq_f32(evens[0], temp0[0]), vaddq_f32(evens[1], temp1[0]), vaddq_f32(evens[2], temp2[0]), vaddq_f32(evens[3], temp3[0]), vaddq_f32(evens[4], temp0[1]), vaddq_f32(evens[5], temp1[1]), vaddq_f32(evens[6], temp2[1]), vaddq_f32(evens[7], temp3[1]), vsubq_f32(evens[0], temp0[0]), vsubq_f32(evens[1], temp1[0]), vsubq_f32(evens[2], temp2[0]), vsubq_f32(evens[3], temp3[0]), vsubq_f32(evens[4], temp0[1]), vsubq_f32(evens[5], temp1[1]), vsubq_f32(evens[6], temp2[1]), vsubq_f32(evens[7], temp3[1]), ] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, input: [float32x4_t; 32], ) -> [float32x4_t; 32] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf16.perform_parallel_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], input[16], input[18], input[20], input[22], input[24], input[26], input[28], input[30], ]); let mut odds1 = self.bf8.perform_parallel_fft_direct([ input[1], input[5], input[9], input[13], input[17], input[21], input[25], input[29], ]); let mut odds3 = self.bf8.perform_parallel_fft_direct([ input[31], input[3], input[7], input[11], input[15], input[19], input[23], input[27], ]); // step 3: apply twiddle 
factors odds1[1] = mul_complex_f32(odds1[1], self.twiddle1); odds3[1] = mul_complex_f32(odds3[1], self.twiddle1c); odds1[2] = mul_complex_f32(odds1[2], self.twiddle2); odds3[2] = mul_complex_f32(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f32(odds1[3], self.twiddle3); odds3[3] = mul_complex_f32(odds3[3], self.twiddle3c); odds1[4] = mul_complex_f32(odds1[4], self.twiddle4); odds3[4] = mul_complex_f32(odds3[4], self.twiddle4c); odds1[5] = mul_complex_f32(odds1[5], self.twiddle5); odds3[5] = mul_complex_f32(odds3[5], self.twiddle5c); odds1[6] = mul_complex_f32(odds1[6], self.twiddle6); odds3[6] = mul_complex_f32(odds3[6], self.twiddle6c); odds1[7] = mul_complex_f32(odds1[7], self.twiddle7); odds3[7] = mul_complex_f32(odds3[7], self.twiddle7c); // step 4: cross FFTs let mut temp0 = parallel_fft2_interleaved_f32(odds1[0], odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); let mut temp2 = parallel_fft2_interleaved_f32(odds1[2], odds3[2]); let mut temp3 = parallel_fft2_interleaved_f32(odds1[3], odds3[3]); let mut temp4 = parallel_fft2_interleaved_f32(odds1[4], odds3[4]); let mut temp5 = parallel_fft2_interleaved_f32(odds1[5], odds3[5]); let mut temp6 = parallel_fft2_interleaved_f32(odds1[6], odds3[6]); let mut temp7 = parallel_fft2_interleaved_f32(odds1[7], odds3[7]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); temp2[1] = self.rotate90.rotate_both(temp2[1]); temp3[1] = self.rotate90.rotate_both(temp3[1]); temp4[1] = self.rotate90.rotate_both(temp4[1]); temp5[1] = self.rotate90.rotate_both(temp5[1]); temp6[1] = self.rotate90.rotate_both(temp6[1]); temp7[1] = self.rotate90.rotate_both(temp7[1]); //step 5: copy/add/subtract data back to buffer [ vaddq_f32(evens[0], temp0[0]), vaddq_f32(evens[1], temp1[0]), vaddq_f32(evens[2], temp2[0]), vaddq_f32(evens[3], temp3[0]), vaddq_f32(evens[4], temp4[0]), vaddq_f32(evens[5], temp5[0]), vaddq_f32(evens[6], temp6[0]), vaddq_f32(evens[7], temp7[0]), vaddq_f32(evens[8], temp0[1]), vaddq_f32(evens[9], temp1[1]), vaddq_f32(evens[10], temp2[1]), vaddq_f32(evens[11], temp3[1]), vaddq_f32(evens[12], temp4[1]), vaddq_f32(evens[13], temp5[1]), vaddq_f32(evens[14], temp6[1]), vaddq_f32(evens[15], temp7[1]), vsubq_f32(evens[0], temp0[0]), vsubq_f32(evens[1], temp1[0]), vsubq_f32(evens[2], temp2[0]), vsubq_f32(evens[3], temp3[0]), vsubq_f32(evens[4], temp4[0]), vsubq_f32(evens[5], temp5[0]), vsubq_f32(evens[6], temp6[0]), vsubq_f32(evens[7], temp7[0]), vsubq_f32(evens[8], temp0[1]), vsubq_f32(evens[9], temp1[1]), vsubq_f32(evens[10], temp2[1]), vsubq_f32(evens[11], temp3[1]), vsubq_f32(evens[12], temp4[1]), vsubq_f32(evens[13], temp5[1]), vsubq_f32(evens[14], temp6[1]), vsubq_f32(evens[15], temp7[1]), ] } } // _________ __ _ _ _ _ _ // |___ /___ \ / /_ | || | | |__ (_) |_ // |_ \ __) | _____ | '_ \| || |_| '_ \| | __| // ___) / __/ |_____| | (_) |__ _| |_) | | |_ // |____/_____| \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly32 { direction: FftDirection, bf8: NeonF64Butterfly8, bf16: NeonF64Butterfly16, rotate90: Rotate90F64, twiddle1: float64x2_t, twiddle2: float64x2_t, twiddle3: float64x2_t, twiddle4: float64x2_t, twiddle5: float64x2_t, twiddle6: float64x2_t, twiddle7: float64x2_t, twiddle1c: float64x2_t, twiddle2c: float64x2_t, twiddle3c: float64x2_t, twiddle4c: float64x2_t, twiddle5c: float64x2_t, twiddle6c: float64x2_t, twiddle7c: float64x2_t, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly32, 32, |this: 
&NeonF64Butterfly32<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly32, 32, |this: &NeonF64Butterfly32<_>| this .direction); impl NeonF64Butterfly32 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf8 = NeonF64Butterfly8::new(direction); let bf16 = NeonF64Butterfly16::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; let twiddle1 = unsafe { vld1q_f64(&twiddles::compute_twiddle::(1, 32, direction) as *const _ as *const f64) }; let twiddle2 = unsafe { vld1q_f64(&twiddles::compute_twiddle::(2, 32, direction) as *const _ as *const f64) }; let twiddle3 = unsafe { vld1q_f64(&twiddles::compute_twiddle::(3, 32, direction) as *const _ as *const f64) }; let twiddle4 = unsafe { vld1q_f64(&twiddles::compute_twiddle::(4, 32, direction) as *const _ as *const f64) }; let twiddle5 = unsafe { vld1q_f64(&twiddles::compute_twiddle::(5, 32, direction) as *const _ as *const f64) }; let twiddle6 = unsafe { vld1q_f64(&twiddles::compute_twiddle::(6, 32, direction) as *const _ as *const f64) }; let twiddle7 = unsafe { vld1q_f64(&twiddles::compute_twiddle::(7, 32, direction) as *const _ as *const f64) }; let twiddle1c = unsafe { vld1q_f64( &twiddles::compute_twiddle::(1, 32, direction).conj() as *const _ as *const f64, ) }; let twiddle2c = unsafe { vld1q_f64( &twiddles::compute_twiddle::(2, 32, direction).conj() as *const _ as *const f64, ) }; let twiddle3c = unsafe { vld1q_f64( &twiddles::compute_twiddle::(3, 32, direction).conj() as *const _ as *const f64, ) }; let twiddle4c = unsafe { vld1q_f64( &twiddles::compute_twiddle::(4, 32, direction).conj() as *const _ as *const f64, ) }; let twiddle5c = unsafe { vld1q_f64( &twiddles::compute_twiddle::(5, 32, direction).conj() as *const _ as *const f64, ) }; let twiddle6c = unsafe { vld1q_f64( &twiddles::compute_twiddle::(6, 32, direction).conj() as *const _ as *const f64, ) }; let twiddle7c = unsafe { vld1q_f64( &twiddles::compute_twiddle::(7, 32, direction).conj() as *const _ as *const f64, ) }; Self { direction, bf8, bf16, rotate90, twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, twiddle6, twiddle7, twiddle1c, twiddle2c, twiddle3c, twiddle4c, twiddle5c, twiddle6c, twiddle7c, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: [float64x2_t; 32]) -> [float64x2_t; 32] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf16.perform_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], input[16], input[18], input[20], input[22], input[24], input[26], input[28], input[30], ]); let mut odds1 = self.bf8.perform_fft_direct([ input[1], input[5], input[9], input[13], input[17], input[21], input[25], input[29], ]); let mut odds3 = self.bf8.perform_fft_direct([ input[31], input[3], input[7], input[11], input[15], input[19], input[23], input[27], ]); // step 3: apply twiddle factors odds1[1] = mul_complex_f64(odds1[1], 
self.twiddle1); odds3[1] = mul_complex_f64(odds3[1], self.twiddle1c); odds1[2] = mul_complex_f64(odds1[2], self.twiddle2); odds3[2] = mul_complex_f64(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f64(odds1[3], self.twiddle3); odds3[3] = mul_complex_f64(odds3[3], self.twiddle3c); odds1[4] = mul_complex_f64(odds1[4], self.twiddle4); odds3[4] = mul_complex_f64(odds3[4], self.twiddle4c); odds1[5] = mul_complex_f64(odds1[5], self.twiddle5); odds3[5] = mul_complex_f64(odds3[5], self.twiddle5c); odds1[6] = mul_complex_f64(odds1[6], self.twiddle6); odds3[6] = mul_complex_f64(odds3[6], self.twiddle6c); odds1[7] = mul_complex_f64(odds1[7], self.twiddle7); odds3[7] = mul_complex_f64(odds3[7], self.twiddle7c); // step 4: cross FFTs let mut temp0 = solo_fft2_f64(odds1[0], odds3[0]); let mut temp1 = solo_fft2_f64(odds1[1], odds3[1]); let mut temp2 = solo_fft2_f64(odds1[2], odds3[2]); let mut temp3 = solo_fft2_f64(odds1[3], odds3[3]); let mut temp4 = solo_fft2_f64(odds1[4], odds3[4]); let mut temp5 = solo_fft2_f64(odds1[5], odds3[5]); let mut temp6 = solo_fft2_f64(odds1[6], odds3[6]); let mut temp7 = solo_fft2_f64(odds1[7], odds3[7]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate(temp0[1]); temp1[1] = self.rotate90.rotate(temp1[1]); temp2[1] = self.rotate90.rotate(temp2[1]); temp3[1] = self.rotate90.rotate(temp3[1]); temp4[1] = self.rotate90.rotate(temp4[1]); temp5[1] = self.rotate90.rotate(temp5[1]); temp6[1] = self.rotate90.rotate(temp6[1]); temp7[1] = self.rotate90.rotate(temp7[1]); //step 5: copy/add/subtract data back to buffer [ vaddq_f64(evens[0], temp0[0]), vaddq_f64(evens[1], temp1[0]), vaddq_f64(evens[2], temp2[0]), vaddq_f64(evens[3], temp3[0]), vaddq_f64(evens[4], temp4[0]), vaddq_f64(evens[5], temp5[0]), vaddq_f64(evens[6], temp6[0]), vaddq_f64(evens[7], temp7[0]), vaddq_f64(evens[8], temp0[1]), vaddq_f64(evens[9], temp1[1]), vaddq_f64(evens[10], temp2[1]), vaddq_f64(evens[11], temp3[1]), vaddq_f64(evens[12], temp4[1]), vaddq_f64(evens[13], temp5[1]), vaddq_f64(evens[14], temp6[1]), vaddq_f64(evens[15], temp7[1]), vsubq_f64(evens[0], temp0[0]), vsubq_f64(evens[1], temp1[0]), vsubq_f64(evens[2], temp2[0]), vsubq_f64(evens[3], temp3[0]), vsubq_f64(evens[4], temp4[0]), vsubq_f64(evens[5], temp5[0]), vsubq_f64(evens[6], temp6[0]), vsubq_f64(evens[7], temp7[0]), vsubq_f64(evens[8], temp0[1]), vsubq_f64(evens[9], temp1[1]), vsubq_f64(evens[10], temp2[1]), vsubq_f64(evens[11], temp3[1]), vsubq_f64(evens[12], temp4[1]), vsubq_f64(evens[13], temp5[1]), vsubq_f64(evens[14], temp6[1]), vsubq_f64(evens[15], temp7[1]), ] } } #[cfg(test)] mod unit_tests { use super::*; use crate::algorithm::Dft; use crate::test_utils::{check_fft_algorithm, compare_vectors}; //the tests for all butterflies will be identical except for the identifiers used and size //so it's ideal for a macro macro_rules! 
test_butterfly_32_func { ($test_name:ident, $struct_name:ident, $size:expr) => { #[test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward); check_fft_algorithm::(&butterfly, $size, FftDirection::Forward); let butterfly_direction = $struct_name::new(FftDirection::Inverse); check_fft_algorithm::(&butterfly_direction, $size, FftDirection::Inverse); } }; } test_butterfly_32_func!(test_neonf32_butterfly1, NeonF32Butterfly1, 1); test_butterfly_32_func!(test_neonf32_butterfly2, NeonF32Butterfly2, 2); test_butterfly_32_func!(test_neonf32_butterfly3, NeonF32Butterfly3, 3); test_butterfly_32_func!(test_neonf32_butterfly4, NeonF32Butterfly4, 4); test_butterfly_32_func!(test_neonf32_butterfly5, NeonF32Butterfly5, 5); test_butterfly_32_func!(test_neonf32_butterfly6, NeonF32Butterfly6, 6); test_butterfly_32_func!(test_neonf32_butterfly8, NeonF32Butterfly8, 8); test_butterfly_32_func!(test_neonf32_butterfly9, NeonF32Butterfly9, 9); test_butterfly_32_func!(test_neonf32_butterfly10, NeonF32Butterfly10, 10); test_butterfly_32_func!(test_neonf32_butterfly12, NeonF32Butterfly12, 12); test_butterfly_32_func!(test_neonf32_butterfly15, NeonF32Butterfly15, 15); test_butterfly_32_func!(test_neonf32_butterfly16, NeonF32Butterfly16, 16); test_butterfly_32_func!(test_neonf32_butterfly32, NeonF32Butterfly32, 32); //the tests for all butterflies will be identical except for the identifiers used and size //so it's ideal for a macro macro_rules! test_butterfly_64_func { ($test_name:ident, $struct_name:ident, $size:expr) => { #[test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward); check_fft_algorithm::(&butterfly, $size, FftDirection::Forward); let butterfly_direction = $struct_name::new(FftDirection::Inverse); check_fft_algorithm::(&butterfly_direction, $size, FftDirection::Inverse); } }; } test_butterfly_64_func!(test_neonf64_butterfly1, NeonF64Butterfly1, 1); test_butterfly_64_func!(test_neonf64_butterfly2, NeonF64Butterfly2, 2); test_butterfly_64_func!(test_neonf64_butterfly3, NeonF64Butterfly3, 3); test_butterfly_64_func!(test_neonf64_butterfly4, NeonF64Butterfly4, 4); test_butterfly_64_func!(test_neonf64_butterfly5, NeonF64Butterfly5, 5); test_butterfly_64_func!(test_neonf64_butterfly6, NeonF64Butterfly6, 6); test_butterfly_64_func!(test_neonf64_butterfly8, NeonF64Butterfly8, 8); test_butterfly_64_func!(test_neonf64_butterfly9, NeonF64Butterfly9, 9); test_butterfly_64_func!(test_neonf64_butterfly10, NeonF64Butterfly10, 10); test_butterfly_64_func!(test_neonf64_butterfly12, NeonF64Butterfly12, 12); test_butterfly_64_func!(test_neonf64_butterfly15, NeonF64Butterfly15, 15); test_butterfly_64_func!(test_neonf64_butterfly16, NeonF64Butterfly16, 16); test_butterfly_64_func!(test_neonf64_butterfly32, NeonF64Butterfly32, 32); #[test] fn test_solo_fft2_32() { unsafe { let val1 = Complex::::new(1.0, 2.5); let val2 = Complex::::new(3.2, 4.2); let mut val = vec![val1, val2]; let in_packed = vld1q_f32(val.as_ptr() as *const f32); let dft = Dft::new(2, FftDirection::Forward); let bf2 = NeonF32Butterfly2::::new(FftDirection::Forward); dft.process(&mut val); let res_packed = bf2.perform_fft_direct(in_packed); let res = std::mem::transmute::; 2]>(res_packed); assert_eq!(val[0], res[0]); assert_eq!(val[1], res[1]); } } #[test] fn test_parallel_fft2_32() { unsafe { let val_a1 = Complex::::new(1.0, 2.5); let val_a2 = Complex::::new(3.2, 4.2); let val_b1 = Complex::::new(6.0, 24.5); let val_b2 = Complex::::new(4.3, 34.2); let mut val_a = vec![val_a1, val_a2]; let mut val_b = 
vec![val_b1, val_b2]; let p1 = vld1q_f32(val_a.as_ptr() as *const f32); let p2 = vld1q_f32(val_b.as_ptr() as *const f32); let dft = Dft::new(2, FftDirection::Forward); let bf2 = NeonF32Butterfly2::::new(FftDirection::Forward); dft.process(&mut val_a); dft.process(&mut val_b); let res_both = bf2.perform_parallel_fft_direct(p1, p2); let res = std::mem::transmute::<[float32x4_t; 2], [Complex; 4]>(res_both); let neon_res_a = [res[0], res[2]]; let neon_res_b = [res[1], res[3]]; assert!(compare_vectors(&val_a, &neon_res_a)); assert!(compare_vectors(&val_b, &neon_res_b)); } } } rustfft-6.2.0/src/neon/neon_common.rs000064400000000000000000000350320072674642500157640ustar 00000000000000use std::any::TypeId; // Calculate the sum of an expression consisting of just plus and minus, like `value = a + b - c + d`. // The expression is rewritten to `value = a + (b - (c - d))` (note the flipped sign on d). // After this the `$add` and `$sub` functions are used to make the calculation. // For f32 using `_mm_add_ps` and `_mm_sub_ps`, the expression `value = a + b - c + d` becomes: // ```let value = _mm_add_ps(a, _mm_sub_ps(b, _mm_sub_ps(c, d)));``` // Only plus and minus are supported, and all the terms must be plain scalar variables. // Using array indices, like `value = temp[0] + temp[1]` is not supported. macro_rules! calc_sum { ($add:ident, $sub:ident, + $acc:tt + $($rest:tt)*)=> { $add($acc, calc_sum!($add, $sub, + $($rest)*)) }; ($add:ident, $sub:ident, + $acc:tt - $($rest:tt)*)=> { $sub($acc, calc_sum!($add, $sub, - $($rest)*)) }; ($add:ident, $sub:ident, - $acc:tt + $($rest:tt)*)=> { $sub($acc, calc_sum!($add, $sub, + $($rest)*)) }; ($add:ident, $sub:ident, - $acc:tt - $($rest:tt)*)=> { $add($acc, calc_sum!($add, $sub, - $($rest)*)) }; ($add:ident, $sub:ident, $acc:tt + $($rest:tt)*)=> { $add($acc, calc_sum!($add, $sub, + $($rest)*)) }; ($add:ident, $sub:ident, $acc:tt - $($rest:tt)*)=> { $sub($acc, calc_sum!($add, $sub, - $($rest)*)) }; ($add:ident, $sub:ident, + $val:tt) => {$val}; ($add:ident, $sub:ident, - $val:tt) => {$val}; } // Calculate the sum of an expression consisting of just plus and minus, like a + b - c + d macro_rules! calc_f32 { ($($tokens:tt)*) => { calc_sum!(vaddq_f32, vsubq_f32, $($tokens)*)}; } // Calculate the sum of an expression consisting of just plus and minus, like a + b - c + d macro_rules! calc_f64 { ($($tokens:tt)*) => { calc_sum!(vaddq_f64, vsubq_f64, $($tokens)*)}; } // Helper function to assert we have the right float type pub fn assert_f32() { let id_f32 = TypeId::of::(); let id_t = TypeId::of::(); assert!(id_t == id_f32, "Wrong float type, must be f32"); } // Helper function to assert we have the right float type pub fn assert_f64() { let id_f64 = TypeId::of::(); let id_t = TypeId::of::(); assert!(id_t == id_f64, "Wrong float type, must be f64"); } // Shuffle elements to interleave two contiguous sets of f32, from an array of simd vectors to a new array of simd vectors macro_rules! 
interleave_complex_f32 { ($input:ident, $offset:literal, { $($idx:literal),* }) => { [ $( extract_lo_lo_f32($input[$idx], $input[$idx+$offset]), extract_hi_hi_f32($input[$idx], $input[$idx+$offset]), )* ] } } // Shuffle elements to interleave two contiguous sets of f32, from an array of simd vectors to a new array of simd vectors // This statement: // ``` // let values = separate_interleaved_complex_f32!(input, {0, 2, 4}); // ``` // is equivalent to: // ``` // let values = [ // extract_lo_lo_f32(input[0], input[1]), // extract_lo_lo_f32(input[2], input[3]), // extract_lo_lo_f32(input[4], input[5]), // extract_hi_hi_f32(input[0], input[1]), // extract_hi_hi_f32(input[2], input[3]), // extract_hi_hi_f32(input[4], input[5]), // ]; macro_rules! separate_interleaved_complex_f32 { ($input:ident, { $($idx:literal),* }) => { [ $( extract_lo_lo_f32($input[$idx], $input[$idx+1]), )* $( extract_hi_hi_f32($input[$idx], $input[$idx+1]), )* ] } } macro_rules! boilerplate_fft_neon_oop { ($struct_name:ident, $len_fn:expr) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if self.len() == 0 { return; } if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let result = unsafe { array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { self.perform_fft_out_of_place(in_chunk, out_chunk, &mut []) }, ) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { if self.len() == 0 { return; } let required_scratch = self.get_inplace_scratch_len(); if scratch.len() < required_scratch || buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = unsafe { array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_out_of_place(chunk, scratch, &mut []); chunk.copy_from_slice(scratch); }) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { self.len() } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len_fn(self) } } impl Direction for 
$struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } /* Not used now, but maybe later for the mixed radixes etc macro_rules! boilerplate_sse_fft { ($struct_name:ident, $len_fn:expr, $inplace_scratch_len_fn:expr, $out_of_place_scratch_len_fn:expr) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ) { if self.len() == 0 { return; } let required_scratch = self.get_outofplace_scratch_len(); if scratch.len() < required_scratch || input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace( self.len(), input.len(), output.len(), self.get_outofplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { self.perform_fft_out_of_place(in_chunk, out_chunk, scratch) }, ); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace( self.len(), input.len(), output.len(), self.get_outofplace_scratch_len(), scratch.len(), ); } } fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { if self.len() == 0 { return; } let required_scratch = self.get_inplace_scratch_len(); if scratch.len() < required_scratch || buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_inplace(chunk, scratch) }); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { $inplace_scratch_len_fn(self) } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { $out_of_place_scratch_len_fn(self) } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len_fn(self) } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } */ #[cfg(test)] mod unit_tests { use core::arch::aarch64::*; #[test] fn test_calc_f32() { unsafe { let a = vld1q_f32([1.0, 1.0, 1.0, 1.0].as_ptr()); let b = vld1q_f32([2.0, 2.0, 2.0, 2.0].as_ptr()); let c = vld1q_f32([3.0, 3.0, 3.0, 3.0].as_ptr()); let d = vld1q_f32([4.0, 4.0, 4.0, 4.0].as_ptr()); let e = vld1q_f32([5.0, 5.0, 5.0, 5.0].as_ptr()); let f = vld1q_f32([6.0, 6.0, 6.0, 6.0].as_ptr()); let g = 
vld1q_f32([7.0, 7.0, 7.0, 7.0].as_ptr()); let h = vld1q_f32([8.0, 8.0, 8.0, 8.0].as_ptr()); let i = vld1q_f32([9.0, 9.0, 9.0, 9.0].as_ptr()); let expected: f32 = 1.0 + 2.0 - 3.0 + 4.0 - 5.0 + 6.0 - 7.0 - 8.0 + 9.0; let res = calc_f32!(a + b - c + d - e + f - g - h + i); let sum = std::mem::transmute::(res); assert_eq!(sum[0], expected); assert_eq!(sum[1], expected); assert_eq!(sum[2], expected); assert_eq!(sum[3], expected); } } #[test] fn test_calc_f64() { unsafe { let a = vld1q_f64([1.0, 1.0].as_ptr()); let b = vld1q_f64([2.0, 2.0].as_ptr()); let c = vld1q_f64([3.0, 3.0].as_ptr()); let d = vld1q_f64([4.0, 4.0].as_ptr()); let e = vld1q_f64([5.0, 5.0].as_ptr()); let f = vld1q_f64([6.0, 6.0].as_ptr()); let g = vld1q_f64([7.0, 7.0].as_ptr()); let h = vld1q_f64([8.0, 8.0].as_ptr()); let i = vld1q_f64([9.0, 9.0].as_ptr()); let expected: f64 = 1.0 + 2.0 - 3.0 + 4.0 - 5.0 + 6.0 - 7.0 - 8.0 + 9.0; let res = calc_f64!(a + b - c + d - e + f - g - h + i); let sum = std::mem::transmute::(res); assert_eq!(sum[0], expected); assert_eq!(sum[1], expected); } } } rustfft-6.2.0/src/neon/neon_planner.rs000064400000000000000000001054250072674642500161370ustar 00000000000000use num_integer::gcd; use std::any::TypeId; use std::collections::HashMap; use std::sync::Arc; use crate::{common::FftNum, fft_cache::FftCache, FftDirection}; use crate::algorithm::*; use crate::neon::neon_butterflies::*; use crate::neon::neon_prime_butterflies::*; use crate::neon::neon_radix4::*; use crate::Fft; use crate::math_utils::{PrimeFactor, PrimeFactors}; const MIN_RADIX4_BITS: u32 = 6; // smallest size to consider radix 4 an option is 2^6 = 64 const MAX_RADER_PRIME_FACTOR: usize = 23; // don't use Raders if the inner fft length has prime factor larger than this const MIN_BLUESTEIN_MIXED_RADIX_LEN: usize = 90; // only use mixed radix for the inner fft of Bluestein if length is larger than this /// A Recipe is a structure that describes the design of a FFT, without actually creating it. /// It is used as a middle step in the planning process. 
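///
/// For example (illustrative only; the exact choice is made by the planning heuristics further
/// down in this file), a length-48 FFT is split into the two closest available butterflies,
/// 6 and 8, and the resulting recipe tree is roughly:
///
/// ~~~text
/// MixedRadixSmall {
///     left_fft: Butterfly6,
///     right_fft: Butterfly8,
/// }
/// ~~~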
#[derive(Debug, PartialEq, Clone)] pub enum Recipe { Dft(usize), MixedRadix { left_fft: Arc, right_fft: Arc, }, #[allow(dead_code)] GoodThomasAlgorithm { left_fft: Arc, right_fft: Arc, }, MixedRadixSmall { left_fft: Arc, right_fft: Arc, }, GoodThomasAlgorithmSmall { left_fft: Arc, right_fft: Arc, }, RadersAlgorithm { inner_fft: Arc, }, BluesteinsAlgorithm { len: usize, inner_fft: Arc, }, Radix4(usize), Butterfly1, Butterfly2, Butterfly3, Butterfly4, Butterfly5, Butterfly6, Butterfly7, Butterfly8, Butterfly9, Butterfly10, Butterfly11, Butterfly12, Butterfly13, Butterfly15, Butterfly16, Butterfly17, Butterfly19, Butterfly23, Butterfly29, Butterfly31, Butterfly32, } impl Recipe { pub fn len(&self) -> usize { match self { Recipe::Dft(length) => *length, Recipe::Radix4(length) => *length, Recipe::Butterfly1 => 1, Recipe::Butterfly2 => 2, Recipe::Butterfly3 => 3, Recipe::Butterfly4 => 4, Recipe::Butterfly5 => 5, Recipe::Butterfly6 => 6, Recipe::Butterfly7 => 7, Recipe::Butterfly8 => 8, Recipe::Butterfly9 => 9, Recipe::Butterfly10 => 10, Recipe::Butterfly11 => 11, Recipe::Butterfly12 => 12, Recipe::Butterfly13 => 13, Recipe::Butterfly15 => 15, Recipe::Butterfly16 => 16, Recipe::Butterfly17 => 17, Recipe::Butterfly19 => 19, Recipe::Butterfly23 => 23, Recipe::Butterfly29 => 29, Recipe::Butterfly31 => 31, Recipe::Butterfly32 => 32, Recipe::MixedRadix { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::GoodThomasAlgorithm { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::MixedRadixSmall { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::RadersAlgorithm { inner_fft } => inner_fft.len() + 1, Recipe::BluesteinsAlgorithm { len, .. } => *len, } } } /// The Neon FFT planner creates new FFT algorithm instances using a mix of scalar and Neon accelerated algorithms. /// It is supported when using the 64-bit AArch64 instruction set. /// /// RustFFT has several FFT algorithms available. For a given FFT size, the `FftPlannerNeon` decides which of the /// available FFT algorithms to use and then initializes them. /// /// ~~~ /// // Perform a forward Fft of size 1234 /// use std::sync::Arc; /// use rustfft::{FftPlannerNeon, num_complex::Complex}; /// /// if let Ok(mut planner) = FftPlannerNeon::new() { /// let fft = planner.plan_fft_forward(1234); /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234]; /// fft.process(&mut buffer); /// /// // The FFT instance returned by the planner has the type `Arc>`, /// // where T is the numeric type, ie f32 or f64, so it's cheap to clone /// let fft_clone = Arc::clone(&fft); /// } /// ~~~ /// /// If you plan on creating multiple FFT instances, it is recommended to re-use the same planner for all of them. This /// is because the planner re-uses internal data across FFT instances wherever possible, saving memory and reducing /// setup time. (FFT instances created with one planner will never re-use data and buffers with FFT instances created /// by a different planner) /// /// Each FFT instance owns [`Arc`s](std::sync::Arc) to its internal data, rather than borrowing it from the planner, so it's perfectly /// safe to drop the planner after creating Fft instances. pub struct FftPlannerNeon { algorithm_cache: FftCache, recipe_cache: HashMap>, } impl FftPlannerNeon { /// Creates a new `FftPlannerNeon` instance. 
    ///
    /// Returns `Ok(planner_instance)` if we're compiling for AArch64 and NEON support was enabled in feature flags.
    /// Returns `Err(())` if NEON support is not available.
    pub fn new() -> Result<Self, ()> {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // Ideally, we would implement the planner with specialization.
            // Specialization won't be on stable rust for a long time though, so in the meantime, we can hack around it.
            //
            // We use TypeID to determine if T is f32, f64, or neither. If neither, we don't want to do any Neon acceleration
            // If it's f32 or f64, then construct and return a Neon planner instance.
            //
            // All Neon accelerated algorithms come in separate versions for f32 and f64. The type is checked when a new one is created, and if it does not
            // match the type the FFT is meant for, it will panic. This will never be a problem if using a planner to construct the FFTs.
            //
            // An annoying snag with this setup is that we frequently have to transmute buffers from &mut [Complex<T>] to &mut [Complex<f32 or f64>] or vice versa.
            // We know this is safe because we assert everywhere that Type(f32 or f64)==Type(T), so it's just a matter of "doing it right" every time.
            // These transmutes are required because the FFT algorithm's input will come through the FFT trait, which may only be bounded by FftNum.
            // So the buffers will have the type &mut [Complex<T>].
            let id_f32 = TypeId::of::<f32>();
            let id_f64 = TypeId::of::<f64>();
            let id_t = TypeId::of::<T>();
            if id_t == id_f32 || id_t == id_f64 {
                return Ok(Self {
                    algorithm_cache: FftCache::new(),
                    recipe_cache: HashMap::new(),
                });
            }
        }
        Err(())
    }

    /// Returns a `Fft` instance which uses Neon instructions to compute FFTs of size `len`.
    ///
    /// If the provided `direction` is `FftDirection::Forward`, the returned instance will compute forward FFTs. If it's `FftDirection::Inverse`, it will compute inverse FFTs.
    ///
    /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time.
    pub fn plan_fft(&mut self, len: usize, direction: FftDirection) -> Arc<dyn Fft<T>> {
        // Step 1: Create a "recipe" for this FFT, which will tell us exactly which combination of algorithms to use
        let recipe = self.design_fft_for_len(len);

        // Step 2: Use our recipe to construct a Fft trait object
        self.build_fft(&recipe, direction)
    }

    /// Returns a `Fft` instance which uses Neon instructions to compute forward FFTs of size `len`.
    ///
    /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time.
    pub fn plan_fft_forward(&mut self, len: usize) -> Arc<dyn Fft<T>> {
        self.plan_fft(len, FftDirection::Forward)
    }

    /// Returns a `Fft` instance which uses Neon instructions to compute inverse FFTs of size `len`.
    ///
    /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time.
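    ///
    /// Note that, as with the rest of RustFFT, the returned instance does not normalize its output:
    /// a forward FFT followed by an inverse FFT of the same length scales every element by that length,
    /// so callers normally divide by `len` themselves. A minimal sketch, guarded the same way as the
    /// struct-level example because Neon support is detected at runtime:
    ///
    /// ~~~
    /// use rustfft::{FftPlannerNeon, num_complex::Complex};
    ///
    /// if let Ok(mut planner) = FftPlannerNeon::new() {
    ///     let inverse_fft = planner.plan_fft_inverse(1234);
    ///
    ///     let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234];
    ///     inverse_fft.process(&mut buffer);
    ///     // RustFFT does not normalize, so scale by 1.0 / 1234.0 afterwards if a normalized inverse is needed.
    /// }
    /// ~~~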
pub fn plan_fft_inverse(&mut self, len: usize) -> Arc> { self.plan_fft(len, FftDirection::Inverse) } // Make a recipe for a length fn design_fft_for_len(&mut self, len: usize) -> Arc { if len < 1 { Arc::new(Recipe::Dft(len)) } else if let Some(recipe) = self.recipe_cache.get(&len) { Arc::clone(&recipe) } else { let factors = PrimeFactors::compute(len); let recipe = self.design_fft_with_factors(len, factors); self.recipe_cache.insert(len, Arc::clone(&recipe)); recipe } } // Create the fft from a recipe, take from cache if possible fn build_fft(&mut self, recipe: &Recipe, direction: FftDirection) -> Arc> { let len = recipe.len(); if let Some(instance) = self.algorithm_cache.get(len, direction) { instance } else { let fft = self.build_new_fft(recipe, direction); self.algorithm_cache.insert(&fft); fft } } // Create a new fft from a recipe fn build_new_fft(&mut self, recipe: &Recipe, direction: FftDirection) -> Arc> { let id_f32 = TypeId::of::(); let id_f64 = TypeId::of::(); let id_t = TypeId::of::(); match recipe { Recipe::Dft(len) => Arc::new(Dft::new(*len, direction)) as Arc>, Recipe::Radix4(len) => { if id_t == id_f32 { Arc::new(Neon32Radix4::new(*len, direction)) as Arc> } else if id_t == id_f64 { Arc::new(Neon64Radix4::new(*len, direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly1 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly1::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly1::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly2 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly2::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly2::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly3 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly3::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly3::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly4 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly4::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly4::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly5 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly5::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly5::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly6 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly6::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly6::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly7 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly7::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly7::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly8 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly8::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly8::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly9 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly9::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly9::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly10 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly10::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly10::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly11 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly11::new(direction)) as Arc> } 
else if id_t == id_f64 { Arc::new(NeonF64Butterfly11::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly12 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly12::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly12::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly13 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly13::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly13::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly15 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly15::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly15::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly16 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly16::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly16::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly17 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly17::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly17::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly19 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly19::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly19::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly23 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly23::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly23::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly29 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly29::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly29::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly31 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly31::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly31::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly32 => { if id_t == id_f32 { Arc::new(NeonF32Butterfly32::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(NeonF64Butterfly32::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::MixedRadix { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(MixedRadix::new(left_fft, right_fft)) as Arc> } Recipe::GoodThomasAlgorithm { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(GoodThomasAlgorithm::new(left_fft, right_fft)) as Arc> } Recipe::MixedRadixSmall { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(MixedRadixSmall::new(left_fft, right_fft)) as Arc> } Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(GoodThomasAlgorithmSmall::new(left_fft, right_fft)) as Arc> } Recipe::RadersAlgorithm { inner_fft } => { let inner_fft = self.build_fft(&inner_fft, direction); Arc::new(RadersAlgorithm::new(inner_fft)) as Arc> } Recipe::BluesteinsAlgorithm { len, inner_fft } => { let inner_fft = self.build_fft(&inner_fft, direction); Arc::new(BluesteinsAlgorithm::new(*len, inner_fft)) as Arc> } } } fn design_fft_with_factors(&mut 
self, len: usize, factors: PrimeFactors) -> Arc { if let Some(fft_instance) = self.design_butterfly_algorithm(len) { fft_instance } else if factors.is_prime() { self.design_prime(len) } else if len.trailing_zeros() >= MIN_RADIX4_BITS { if len.is_power_of_two() { Arc::new(Recipe::Radix4(len)) } else { let non_power_of_two = factors .remove_factors(PrimeFactor { value: 2, count: len.trailing_zeros(), }) .unwrap(); let power_of_two = PrimeFactors::compute(1 << len.trailing_zeros()); self.design_mixed_radix(power_of_two, non_power_of_two) } } else { // Can we do this as a mixed radix with just two butterflies? // Loop through and find all combinations // If more than one is found, keep the one where the factors are closer together. // For example length 20 where 10x2 and 5x4 are possible, we use 5x4. let butterflies: [usize; 20] = [ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 19, 23, 29, 31, 32, ]; let mut bf_left = 0; let mut bf_right = 0; // If the length is below 14, or over 1024 we don't need to try this. if len > 13 && len <= 1024 { for (n, bf_l) in butterflies.iter().enumerate() { if len % bf_l == 0 { let bf_r = len / bf_l; if butterflies.iter().skip(n).any(|&m| m == bf_r) { bf_left = *bf_l; bf_right = bf_r; } } } if bf_left > 0 { let fact_l = PrimeFactors::compute(bf_left); let fact_r = PrimeFactors::compute(bf_right); return self.design_mixed_radix(fact_l, fact_r); } } // Not possible with just butterflies, go with the general solution. let (left_factors, right_factors) = factors.partition_factors(); self.design_mixed_radix(left_factors, right_factors) } } fn design_mixed_radix( &mut self, left_factors: PrimeFactors, right_factors: PrimeFactors, ) -> Arc { let left_len = left_factors.get_product(); let right_len = right_factors.get_product(); //neither size is a butterfly, so go with the normal algorithm let left_fft = self.design_fft_with_factors(left_len, left_factors); let right_fft = self.design_fft_with_factors(right_len, right_factors); //if both left_len and right_len are small, use algorithms optimized for small FFTs if left_len < 33 && right_len < 33 { // for small FFTs, if gcd is 1, good-thomas is faster if gcd(left_len, right_len) == 1 { Arc::new(Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, }) } else { Arc::new(Recipe::MixedRadixSmall { left_fft, right_fft, }) } } else { Arc::new(Recipe::MixedRadix { left_fft, right_fft, }) } } // Returns Some(instance) if we have a butterfly available for this size. 
Returns None if there is no butterfly available for this size fn design_butterfly_algorithm(&mut self, len: usize) -> Option> { match len { 1 => Some(Arc::new(Recipe::Butterfly1)), 2 => Some(Arc::new(Recipe::Butterfly2)), 3 => Some(Arc::new(Recipe::Butterfly3)), 4 => Some(Arc::new(Recipe::Butterfly4)), 5 => Some(Arc::new(Recipe::Butterfly5)), 6 => Some(Arc::new(Recipe::Butterfly6)), 7 => Some(Arc::new(Recipe::Butterfly7)), 8 => Some(Arc::new(Recipe::Butterfly8)), 9 => Some(Arc::new(Recipe::Butterfly9)), 10 => Some(Arc::new(Recipe::Butterfly10)), 11 => Some(Arc::new(Recipe::Butterfly11)), 12 => Some(Arc::new(Recipe::Butterfly12)), 13 => Some(Arc::new(Recipe::Butterfly13)), 15 => Some(Arc::new(Recipe::Butterfly15)), 16 => Some(Arc::new(Recipe::Butterfly16)), 17 => Some(Arc::new(Recipe::Butterfly17)), 19 => Some(Arc::new(Recipe::Butterfly19)), 23 => Some(Arc::new(Recipe::Butterfly23)), 29 => Some(Arc::new(Recipe::Butterfly29)), 31 => Some(Arc::new(Recipe::Butterfly31)), 32 => Some(Arc::new(Recipe::Butterfly32)), _ => None, } } fn design_prime(&mut self, len: usize) -> Arc { let inner_fft_len_rader = len - 1; let raders_factors = PrimeFactors::compute(inner_fft_len_rader); // If any of the prime factors is too large, Rader's gets slow and Bluestein's is the better choice if raders_factors .get_other_factors() .iter() .any(|val| val.value > MAX_RADER_PRIME_FACTOR) { let inner_fft_len_pow2 = (2 * len - 1).checked_next_power_of_two().unwrap(); // for long ffts a mixed radix inner fft is faster than a longer radix4 let min_inner_len = 2 * len - 1; let mixed_radix_len = 3 * inner_fft_len_pow2 / 4; let inner_fft = if mixed_radix_len >= min_inner_len && len >= MIN_BLUESTEIN_MIXED_RADIX_LEN { let mixed_radix_factors = PrimeFactors::compute(mixed_radix_len); self.design_fft_with_factors(mixed_radix_len, mixed_radix_factors) } else { Arc::new(Recipe::Radix4(inner_fft_len_pow2)) }; Arc::new(Recipe::BluesteinsAlgorithm { len, inner_fft }) } else { let inner_fft = self.design_fft_with_factors(inner_fft_len_rader, raders_factors); Arc::new(Recipe::RadersAlgorithm { inner_fft }) } } } #[cfg(test)] mod unit_tests { use super::*; fn is_mixedradix(plan: &Recipe) -> bool { match plan { &Recipe::MixedRadix { .. } => true, _ => false, } } fn is_mixedradixsmall(plan: &Recipe) -> bool { match plan { &Recipe::MixedRadixSmall { .. } => true, _ => false, } } fn is_goodthomassmall(plan: &Recipe) -> bool { match plan { &Recipe::GoodThomasAlgorithmSmall { .. } => true, _ => false, } } fn is_raders(plan: &Recipe) -> bool { match plan { &Recipe::RadersAlgorithm { .. } => true, _ => false, } } fn is_bluesteins(plan: &Recipe) -> bool { match plan { &Recipe::BluesteinsAlgorithm { .. 
} => true, _ => false, } } #[test] fn test_plan_neon_trivial() { // Length 0 and 1 should use Dft let mut planner = FftPlannerNeon::::new().unwrap(); for len in 0..1 { let plan = planner.design_fft_for_len(len); assert_eq!(*plan, Recipe::Dft(len)); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } #[test] fn test_plan_neon_largepoweroftwo() { // Powers of 2 above 6 should use Radix4 let mut planner = FftPlannerNeon::::new().unwrap(); for pow in 6..32 { let len = 1 << pow; let plan = planner.design_fft_for_len(len); assert_eq!(*plan, Recipe::Radix4(len)); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } #[test] fn test_plan_neon_butterflies() { // Check that all butterflies are used let mut planner = FftPlannerNeon::::new().unwrap(); assert_eq!(*planner.design_fft_for_len(2), Recipe::Butterfly2); assert_eq!(*planner.design_fft_for_len(3), Recipe::Butterfly3); assert_eq!(*planner.design_fft_for_len(4), Recipe::Butterfly4); assert_eq!(*planner.design_fft_for_len(5), Recipe::Butterfly5); assert_eq!(*planner.design_fft_for_len(6), Recipe::Butterfly6); assert_eq!(*planner.design_fft_for_len(7), Recipe::Butterfly7); assert_eq!(*planner.design_fft_for_len(8), Recipe::Butterfly8); assert_eq!(*planner.design_fft_for_len(9), Recipe::Butterfly9); assert_eq!(*planner.design_fft_for_len(10), Recipe::Butterfly10); assert_eq!(*planner.design_fft_for_len(11), Recipe::Butterfly11); assert_eq!(*planner.design_fft_for_len(12), Recipe::Butterfly12); assert_eq!(*planner.design_fft_for_len(13), Recipe::Butterfly13); assert_eq!(*planner.design_fft_for_len(15), Recipe::Butterfly15); assert_eq!(*planner.design_fft_for_len(16), Recipe::Butterfly16); assert_eq!(*planner.design_fft_for_len(17), Recipe::Butterfly17); assert_eq!(*planner.design_fft_for_len(19), Recipe::Butterfly19); assert_eq!(*planner.design_fft_for_len(23), Recipe::Butterfly23); assert_eq!(*planner.design_fft_for_len(29), Recipe::Butterfly29); assert_eq!(*planner.design_fft_for_len(31), Recipe::Butterfly31); assert_eq!(*planner.design_fft_for_len(32), Recipe::Butterfly32); } #[test] fn test_plan_neon_mixedradix() { // Products of several different primes should become MixedRadix let mut planner = FftPlannerNeon::::new().unwrap(); for pow2 in 2..5 { for pow3 in 2..5 { for pow5 in 2..5 { for pow7 in 2..5 { let len = 2usize.pow(pow2) * 3usize.pow(pow3) * 5usize.pow(pow5) * 7usize.pow(pow7); let plan = planner.design_fft_for_len(len); assert!(is_mixedradix(&plan), "Expected MixedRadix, got {:?}", plan); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } } } } #[test] fn test_plan_neon_mixedradixsmall() { // Products of two "small" lengths < 31 that have a common divisor >1, and isn't a power of 2 should be MixedRadixSmall let mut planner = FftPlannerNeon::::new().unwrap(); for len in [5 * 20, 5 * 25].iter() { let plan = planner.design_fft_for_len(*len); assert!( is_mixedradixsmall(&plan), "Expected MixedRadixSmall, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[test] fn test_plan_neon_goodthomasbutterfly() { let mut planner = FftPlannerNeon::::new().unwrap(); for len in [3 * 7, 5 * 7, 11 * 13, 2 * 29].iter() { let plan = planner.design_fft_for_len(*len); assert!( is_goodthomassmall(&plan), "Expected GoodThomasAlgorithmSmall, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[test] fn test_plan_neon_bluestein_vs_rader() { let difficultprimes: [usize; 11] = [59, 83, 107, 149, 167, 173, 179, 359, 719, 1439, 2879]; let easyprimes: [usize; 24] = 
[ 53, 61, 67, 71, 73, 79, 89, 97, 101, 103, 109, 113, 127, 131, 137, 139, 151, 157, 163, 181, 191, 193, 197, 199, ]; let mut planner = FftPlannerNeon::::new().unwrap(); for len in difficultprimes.iter() { let plan = planner.design_fft_for_len(*len); assert!( is_bluesteins(&plan), "Expected BluesteinsAlgorithm, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } for len in easyprimes.iter() { let plan = planner.design_fft_for_len(*len); assert!(is_raders(&plan), "Expected RadersAlgorithm, got {:?}", plan); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[test] fn test_neon_fft_cache() { { // Check that FFTs are reused if they're both forward let mut planner = FftPlannerNeon::::new().unwrap(); let fft_a = planner.plan_fft(1234, FftDirection::Forward); let fft_b = planner.plan_fft(1234, FftDirection::Forward); assert!(Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was not reused"); } { // Check that FFTs are reused if they're both inverse let mut planner = FftPlannerNeon::::new().unwrap(); let fft_a = planner.plan_fft(1234, FftDirection::Inverse); let fft_b = planner.plan_fft(1234, FftDirection::Inverse); assert!(Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was not reused"); } { // Check that FFTs are NOT resued if they don't both have the same direction let mut planner = FftPlannerNeon::::new().unwrap(); let fft_a = planner.plan_fft(1234, FftDirection::Forward); let fft_b = planner.plan_fft(1234, FftDirection::Inverse); assert!( !Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was reused, even though directions don't match" ); } } #[test] fn test_neon_recipe_cache() { // Check that all butterflies are used let mut planner = FftPlannerNeon::::new().unwrap(); let fft_a = planner.design_fft_for_len(1234); let fft_b = planner.design_fft_for_len(1234); assert!( Arc::ptr_eq(&fft_a, &fft_b), "Existing recipe was not reused" ); } } rustfft-6.2.0/src/neon/neon_prime_butterflies.rs000064400000000000000000012243760072674642500202340ustar 00000000000000use core::arch::aarch64::*; use num_complex::Complex; use crate::{common::FftNum, FftDirection}; use crate::array_utils; use crate::array_utils::workaround_transmute_mut; use crate::array_utils::DoubleBuf; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::twiddles; use crate::{Direction, Fft, Length}; use super::neon_common::{assert_f32, assert_f64}; use super::neon_utils::*; use super::neon_vector::{NeonArrayMut}; use super::neon_butterflies::{parallel_fft2_interleaved_f32, solo_fft2_f64}; // Auto-generated prime length butterflies // The code here is mostly autogenerated by the python script tools/gen_sse_butterflies.py, and then translated from SSE to Neon. // // The algorithm is derived directly from the definition of the DFT, by eliminating any repeated calculations. // See the comments in src/algorithm/butterflies.rs for a detailed description. // // The script generates the code for performing a single f64 fft, as well as dual f32 fft. // It also generates the code for reading and writing the input and output. // The single 32-bit ffts reuse the dual ffts. 
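// For orientation, the sketch below shows the same structure in plain scalar Rust, using length 7
// as the example. It is illustrative only: the function name and the explicit twiddle construction
// are not part of this crate, and the real butterflies below fold the same arithmetic into Neon
// vector intrinsics, with the Rotate90 helpers playing the role of the multiplication by `i`.
// ```
// use num_complex::Complex;
// use std::f64::consts::PI;
//
// fn butterfly7_scalar_sketch(x: [Complex<f64>; 7]) -> [Complex<f64>; 7] {
//     // Forward twiddles tw[n - 1] = exp(-2*pi*i*n/7) for n = 1..=3.
//     let tw: Vec<Complex<f64>> = (1..=3)
//         .map(|n| Complex::from_polar(1.0, -2.0 * PI * n as f64 / 7.0))
//         .collect();
//     // Pair x[n] with x[7 - n]: the sums feed the "a" terms, the differences feed the "b" terms.
//     let (s1, d1) = (x[1] + x[6], x[1] - x[6]);
//     let (s2, d2) = (x[2] + x[5], x[2] - x[5]);
//     let (s3, d3) = (x[3] + x[4], x[3] - x[4]);
//     // Multiplication by i, i.e. a 90 degree rotation.
//     let rot = |v: Complex<f64>| Complex::new(-v.im, v.re);
//
//     // Real parts of the twiddles weight the sums, imaginary parts weight the differences;
//     // the index pattern is n*k mod 7 folded back into 1..=3, with a sign on the imaginary part.
//     let t_a1 = x[0] + tw[0].re * s1 + tw[1].re * s2 + tw[2].re * s3;
//     let t_a2 = x[0] + tw[1].re * s1 + tw[2].re * s2 + tw[0].re * s3;
//     let t_a3 = x[0] + tw[2].re * s1 + tw[0].re * s2 + tw[1].re * s3;
//     let t_b1 = tw[0].im * d1 + tw[1].im * d2 + tw[2].im * d3;
//     let t_b2 = tw[1].im * d1 - tw[2].im * d2 - tw[0].im * d3;
//     let t_b3 = tw[2].im * d1 - tw[0].im * d2 + tw[1].im * d3;
//
//     [
//         x[0] + s1 + s2 + s3,
//         t_a1 + rot(t_b1),
//         t_a2 + rot(t_b2),
//         t_a3 + rot(t_b3),
//         t_a3 - rot(t_b3),
//         t_a2 - rot(t_b2),
//         t_a1 - rot(t_b1),
//     ]
// }
// ```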
// _____ _________ _ _ _ // |___ | |___ /___ \| |__ (_) |_ // / / _____ |_ \ __) | '_ \| | __| // / / |_____| ___) / __/| |_) | | |_ // /_/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly7 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: float32x4_t, twiddle1im: float32x4_t, twiddle2re: float32x4_t, twiddle2im: float32x4_t, twiddle3re: float32x4_t, twiddle3im: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly7, 7, |this: &NeonF32Butterfly7<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly7, 7, |this: &NeonF32Butterfly7<_>| this .direction); impl NeonF32Butterfly7 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 7, direction); let tw2: Complex = twiddles::compute_twiddle(2, 7, direction); let tw3: Complex = twiddles::compute_twiddle(3, 7, direction); let twiddle1re = unsafe { vmovq_n_f32(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f32(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f32(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f32(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f32(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f32(tw3.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[3]), extract_hi_lo_f32(input_packed[0], input_packed[4]), extract_lo_hi_f32(input_packed[1], input_packed[4]), extract_hi_lo_f32(input_packed[1], input_packed[5]), extract_lo_hi_f32(input_packed[2], input_packed[5]), extract_hi_lo_f32(input_packed[2], input_packed[6]), extract_lo_hi_f32(input_packed[3], input_packed[6]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_hi_f32(out[6], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0,1,2,3,4,5,6}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [float32x4_t; 7]) -> [float32x4_t; 7] { let [x1p6, x1m6] = parallel_fft2_interleaved_f32(values[1], values[6]); let [x2p5, x2m5] = parallel_fft2_interleaved_f32(values[2], values[5]); let [x3p4, x3m4] = parallel_fft2_interleaved_f32(values[3], values[4]); let t_a1_1 = vmulq_f32(self.twiddle1re, x1p6); let t_a1_2 = vmulq_f32(self.twiddle2re, x2p5); let t_a1_3 = vmulq_f32(self.twiddle3re, x3p4); let t_a2_1 = vmulq_f32(self.twiddle2re, x1p6); let t_a2_2 = vmulq_f32(self.twiddle3re, x2p5); let t_a2_3 = vmulq_f32(self.twiddle1re, x3p4); let t_a3_1 = vmulq_f32(self.twiddle3re, x1p6); let t_a3_2 = vmulq_f32(self.twiddle1re, x2p5); let t_a3_3 = vmulq_f32(self.twiddle2re, x3p4); let t_b1_1 = vmulq_f32(self.twiddle1im, x1m6); let t_b1_2 = vmulq_f32(self.twiddle2im, x2m5); let t_b1_3 = 
vmulq_f32(self.twiddle3im, x3m4); let t_b2_1 = vmulq_f32(self.twiddle2im, x1m6); let t_b2_2 = vmulq_f32(self.twiddle3im, x2m5); let t_b2_3 = vmulq_f32(self.twiddle1im, x3m4); let t_b3_1 = vmulq_f32(self.twiddle3im, x1m6); let t_b3_2 = vmulq_f32(self.twiddle1im, x2m5); let t_b3_3 = vmulq_f32(self.twiddle2im, x3m4); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3); let t_b2 = calc_f32!(t_b2_1 - t_b2_2 - t_b2_3); let t_b3 = calc_f32!(t_b3_1 - t_b3_2 + t_b3_3); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let y0 = calc_f32!(x0 + x1p6 + x2p5 + x3p4); let [y1, y6] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y5] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y4] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); [y0, y1, y2, y3, y4, y5, y6] } } // _____ __ _ _ _ _ _ // |___ | / /_ | || | | |__ (_) |_ // / / _____ | '_ \| || |_| '_ \| | __| // / / |_____| | (_) |__ _| |_) | | |_ // /_/ \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly7 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: float64x2_t, twiddle1im: float64x2_t, twiddle2re: float64x2_t, twiddle2im: float64x2_t, twiddle3re: float64x2_t, twiddle3im: float64x2_t, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly7, 7, |this: &NeonF64Butterfly7<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly7, 7, |this: &NeonF64Butterfly7<_>| this .direction); impl NeonF64Butterfly7 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 7, direction); let tw2: Complex = twiddles::compute_twiddle(2, 7, direction); let tw3: Complex = twiddles::compute_twiddle(3, 7, direction); let twiddle1re = unsafe { vmovq_n_f64(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f64(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f64(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f64(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f64(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f64(tw3.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 7]) -> [float64x2_t; 7] { let [x1p6, x1m6] = solo_fft2_f64(values[1], values[6]); let [x2p5, x2m5] = solo_fft2_f64(values[2], values[5]); let [x3p4, x3m4] = solo_fft2_f64(values[3], values[4]); let t_a1_1 = vmulq_f64(self.twiddle1re, x1p6); let t_a1_2 = vmulq_f64(self.twiddle2re, x2p5); let t_a1_3 = vmulq_f64(self.twiddle3re, x3p4); let t_a2_1 = vmulq_f64(self.twiddle2re, x1p6); let t_a2_2 = vmulq_f64(self.twiddle3re, x2p5); let t_a2_3 = vmulq_f64(self.twiddle1re, x3p4); let t_a3_1 = vmulq_f64(self.twiddle3re, x1p6); let t_a3_2 = vmulq_f64(self.twiddle1re, x2p5); let t_a3_3 = vmulq_f64(self.twiddle2re, x3p4); let t_b1_1 = vmulq_f64(self.twiddle1im, x1m6); let t_b1_2 = vmulq_f64(self.twiddle2im, x2m5); let t_b1_3 = 
vmulq_f64(self.twiddle3im, x3m4); let t_b2_1 = vmulq_f64(self.twiddle2im, x1m6); let t_b2_2 = vmulq_f64(self.twiddle3im, x2m5); let t_b2_3 = vmulq_f64(self.twiddle1im, x3m4); let t_b3_1 = vmulq_f64(self.twiddle3im, x1m6); let t_b3_2 = vmulq_f64(self.twiddle1im, x2m5); let t_b3_3 = vmulq_f64(self.twiddle2im, x3m4); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3); let t_b2 = calc_f64!(t_b2_1 - t_b2_2 - t_b2_3); let t_b3 = calc_f64!(t_b3_1 - t_b3_2 + t_b3_3); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let y0 = calc_f64!(x0 + x1p6 + x2p5 + x3p4); let [y1, y6] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y5] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y4] = solo_fft2_f64(t_a3, t_b3_rot); [y0, y1, y2, y3, y4, y5, y6] } } // _ _ _________ _ _ _ // / / | |___ /___ \| |__ (_) |_ // | | | _____ |_ \ __) | '_ \| | __| // | | | |_____| ___) / __/| |_) | | |_ // |_|_| |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly11 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: float32x4_t, twiddle1im: float32x4_t, twiddle2re: float32x4_t, twiddle2im: float32x4_t, twiddle3re: float32x4_t, twiddle3im: float32x4_t, twiddle4re: float32x4_t, twiddle4im: float32x4_t, twiddle5re: float32x4_t, twiddle5im: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly11, 11, |this: &NeonF32Butterfly11<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly11, 11, |this: &NeonF32Butterfly11<_>| this .direction); impl NeonF32Butterfly11 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 11, direction); let tw2: Complex = twiddles::compute_twiddle(2, 11, direction); let tw3: Complex = twiddles::compute_twiddle(3, 11, direction); let tw4: Complex = twiddles::compute_twiddle(4, 11, direction); let tw5: Complex = twiddles::compute_twiddle(5, 11, direction); let twiddle1re = unsafe { vmovq_n_f32(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f32(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f32(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f32(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f32(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f32(tw3.im) }; let twiddle4re = unsafe { vmovq_n_f32(tw4.re) }; let twiddle4im = unsafe { vmovq_n_f32(tw4.im) }; let twiddle5re = unsafe { vmovq_n_f32(tw5.re) }; let twiddle5im = unsafe { vmovq_n_f32(tw5.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[5]), extract_hi_lo_f32(input_packed[0], input_packed[6]), 
extract_lo_hi_f32(input_packed[1], input_packed[6]), extract_hi_lo_f32(input_packed[1], input_packed[7]), extract_lo_hi_f32(input_packed[2], input_packed[7]), extract_hi_lo_f32(input_packed[2], input_packed[8]), extract_lo_hi_f32(input_packed[3], input_packed[8]), extract_hi_lo_f32(input_packed[3], input_packed[9]), extract_lo_hi_f32(input_packed[4], input_packed[9]), extract_hi_lo_f32(input_packed[4], input_packed[10]), extract_lo_hi_f32(input_packed[5], input_packed[10]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_hi_f32(out[10], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [float32x4_t; 11]) -> [float32x4_t; 11] { let [x1p10, x1m10] = parallel_fft2_interleaved_f32(values[1], values[10]); let [x2p9, x2m9] = parallel_fft2_interleaved_f32(values[2], values[9]); let [x3p8, x3m8] = parallel_fft2_interleaved_f32(values[3], values[8]); let [x4p7, x4m7] = parallel_fft2_interleaved_f32(values[4], values[7]); let [x5p6, x5m6] = parallel_fft2_interleaved_f32(values[5], values[6]); let t_a1_1 = vmulq_f32(self.twiddle1re, x1p10); let t_a1_2 = vmulq_f32(self.twiddle2re, x2p9); let t_a1_3 = vmulq_f32(self.twiddle3re, x3p8); let t_a1_4 = vmulq_f32(self.twiddle4re, x4p7); let t_a1_5 = vmulq_f32(self.twiddle5re, x5p6); let t_a2_1 = vmulq_f32(self.twiddle2re, x1p10); let t_a2_2 = vmulq_f32(self.twiddle4re, x2p9); let t_a2_3 = vmulq_f32(self.twiddle5re, x3p8); let t_a2_4 = vmulq_f32(self.twiddle3re, x4p7); let t_a2_5 = vmulq_f32(self.twiddle1re, x5p6); let t_a3_1 = vmulq_f32(self.twiddle3re, x1p10); let t_a3_2 = vmulq_f32(self.twiddle5re, x2p9); let t_a3_3 = vmulq_f32(self.twiddle2re, x3p8); let t_a3_4 = vmulq_f32(self.twiddle1re, x4p7); let t_a3_5 = vmulq_f32(self.twiddle4re, x5p6); let t_a4_1 = vmulq_f32(self.twiddle4re, x1p10); let t_a4_2 = vmulq_f32(self.twiddle3re, x2p9); let t_a4_3 = vmulq_f32(self.twiddle1re, x3p8); let t_a4_4 = vmulq_f32(self.twiddle5re, x4p7); let t_a4_5 = vmulq_f32(self.twiddle2re, x5p6); let t_a5_1 = vmulq_f32(self.twiddle5re, x1p10); let t_a5_2 = vmulq_f32(self.twiddle1re, x2p9); let t_a5_3 = vmulq_f32(self.twiddle4re, x3p8); let t_a5_4 = vmulq_f32(self.twiddle2re, x4p7); let t_a5_5 = vmulq_f32(self.twiddle3re, x5p6); let t_b1_1 = vmulq_f32(self.twiddle1im, x1m10); let t_b1_2 = vmulq_f32(self.twiddle2im, x2m9); let t_b1_3 = vmulq_f32(self.twiddle3im, x3m8); let t_b1_4 = vmulq_f32(self.twiddle4im, x4m7); let t_b1_5 = vmulq_f32(self.twiddle5im, x5m6); let t_b2_1 = vmulq_f32(self.twiddle2im, x1m10); let t_b2_2 = vmulq_f32(self.twiddle4im, x2m9); let t_b2_3 = vmulq_f32(self.twiddle5im, x3m8); let t_b2_4 = vmulq_f32(self.twiddle3im, x4m7); let t_b2_5 = vmulq_f32(self.twiddle1im, x5m6); let t_b3_1 = vmulq_f32(self.twiddle3im, x1m10); let t_b3_2 = vmulq_f32(self.twiddle5im, x2m9); let t_b3_3 = vmulq_f32(self.twiddle2im, x3m8); let t_b3_4 = vmulq_f32(self.twiddle1im, x4m7); let t_b3_5 = vmulq_f32(self.twiddle4im, x5m6); let t_b4_1 = vmulq_f32(self.twiddle4im, x1m10); let t_b4_2 = vmulq_f32(self.twiddle3im, x2m9); let t_b4_3 = vmulq_f32(self.twiddle1im, x3m8); 
let t_b4_4 = vmulq_f32(self.twiddle5im, x4m7); let t_b4_5 = vmulq_f32(self.twiddle2im, x5m6); let t_b5_1 = vmulq_f32(self.twiddle5im, x1m10); let t_b5_2 = vmulq_f32(self.twiddle1im, x2m9); let t_b5_3 = vmulq_f32(self.twiddle4im, x3m8); let t_b5_4 = vmulq_f32(self.twiddle2im, x4m7); let t_b5_5 = vmulq_f32(self.twiddle3im, x5m6); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 - t_b2_3 - t_b2_4 - t_b2_5); let t_b3 = calc_f32!(t_b3_1 - t_b3_2 - t_b3_3 + t_b3_4 + t_b3_5); let t_b4 = calc_f32!(t_b4_1 - t_b4_2 + t_b4_3 + t_b4_4 - t_b4_5); let t_b5 = calc_f32!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 + t_b5_5); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let y0 = calc_f32!(x0 + x1p10 + x2p9 + x3p8 + x4p7 + x5p6); let [y1, y10] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y9] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y8] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y7] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y6] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10] } } // _ _ __ _ _ _ _ _ // / / | / /_ | || | | |__ (_) |_ // | | | _____ | '_ \| || |_| '_ \| | __| // | | | |_____| | (_) |__ _| |_) | | |_ // |_|_| \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly11 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: float64x2_t, twiddle1im: float64x2_t, twiddle2re: float64x2_t, twiddle2im: float64x2_t, twiddle3re: float64x2_t, twiddle3im: float64x2_t, twiddle4re: float64x2_t, twiddle4im: float64x2_t, twiddle5re: float64x2_t, twiddle5im: float64x2_t, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly11, 11, |this: &NeonF64Butterfly11<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly11, 11, |this: &NeonF64Butterfly11<_>| this .direction); impl NeonF64Butterfly11 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 11, direction); let tw2: Complex = twiddles::compute_twiddle(2, 11, direction); let tw3: Complex = twiddles::compute_twiddle(3, 11, direction); let tw4: Complex = twiddles::compute_twiddle(4, 11, direction); let tw5: Complex = twiddles::compute_twiddle(5, 11, direction); let twiddle1re = unsafe { vmovq_n_f64(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f64(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f64(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f64(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f64(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f64(tw3.im) }; let twiddle4re = unsafe { vmovq_n_f64(tw4.re) }; let twiddle4im = unsafe { vmovq_n_f64(tw4.im) }; let twiddle5re = unsafe { vmovq_n_f64(tw5.re) }; let twiddle5im = unsafe { vmovq_n_f64(tw5.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, 
twiddle4im, twiddle5re, twiddle5im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 11]) -> [float64x2_t; 11] { let [x1p10, x1m10] = solo_fft2_f64(values[1], values[10]); let [x2p9, x2m9] = solo_fft2_f64(values[2], values[9]); let [x3p8, x3m8] = solo_fft2_f64(values[3], values[8]); let [x4p7, x4m7] = solo_fft2_f64(values[4], values[7]); let [x5p6, x5m6] = solo_fft2_f64(values[5], values[6]); let t_a1_1 = vmulq_f64(self.twiddle1re, x1p10); let t_a1_2 = vmulq_f64(self.twiddle2re, x2p9); let t_a1_3 = vmulq_f64(self.twiddle3re, x3p8); let t_a1_4 = vmulq_f64(self.twiddle4re, x4p7); let t_a1_5 = vmulq_f64(self.twiddle5re, x5p6); let t_a2_1 = vmulq_f64(self.twiddle2re, x1p10); let t_a2_2 = vmulq_f64(self.twiddle4re, x2p9); let t_a2_3 = vmulq_f64(self.twiddle5re, x3p8); let t_a2_4 = vmulq_f64(self.twiddle3re, x4p7); let t_a2_5 = vmulq_f64(self.twiddle1re, x5p6); let t_a3_1 = vmulq_f64(self.twiddle3re, x1p10); let t_a3_2 = vmulq_f64(self.twiddle5re, x2p9); let t_a3_3 = vmulq_f64(self.twiddle2re, x3p8); let t_a3_4 = vmulq_f64(self.twiddle1re, x4p7); let t_a3_5 = vmulq_f64(self.twiddle4re, x5p6); let t_a4_1 = vmulq_f64(self.twiddle4re, x1p10); let t_a4_2 = vmulq_f64(self.twiddle3re, x2p9); let t_a4_3 = vmulq_f64(self.twiddle1re, x3p8); let t_a4_4 = vmulq_f64(self.twiddle5re, x4p7); let t_a4_5 = vmulq_f64(self.twiddle2re, x5p6); let t_a5_1 = vmulq_f64(self.twiddle5re, x1p10); let t_a5_2 = vmulq_f64(self.twiddle1re, x2p9); let t_a5_3 = vmulq_f64(self.twiddle4re, x3p8); let t_a5_4 = vmulq_f64(self.twiddle2re, x4p7); let t_a5_5 = vmulq_f64(self.twiddle3re, x5p6); let t_b1_1 = vmulq_f64(self.twiddle1im, x1m10); let t_b1_2 = vmulq_f64(self.twiddle2im, x2m9); let t_b1_3 = vmulq_f64(self.twiddle3im, x3m8); let t_b1_4 = vmulq_f64(self.twiddle4im, x4m7); let t_b1_5 = vmulq_f64(self.twiddle5im, x5m6); let t_b2_1 = vmulq_f64(self.twiddle2im, x1m10); let t_b2_2 = vmulq_f64(self.twiddle4im, x2m9); let t_b2_3 = vmulq_f64(self.twiddle5im, x3m8); let t_b2_4 = vmulq_f64(self.twiddle3im, x4m7); let t_b2_5 = vmulq_f64(self.twiddle1im, x5m6); let t_b3_1 = vmulq_f64(self.twiddle3im, x1m10); let t_b3_2 = vmulq_f64(self.twiddle5im, x2m9); let t_b3_3 = vmulq_f64(self.twiddle2im, x3m8); let t_b3_4 = vmulq_f64(self.twiddle1im, x4m7); let t_b3_5 = vmulq_f64(self.twiddle4im, x5m6); let t_b4_1 = vmulq_f64(self.twiddle4im, x1m10); let t_b4_2 = vmulq_f64(self.twiddle3im, x2m9); let t_b4_3 = vmulq_f64(self.twiddle1im, x3m8); let t_b4_4 = vmulq_f64(self.twiddle5im, x4m7); let t_b4_5 = vmulq_f64(self.twiddle2im, x5m6); let t_b5_1 = vmulq_f64(self.twiddle5im, x1m10); let t_b5_2 = vmulq_f64(self.twiddle1im, x2m9); let t_b5_3 = vmulq_f64(self.twiddle4im, x3m8); let t_b5_4 = vmulq_f64(self.twiddle2im, x4m7); let t_b5_5 = vmulq_f64(self.twiddle3im, x5m6); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5); let t_b2 = 
calc_f64!(t_b2_1 + t_b2_2 - t_b2_3 - t_b2_4 - t_b2_5); let t_b3 = calc_f64!(t_b3_1 - t_b3_2 - t_b3_3 + t_b3_4 + t_b3_5); let t_b4 = calc_f64!(t_b4_1 - t_b4_2 + t_b4_3 + t_b4_4 - t_b4_5); let t_b5 = calc_f64!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 + t_b5_5); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let y0 = calc_f64!(x0 + x1p10 + x2p9 + x3p8 + x4p7 + x5p6); let [y1, y10] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y9] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y8] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y7] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y6] = solo_fft2_f64(t_a5, t_b5_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10] } } // _ _____ _________ _ _ _ // / |___ / |___ /___ \| |__ (_) |_ // | | |_ \ _____ |_ \ __) | '_ \| | __| // | |___) | |_____| ___) / __/| |_) | | |_ // |_|____/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly13 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: float32x4_t, twiddle1im: float32x4_t, twiddle2re: float32x4_t, twiddle2im: float32x4_t, twiddle3re: float32x4_t, twiddle3im: float32x4_t, twiddle4re: float32x4_t, twiddle4im: float32x4_t, twiddle5re: float32x4_t, twiddle5im: float32x4_t, twiddle6re: float32x4_t, twiddle6im: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly13, 13, |this: &NeonF32Butterfly13<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly13, 13, |this: &NeonF32Butterfly13<_>| this .direction); impl NeonF32Butterfly13 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 13, direction); let tw2: Complex = twiddles::compute_twiddle(2, 13, direction); let tw3: Complex = twiddles::compute_twiddle(3, 13, direction); let tw4: Complex = twiddles::compute_twiddle(4, 13, direction); let tw5: Complex = twiddles::compute_twiddle(5, 13, direction); let tw6: Complex = twiddles::compute_twiddle(6, 13, direction); let twiddle1re = unsafe { vmovq_n_f32(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f32(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f32(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f32(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f32(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f32(tw3.im) }; let twiddle4re = unsafe { vmovq_n_f32(tw4.re) }; let twiddle4im = unsafe { vmovq_n_f32(tw4.im) }; let twiddle5re = unsafe { vmovq_n_f32(tw5.re) }; let twiddle5im = unsafe { vmovq_n_f32(tw5.im) }; let twiddle6re = unsafe { vmovq_n_f32(tw6.re) }; let twiddle6im = unsafe { vmovq_n_f32(tw6.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24}); let values = [ 
extract_lo_hi_f32(input_packed[0], input_packed[6]), extract_hi_lo_f32(input_packed[0], input_packed[7]), extract_lo_hi_f32(input_packed[1], input_packed[7]), extract_hi_lo_f32(input_packed[1], input_packed[8]), extract_lo_hi_f32(input_packed[2], input_packed[8]), extract_hi_lo_f32(input_packed[2], input_packed[9]), extract_lo_hi_f32(input_packed[3], input_packed[9]), extract_hi_lo_f32(input_packed[3], input_packed[10]), extract_lo_hi_f32(input_packed[4], input_packed[10]), extract_hi_lo_f32(input_packed[4], input_packed[11]), extract_lo_hi_f32(input_packed[5], input_packed[11]), extract_hi_lo_f32(input_packed[5], input_packed[12]), extract_lo_hi_f32(input_packed[6], input_packed[12]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_hi_f32(out[12], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [float32x4_t; 13]) -> [float32x4_t; 13] { let [x1p12, x1m12] = parallel_fft2_interleaved_f32(values[1], values[12]); let [x2p11, x2m11] = parallel_fft2_interleaved_f32(values[2], values[11]); let [x3p10, x3m10] = parallel_fft2_interleaved_f32(values[3], values[10]); let [x4p9, x4m9] = parallel_fft2_interleaved_f32(values[4], values[9]); let [x5p8, x5m8] = parallel_fft2_interleaved_f32(values[5], values[8]); let [x6p7, x6m7] = parallel_fft2_interleaved_f32(values[6], values[7]); let t_a1_1 = vmulq_f32(self.twiddle1re, x1p12); let t_a1_2 = vmulq_f32(self.twiddle2re, x2p11); let t_a1_3 = vmulq_f32(self.twiddle3re, x3p10); let t_a1_4 = vmulq_f32(self.twiddle4re, x4p9); let t_a1_5 = vmulq_f32(self.twiddle5re, x5p8); let t_a1_6 = vmulq_f32(self.twiddle6re, x6p7); let t_a2_1 = vmulq_f32(self.twiddle2re, x1p12); let t_a2_2 = vmulq_f32(self.twiddle4re, x2p11); let t_a2_3 = vmulq_f32(self.twiddle6re, x3p10); let t_a2_4 = vmulq_f32(self.twiddle5re, x4p9); let t_a2_5 = vmulq_f32(self.twiddle3re, x5p8); let t_a2_6 = vmulq_f32(self.twiddle1re, x6p7); let t_a3_1 = vmulq_f32(self.twiddle3re, x1p12); let t_a3_2 = vmulq_f32(self.twiddle6re, x2p11); let t_a3_3 = vmulq_f32(self.twiddle4re, x3p10); let t_a3_4 = vmulq_f32(self.twiddle1re, x4p9); let t_a3_5 = vmulq_f32(self.twiddle2re, x5p8); let t_a3_6 = vmulq_f32(self.twiddle5re, x6p7); let t_a4_1 = vmulq_f32(self.twiddle4re, x1p12); let t_a4_2 = vmulq_f32(self.twiddle5re, x2p11); let t_a4_3 = vmulq_f32(self.twiddle1re, x3p10); let t_a4_4 = vmulq_f32(self.twiddle3re, x4p9); let t_a4_5 = vmulq_f32(self.twiddle6re, x5p8); let t_a4_6 = vmulq_f32(self.twiddle2re, x6p7); let t_a5_1 = vmulq_f32(self.twiddle5re, x1p12); let t_a5_2 = vmulq_f32(self.twiddle3re, x2p11); let t_a5_3 = vmulq_f32(self.twiddle2re, x3p10); let t_a5_4 = vmulq_f32(self.twiddle6re, x4p9); let t_a5_5 = vmulq_f32(self.twiddle1re, x5p8); let t_a5_6 = vmulq_f32(self.twiddle4re, x6p7); let t_a6_1 = vmulq_f32(self.twiddle6re, x1p12); let t_a6_2 = vmulq_f32(self.twiddle1re, x2p11); let t_a6_3 = vmulq_f32(self.twiddle5re, x3p10); let t_a6_4 = vmulq_f32(self.twiddle2re, x4p9); let t_a6_5 = 
vmulq_f32(self.twiddle4re, x5p8); let t_a6_6 = vmulq_f32(self.twiddle3re, x6p7); let t_b1_1 = vmulq_f32(self.twiddle1im, x1m12); let t_b1_2 = vmulq_f32(self.twiddle2im, x2m11); let t_b1_3 = vmulq_f32(self.twiddle3im, x3m10); let t_b1_4 = vmulq_f32(self.twiddle4im, x4m9); let t_b1_5 = vmulq_f32(self.twiddle5im, x5m8); let t_b1_6 = vmulq_f32(self.twiddle6im, x6m7); let t_b2_1 = vmulq_f32(self.twiddle2im, x1m12); let t_b2_2 = vmulq_f32(self.twiddle4im, x2m11); let t_b2_3 = vmulq_f32(self.twiddle6im, x3m10); let t_b2_4 = vmulq_f32(self.twiddle5im, x4m9); let t_b2_5 = vmulq_f32(self.twiddle3im, x5m8); let t_b2_6 = vmulq_f32(self.twiddle1im, x6m7); let t_b3_1 = vmulq_f32(self.twiddle3im, x1m12); let t_b3_2 = vmulq_f32(self.twiddle6im, x2m11); let t_b3_3 = vmulq_f32(self.twiddle4im, x3m10); let t_b3_4 = vmulq_f32(self.twiddle1im, x4m9); let t_b3_5 = vmulq_f32(self.twiddle2im, x5m8); let t_b3_6 = vmulq_f32(self.twiddle5im, x6m7); let t_b4_1 = vmulq_f32(self.twiddle4im, x1m12); let t_b4_2 = vmulq_f32(self.twiddle5im, x2m11); let t_b4_3 = vmulq_f32(self.twiddle1im, x3m10); let t_b4_4 = vmulq_f32(self.twiddle3im, x4m9); let t_b4_5 = vmulq_f32(self.twiddle6im, x5m8); let t_b4_6 = vmulq_f32(self.twiddle2im, x6m7); let t_b5_1 = vmulq_f32(self.twiddle5im, x1m12); let t_b5_2 = vmulq_f32(self.twiddle3im, x2m11); let t_b5_3 = vmulq_f32(self.twiddle2im, x3m10); let t_b5_4 = vmulq_f32(self.twiddle6im, x4m9); let t_b5_5 = vmulq_f32(self.twiddle1im, x5m8); let t_b5_6 = vmulq_f32(self.twiddle4im, x6m7); let t_b6_1 = vmulq_f32(self.twiddle6im, x1m12); let t_b6_2 = vmulq_f32(self.twiddle1im, x2m11); let t_b6_3 = vmulq_f32(self.twiddle5im, x3m10); let t_b6_4 = vmulq_f32(self.twiddle2im, x4m9); let t_b6_5 = vmulq_f32(self.twiddle4im, x5m8); let t_b6_6 = vmulq_f32(self.twiddle3im, x6m7); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 - t_b2_4 - t_b2_5 - t_b2_6); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 + t_b3_5 + t_b3_6); let t_b4 = calc_f32!(t_b4_1 - t_b4_2 - t_b4_3 + t_b4_4 - t_b4_5 - t_b4_6); let t_b5 = calc_f32!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6); let t_b6 = calc_f32!(t_b6_1 - t_b6_2 + t_b6_3 - t_b6_4 + t_b6_5 - t_b6_6); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let y0 = calc_f32!(x0 + x1p12 + x2p11 + x3p10 + x4p9 + x5p8 + x6p7); let [y1, y12] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y11] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y10] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y9] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y8] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y7] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12] } } // _ _____ __ _ _ _ _ _ // / |___ 
// / / /_ | || | | |__ (_) |_
// | | |_ \ _____ | '_ \| || |_| '_ \| | __|
// | |___) | |_____| | (_) |__ _| |_) | | |_
// |_|____/ \___/ |_| |_.__/|_|\__|
//
pub struct NeonF64Butterfly13<T> {
    direction: FftDirection,
    _phantom: std::marker::PhantomData<T>,
    rotate: Rotate90F64,
    twiddle1re: float64x2_t,
    twiddle1im: float64x2_t,
    twiddle2re: float64x2_t,
    twiddle2im: float64x2_t,
    twiddle3re: float64x2_t,
    twiddle3im: float64x2_t,
    twiddle4re: float64x2_t,
    twiddle4im: float64x2_t,
    twiddle5re: float64x2_t,
    twiddle5im: float64x2_t,
    twiddle6re: float64x2_t,
    twiddle6im: float64x2_t,
}
boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly13, 13, |this: &NeonF64Butterfly13<_>| this
    .direction);
boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly13, 13, |this: &NeonF64Butterfly13<_>| this
    .direction);
impl<T: FftNum> NeonF64Butterfly13<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        assert_f64::<T>();
        let rotate = Rotate90F64::new(true);
        let tw1: Complex<f64> = twiddles::compute_twiddle(1, 13, direction);
        let tw2: Complex<f64> = twiddles::compute_twiddle(2, 13, direction);
        let tw3: Complex<f64> = twiddles::compute_twiddle(3, 13, direction);
        let tw4: Complex<f64> = twiddles::compute_twiddle(4, 13, direction);
        let tw5: Complex<f64> = twiddles::compute_twiddle(5, 13, direction);
        let tw6: Complex<f64> = twiddles::compute_twiddle(6, 13, direction);
        let twiddle1re = unsafe { vmovq_n_f64(tw1.re) };
        let twiddle1im = unsafe { vmovq_n_f64(tw1.im) };
        let twiddle2re = unsafe { vmovq_n_f64(tw2.re) };
        let twiddle2im = unsafe { vmovq_n_f64(tw2.im) };
        let twiddle3re = unsafe { vmovq_n_f64(tw3.re) };
        let twiddle3im = unsafe { vmovq_n_f64(tw3.im) };
        let twiddle4re = unsafe { vmovq_n_f64(tw4.re) };
        let twiddle4im = unsafe { vmovq_n_f64(tw4.im) };
        let twiddle5re = unsafe { vmovq_n_f64(tw5.re) };
        let twiddle5im = unsafe { vmovq_n_f64(tw5.im) };
        let twiddle6re = unsafe { vmovq_n_f64(tw6.re) };
        let twiddle6im = unsafe { vmovq_n_f64(tw6.im) };
        Self {
            direction,
            _phantom: std::marker::PhantomData,
            rotate,
            twiddle1re,
            twiddle1im,
            twiddle2re,
            twiddle2im,
            twiddle3re,
            twiddle3im,
            twiddle4re,
            twiddle4im,
            twiddle5re,
            twiddle5im,
            twiddle6re,
            twiddle6im,
        }
    }

    #[inline(always)]
    pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut<f64>) {
        let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12});
        let out = self.perform_fft_direct(values);
        write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12});
    }

    #[inline(always)]
    pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 13]) -> [float64x2_t; 13] {
        let [x1p12, x1m12] = solo_fft2_f64(values[1], values[12]);
        let [x2p11, x2m11] = solo_fft2_f64(values[2], values[11]);
        let [x3p10, x3m10] = solo_fft2_f64(values[3], values[10]);
        let [x4p9, x4m9] = solo_fft2_f64(values[4], values[9]);
        let [x5p8, x5m8] = solo_fft2_f64(values[5], values[8]);
        let [x6p7, x6m7] = solo_fft2_f64(values[6], values[7]);
        let t_a1_1 = vmulq_f64(self.twiddle1re, x1p12);
        let t_a1_2 = vmulq_f64(self.twiddle2re, x2p11);
        let t_a1_3 = vmulq_f64(self.twiddle3re, x3p10);
        let t_a1_4 = vmulq_f64(self.twiddle4re, x4p9);
        let t_a1_5 = vmulq_f64(self.twiddle5re, x5p8);
        let t_a1_6 = vmulq_f64(self.twiddle6re, x6p7);
        let t_a2_1 = vmulq_f64(self.twiddle2re, x1p12);
        let t_a2_2 = vmulq_f64(self.twiddle4re, x2p11);
        let t_a2_3 = vmulq_f64(self.twiddle6re, x3p10);
        let t_a2_4 = vmulq_f64(self.twiddle5re, x4p9);
        let t_a2_5 = vmulq_f64(self.twiddle3re, x5p8);
        let t_a2_6 = vmulq_f64(self.twiddle1re, x6p7);
        let t_a3_1 = vmulq_f64(self.twiddle3re, x1p12);
        let t_a3_2 = vmulq_f64(self.twiddle6re,
x2p11); let t_a3_3 = vmulq_f64(self.twiddle4re, x3p10); let t_a3_4 = vmulq_f64(self.twiddle1re, x4p9); let t_a3_5 = vmulq_f64(self.twiddle2re, x5p8); let t_a3_6 = vmulq_f64(self.twiddle5re, x6p7); let t_a4_1 = vmulq_f64(self.twiddle4re, x1p12); let t_a4_2 = vmulq_f64(self.twiddle5re, x2p11); let t_a4_3 = vmulq_f64(self.twiddle1re, x3p10); let t_a4_4 = vmulq_f64(self.twiddle3re, x4p9); let t_a4_5 = vmulq_f64(self.twiddle6re, x5p8); let t_a4_6 = vmulq_f64(self.twiddle2re, x6p7); let t_a5_1 = vmulq_f64(self.twiddle5re, x1p12); let t_a5_2 = vmulq_f64(self.twiddle3re, x2p11); let t_a5_3 = vmulq_f64(self.twiddle2re, x3p10); let t_a5_4 = vmulq_f64(self.twiddle6re, x4p9); let t_a5_5 = vmulq_f64(self.twiddle1re, x5p8); let t_a5_6 = vmulq_f64(self.twiddle4re, x6p7); let t_a6_1 = vmulq_f64(self.twiddle6re, x1p12); let t_a6_2 = vmulq_f64(self.twiddle1re, x2p11); let t_a6_3 = vmulq_f64(self.twiddle5re, x3p10); let t_a6_4 = vmulq_f64(self.twiddle2re, x4p9); let t_a6_5 = vmulq_f64(self.twiddle4re, x5p8); let t_a6_6 = vmulq_f64(self.twiddle3re, x6p7); let t_b1_1 = vmulq_f64(self.twiddle1im, x1m12); let t_b1_2 = vmulq_f64(self.twiddle2im, x2m11); let t_b1_3 = vmulq_f64(self.twiddle3im, x3m10); let t_b1_4 = vmulq_f64(self.twiddle4im, x4m9); let t_b1_5 = vmulq_f64(self.twiddle5im, x5m8); let t_b1_6 = vmulq_f64(self.twiddle6im, x6m7); let t_b2_1 = vmulq_f64(self.twiddle2im, x1m12); let t_b2_2 = vmulq_f64(self.twiddle4im, x2m11); let t_b2_3 = vmulq_f64(self.twiddle6im, x3m10); let t_b2_4 = vmulq_f64(self.twiddle5im, x4m9); let t_b2_5 = vmulq_f64(self.twiddle3im, x5m8); let t_b2_6 = vmulq_f64(self.twiddle1im, x6m7); let t_b3_1 = vmulq_f64(self.twiddle3im, x1m12); let t_b3_2 = vmulq_f64(self.twiddle6im, x2m11); let t_b3_3 = vmulq_f64(self.twiddle4im, x3m10); let t_b3_4 = vmulq_f64(self.twiddle1im, x4m9); let t_b3_5 = vmulq_f64(self.twiddle2im, x5m8); let t_b3_6 = vmulq_f64(self.twiddle5im, x6m7); let t_b4_1 = vmulq_f64(self.twiddle4im, x1m12); let t_b4_2 = vmulq_f64(self.twiddle5im, x2m11); let t_b4_3 = vmulq_f64(self.twiddle1im, x3m10); let t_b4_4 = vmulq_f64(self.twiddle3im, x4m9); let t_b4_5 = vmulq_f64(self.twiddle6im, x5m8); let t_b4_6 = vmulq_f64(self.twiddle2im, x6m7); let t_b5_1 = vmulq_f64(self.twiddle5im, x1m12); let t_b5_2 = vmulq_f64(self.twiddle3im, x2m11); let t_b5_3 = vmulq_f64(self.twiddle2im, x3m10); let t_b5_4 = vmulq_f64(self.twiddle6im, x4m9); let t_b5_5 = vmulq_f64(self.twiddle1im, x5m8); let t_b5_6 = vmulq_f64(self.twiddle4im, x6m7); let t_b6_1 = vmulq_f64(self.twiddle6im, x1m12); let t_b6_2 = vmulq_f64(self.twiddle1im, x2m11); let t_b6_3 = vmulq_f64(self.twiddle5im, x3m10); let t_b6_4 = vmulq_f64(self.twiddle2im, x4m9); let t_b6_5 = vmulq_f64(self.twiddle4im, x5m8); let t_b6_6 = vmulq_f64(self.twiddle3im, x6m7); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 - t_b2_4 - t_b2_5 - t_b2_6); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 + t_b3_5 + t_b3_6); let t_b4 = calc_f64!(t_b4_1 - t_b4_2 - t_b4_3 + t_b4_4 - t_b4_5 - 
t_b4_6);
        let t_b5 = calc_f64!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6);
        let t_b6 = calc_f64!(t_b6_1 - t_b6_2 + t_b6_3 - t_b6_4 + t_b6_5 - t_b6_6);
        let t_b1_rot = self.rotate.rotate(t_b1);
        let t_b2_rot = self.rotate.rotate(t_b2);
        let t_b3_rot = self.rotate.rotate(t_b3);
        let t_b4_rot = self.rotate.rotate(t_b4);
        let t_b5_rot = self.rotate.rotate(t_b5);
        let t_b6_rot = self.rotate.rotate(t_b6);
        let y0 = calc_f64!(x0 + x1p12 + x2p11 + x3p10 + x4p9 + x5p8 + x6p7);
        let [y1, y12] = solo_fft2_f64(t_a1, t_b1_rot);
        let [y2, y11] = solo_fft2_f64(t_a2, t_b2_rot);
        let [y3, y10] = solo_fft2_f64(t_a3, t_b3_rot);
        let [y4, y9] = solo_fft2_f64(t_a4, t_b4_rot);
        let [y5, y8] = solo_fft2_f64(t_a5, t_b5_rot);
        let [y6, y7] = solo_fft2_f64(t_a6, t_b6_rot);
        [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12]
    }
}

// _ _____ _________ _ _ _
// / |___ | |___ /___ \| |__ (_) |_
// | | / / _____ |_ \ __) | '_ \| | __|
// | | / / |_____| ___) / __/| |_) | | |_
// |_|/_/ |____/_____|_.__/|_|\__|
//
pub struct NeonF32Butterfly17<T> {
    direction: FftDirection,
    _phantom: std::marker::PhantomData<T>,
    rotate: Rotate90F32,
    twiddle1re: float32x4_t,
    twiddle1im: float32x4_t,
    twiddle2re: float32x4_t,
    twiddle2im: float32x4_t,
    twiddle3re: float32x4_t,
    twiddle3im: float32x4_t,
    twiddle4re: float32x4_t,
    twiddle4im: float32x4_t,
    twiddle5re: float32x4_t,
    twiddle5im: float32x4_t,
    twiddle6re: float32x4_t,
    twiddle6im: float32x4_t,
    twiddle7re: float32x4_t,
    twiddle7im: float32x4_t,
    twiddle8re: float32x4_t,
    twiddle8im: float32x4_t,
}
boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly17, 17, |this: &NeonF32Butterfly17<_>| this
    .direction);
boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly17, 17, |this: &NeonF32Butterfly17<_>| this
    .direction);
impl<T: FftNum> NeonF32Butterfly17<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        assert_f32::<T>();
        let rotate = Rotate90F32::new(true);
        let tw1: Complex<f32> = twiddles::compute_twiddle(1, 17, direction);
        let tw2: Complex<f32> = twiddles::compute_twiddle(2, 17, direction);
        let tw3: Complex<f32> = twiddles::compute_twiddle(3, 17, direction);
        let tw4: Complex<f32> = twiddles::compute_twiddle(4, 17, direction);
        let tw5: Complex<f32> = twiddles::compute_twiddle(5, 17, direction);
        let tw6: Complex<f32> = twiddles::compute_twiddle(6, 17, direction);
        let tw7: Complex<f32> = twiddles::compute_twiddle(7, 17, direction);
        let tw8: Complex<f32> = twiddles::compute_twiddle(8, 17, direction);
        let twiddle1re = unsafe { vmovq_n_f32(tw1.re) };
        let twiddle1im = unsafe { vmovq_n_f32(tw1.im) };
        let twiddle2re = unsafe { vmovq_n_f32(tw2.re) };
        let twiddle2im = unsafe { vmovq_n_f32(tw2.im) };
        let twiddle3re = unsafe { vmovq_n_f32(tw3.re) };
        let twiddle3im = unsafe { vmovq_n_f32(tw3.im) };
        let twiddle4re = unsafe { vmovq_n_f32(tw4.re) };
        let twiddle4im = unsafe { vmovq_n_f32(tw4.im) };
        let twiddle5re = unsafe { vmovq_n_f32(tw5.re) };
        let twiddle5im = unsafe { vmovq_n_f32(tw5.im) };
        let twiddle6re = unsafe { vmovq_n_f32(tw6.re) };
        let twiddle6im = unsafe { vmovq_n_f32(tw6.im) };
        let twiddle7re = unsafe { vmovq_n_f32(tw7.re) };
        let twiddle7im = unsafe { vmovq_n_f32(tw7.im) };
        let twiddle8re = unsafe { vmovq_n_f32(tw8.re) };
        let twiddle8im = unsafe { vmovq_n_f32(tw8.im) };
        Self {
            direction,
            _phantom: std::marker::PhantomData,
            rotate,
            twiddle1re,
            twiddle1im,
            twiddle2re,
            twiddle2im,
            twiddle3re,
            twiddle3im,
            twiddle4re,
            twiddle4im,
            twiddle5re,
            twiddle5im,
            twiddle6re,
            twiddle6im,
            twiddle7re,
            twiddle7im,
            twiddle8re,
            twiddle8im,
        }
    }

    #[inline(always)]
    pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut<f32>) {
        let values =
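        // Single-FFT path: read_partial1_complex_to_array! loads one complex number per
        // NEON register so the two-FFTs-at-once kernel below can be reused for a lone FFT,
        // and write_partial_lo_complex_to_array! stores only the low half of each result.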
read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[8]), extract_hi_lo_f32(input_packed[0], input_packed[9]), extract_lo_hi_f32(input_packed[1], input_packed[9]), extract_hi_lo_f32(input_packed[1], input_packed[10]), extract_lo_hi_f32(input_packed[2], input_packed[10]), extract_hi_lo_f32(input_packed[2], input_packed[11]), extract_lo_hi_f32(input_packed[3], input_packed[11]), extract_hi_lo_f32(input_packed[3], input_packed[12]), extract_lo_hi_f32(input_packed[4], input_packed[12]), extract_hi_lo_f32(input_packed[4], input_packed[13]), extract_lo_hi_f32(input_packed[5], input_packed[13]), extract_hi_lo_f32(input_packed[5], input_packed[14]), extract_lo_hi_f32(input_packed[6], input_packed[14]), extract_hi_lo_f32(input_packed[6], input_packed[15]), extract_lo_hi_f32(input_packed[7], input_packed[15]), extract_hi_lo_f32(input_packed[7], input_packed[16]), extract_lo_hi_f32(input_packed[8], input_packed[16]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_hi_f32(out[16], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [float32x4_t; 17]) -> [float32x4_t; 17] { let [x1p16, x1m16] = parallel_fft2_interleaved_f32(values[1], values[16]); let [x2p15, x2m15] = parallel_fft2_interleaved_f32(values[2], values[15]); let [x3p14, x3m14] = parallel_fft2_interleaved_f32(values[3], values[14]); let [x4p13, x4m13] = parallel_fft2_interleaved_f32(values[4], values[13]); let [x5p12, x5m12] = parallel_fft2_interleaved_f32(values[5], values[12]); let [x6p11, x6m11] = parallel_fft2_interleaved_f32(values[6], values[11]); let [x7p10, x7m10] = parallel_fft2_interleaved_f32(values[7], values[10]); let [x8p9, x8m9] = parallel_fft2_interleaved_f32(values[8], values[9]); let t_a1_1 = vmulq_f32(self.twiddle1re, x1p16); let t_a1_2 = vmulq_f32(self.twiddle2re, x2p15); let t_a1_3 = vmulq_f32(self.twiddle3re, x3p14); let t_a1_4 = vmulq_f32(self.twiddle4re, x4p13); let t_a1_5 = vmulq_f32(self.twiddle5re, x5p12); let t_a1_6 = vmulq_f32(self.twiddle6re, x6p11); let t_a1_7 = vmulq_f32(self.twiddle7re, x7p10); let t_a1_8 = vmulq_f32(self.twiddle8re, x8p9); let t_a2_1 = vmulq_f32(self.twiddle2re, x1p16); let t_a2_2 = vmulq_f32(self.twiddle4re, x2p15); let t_a2_3 = vmulq_f32(self.twiddle6re, x3p14); let t_a2_4 = vmulq_f32(self.twiddle8re, x4p13); let t_a2_5 = 
vmulq_f32(self.twiddle7re, x5p12); let t_a2_6 = vmulq_f32(self.twiddle5re, x6p11); let t_a2_7 = vmulq_f32(self.twiddle3re, x7p10); let t_a2_8 = vmulq_f32(self.twiddle1re, x8p9); let t_a3_1 = vmulq_f32(self.twiddle3re, x1p16); let t_a3_2 = vmulq_f32(self.twiddle6re, x2p15); let t_a3_3 = vmulq_f32(self.twiddle8re, x3p14); let t_a3_4 = vmulq_f32(self.twiddle5re, x4p13); let t_a3_5 = vmulq_f32(self.twiddle2re, x5p12); let t_a3_6 = vmulq_f32(self.twiddle1re, x6p11); let t_a3_7 = vmulq_f32(self.twiddle4re, x7p10); let t_a3_8 = vmulq_f32(self.twiddle7re, x8p9); let t_a4_1 = vmulq_f32(self.twiddle4re, x1p16); let t_a4_2 = vmulq_f32(self.twiddle8re, x2p15); let t_a4_3 = vmulq_f32(self.twiddle5re, x3p14); let t_a4_4 = vmulq_f32(self.twiddle1re, x4p13); let t_a4_5 = vmulq_f32(self.twiddle3re, x5p12); let t_a4_6 = vmulq_f32(self.twiddle7re, x6p11); let t_a4_7 = vmulq_f32(self.twiddle6re, x7p10); let t_a4_8 = vmulq_f32(self.twiddle2re, x8p9); let t_a5_1 = vmulq_f32(self.twiddle5re, x1p16); let t_a5_2 = vmulq_f32(self.twiddle7re, x2p15); let t_a5_3 = vmulq_f32(self.twiddle2re, x3p14); let t_a5_4 = vmulq_f32(self.twiddle3re, x4p13); let t_a5_5 = vmulq_f32(self.twiddle8re, x5p12); let t_a5_6 = vmulq_f32(self.twiddle4re, x6p11); let t_a5_7 = vmulq_f32(self.twiddle1re, x7p10); let t_a5_8 = vmulq_f32(self.twiddle6re, x8p9); let t_a6_1 = vmulq_f32(self.twiddle6re, x1p16); let t_a6_2 = vmulq_f32(self.twiddle5re, x2p15); let t_a6_3 = vmulq_f32(self.twiddle1re, x3p14); let t_a6_4 = vmulq_f32(self.twiddle7re, x4p13); let t_a6_5 = vmulq_f32(self.twiddle4re, x5p12); let t_a6_6 = vmulq_f32(self.twiddle2re, x6p11); let t_a6_7 = vmulq_f32(self.twiddle8re, x7p10); let t_a6_8 = vmulq_f32(self.twiddle3re, x8p9); let t_a7_1 = vmulq_f32(self.twiddle7re, x1p16); let t_a7_2 = vmulq_f32(self.twiddle3re, x2p15); let t_a7_3 = vmulq_f32(self.twiddle4re, x3p14); let t_a7_4 = vmulq_f32(self.twiddle6re, x4p13); let t_a7_5 = vmulq_f32(self.twiddle1re, x5p12); let t_a7_6 = vmulq_f32(self.twiddle8re, x6p11); let t_a7_7 = vmulq_f32(self.twiddle2re, x7p10); let t_a7_8 = vmulq_f32(self.twiddle5re, x8p9); let t_a8_1 = vmulq_f32(self.twiddle8re, x1p16); let t_a8_2 = vmulq_f32(self.twiddle1re, x2p15); let t_a8_3 = vmulq_f32(self.twiddle7re, x3p14); let t_a8_4 = vmulq_f32(self.twiddle2re, x4p13); let t_a8_5 = vmulq_f32(self.twiddle6re, x5p12); let t_a8_6 = vmulq_f32(self.twiddle3re, x6p11); let t_a8_7 = vmulq_f32(self.twiddle5re, x7p10); let t_a8_8 = vmulq_f32(self.twiddle4re, x8p9); let t_b1_1 = vmulq_f32(self.twiddle1im, x1m16); let t_b1_2 = vmulq_f32(self.twiddle2im, x2m15); let t_b1_3 = vmulq_f32(self.twiddle3im, x3m14); let t_b1_4 = vmulq_f32(self.twiddle4im, x4m13); let t_b1_5 = vmulq_f32(self.twiddle5im, x5m12); let t_b1_6 = vmulq_f32(self.twiddle6im, x6m11); let t_b1_7 = vmulq_f32(self.twiddle7im, x7m10); let t_b1_8 = vmulq_f32(self.twiddle8im, x8m9); let t_b2_1 = vmulq_f32(self.twiddle2im, x1m16); let t_b2_2 = vmulq_f32(self.twiddle4im, x2m15); let t_b2_3 = vmulq_f32(self.twiddle6im, x3m14); let t_b2_4 = vmulq_f32(self.twiddle8im, x4m13); let t_b2_5 = vmulq_f32(self.twiddle7im, x5m12); let t_b2_6 = vmulq_f32(self.twiddle5im, x6m11); let t_b2_7 = vmulq_f32(self.twiddle3im, x7m10); let t_b2_8 = vmulq_f32(self.twiddle1im, x8m9); let t_b3_1 = vmulq_f32(self.twiddle3im, x1m16); let t_b3_2 = vmulq_f32(self.twiddle6im, x2m15); let t_b3_3 = vmulq_f32(self.twiddle8im, x3m14); let t_b3_4 = vmulq_f32(self.twiddle5im, x4m13); let t_b3_5 = vmulq_f32(self.twiddle2im, x5m12); let t_b3_6 = vmulq_f32(self.twiddle1im, x6m11); let t_b3_7 = 
vmulq_f32(self.twiddle4im, x7m10); let t_b3_8 = vmulq_f32(self.twiddle7im, x8m9); let t_b4_1 = vmulq_f32(self.twiddle4im, x1m16); let t_b4_2 = vmulq_f32(self.twiddle8im, x2m15); let t_b4_3 = vmulq_f32(self.twiddle5im, x3m14); let t_b4_4 = vmulq_f32(self.twiddle1im, x4m13); let t_b4_5 = vmulq_f32(self.twiddle3im, x5m12); let t_b4_6 = vmulq_f32(self.twiddle7im, x6m11); let t_b4_7 = vmulq_f32(self.twiddle6im, x7m10); let t_b4_8 = vmulq_f32(self.twiddle2im, x8m9); let t_b5_1 = vmulq_f32(self.twiddle5im, x1m16); let t_b5_2 = vmulq_f32(self.twiddle7im, x2m15); let t_b5_3 = vmulq_f32(self.twiddle2im, x3m14); let t_b5_4 = vmulq_f32(self.twiddle3im, x4m13); let t_b5_5 = vmulq_f32(self.twiddle8im, x5m12); let t_b5_6 = vmulq_f32(self.twiddle4im, x6m11); let t_b5_7 = vmulq_f32(self.twiddle1im, x7m10); let t_b5_8 = vmulq_f32(self.twiddle6im, x8m9); let t_b6_1 = vmulq_f32(self.twiddle6im, x1m16); let t_b6_2 = vmulq_f32(self.twiddle5im, x2m15); let t_b6_3 = vmulq_f32(self.twiddle1im, x3m14); let t_b6_4 = vmulq_f32(self.twiddle7im, x4m13); let t_b6_5 = vmulq_f32(self.twiddle4im, x5m12); let t_b6_6 = vmulq_f32(self.twiddle2im, x6m11); let t_b6_7 = vmulq_f32(self.twiddle8im, x7m10); let t_b6_8 = vmulq_f32(self.twiddle3im, x8m9); let t_b7_1 = vmulq_f32(self.twiddle7im, x1m16); let t_b7_2 = vmulq_f32(self.twiddle3im, x2m15); let t_b7_3 = vmulq_f32(self.twiddle4im, x3m14); let t_b7_4 = vmulq_f32(self.twiddle6im, x4m13); let t_b7_5 = vmulq_f32(self.twiddle1im, x5m12); let t_b7_6 = vmulq_f32(self.twiddle8im, x6m11); let t_b7_7 = vmulq_f32(self.twiddle2im, x7m10); let t_b7_8 = vmulq_f32(self.twiddle5im, x8m9); let t_b8_1 = vmulq_f32(self.twiddle8im, x1m16); let t_b8_2 = vmulq_f32(self.twiddle1im, x2m15); let t_b8_3 = vmulq_f32(self.twiddle7im, x3m14); let t_b8_4 = vmulq_f32(self.twiddle2im, x4m13); let t_b8_5 = vmulq_f32(self.twiddle6im, x5m12); let t_b8_6 = vmulq_f32(self.twiddle3im, x6m11); let t_b8_7 = vmulq_f32(self.twiddle5im, x7m10); let t_b8_8 = vmulq_f32(self.twiddle4im, x8m9); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8); let t_a7 = calc_f32!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8); let t_a8 = calc_f32!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 - t_b3_5 + t_b3_6 + t_b3_7 + t_b3_8); let t_b4 = calc_f32!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 - t_b4_7 - t_b4_8); let t_b5 = calc_f32!(t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8); let t_b6 = calc_f32!(t_b6_1 - t_b6_2 + t_b6_3 + t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8); let t_b7 = calc_f32!(t_b7_1 - t_b7_2 + t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 + t_b7_8); let t_b8 = calc_f32!(t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 + t_b8_7 - t_b8_8); let 
t_b1_rot = self.rotate.rotate_both(t_b1);
        let t_b2_rot = self.rotate.rotate_both(t_b2);
        let t_b3_rot = self.rotate.rotate_both(t_b3);
        let t_b4_rot = self.rotate.rotate_both(t_b4);
        let t_b5_rot = self.rotate.rotate_both(t_b5);
        let t_b6_rot = self.rotate.rotate_both(t_b6);
        let t_b7_rot = self.rotate.rotate_both(t_b7);
        let t_b8_rot = self.rotate.rotate_both(t_b8);
        let y0 = calc_f32!(x0 + x1p16 + x2p15 + x3p14 + x4p13 + x5p12 + x6p11 + x7p10 + x8p9);
        let [y1, y16] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot);
        let [y2, y15] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot);
        let [y3, y14] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot);
        let [y4, y13] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot);
        let [y5, y12] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot);
        let [y6, y11] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot);
        let [y7, y10] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot);
        let [y8, y9] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot);
        [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16]
    }
}

// _ _____ __ _ _ _ _ _
// / |___ | / /_ | || | | |__ (_) |_
// | | / / _____ | '_ \| || |_| '_ \| | __|
// | | / / |_____| | (_) |__ _| |_) | | |_
// |_|/_/ \___/ |_| |_.__/|_|\__|
//
pub struct NeonF64Butterfly17<T> {
    direction: FftDirection,
    _phantom: std::marker::PhantomData<T>,
    rotate: Rotate90F64,
    twiddle1re: float64x2_t,
    twiddle1im: float64x2_t,
    twiddle2re: float64x2_t,
    twiddle2im: float64x2_t,
    twiddle3re: float64x2_t,
    twiddle3im: float64x2_t,
    twiddle4re: float64x2_t,
    twiddle4im: float64x2_t,
    twiddle5re: float64x2_t,
    twiddle5im: float64x2_t,
    twiddle6re: float64x2_t,
    twiddle6im: float64x2_t,
    twiddle7re: float64x2_t,
    twiddle7im: float64x2_t,
    twiddle8re: float64x2_t,
    twiddle8im: float64x2_t,
}
boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly17, 17, |this: &NeonF64Butterfly17<_>| this
    .direction);
boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly17, 17, |this: &NeonF64Butterfly17<_>| this
    .direction);
impl<T: FftNum> NeonF64Butterfly17<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        assert_f64::<T>();
        let rotate = Rotate90F64::new(true);
        let tw1: Complex<f64> = twiddles::compute_twiddle(1, 17, direction);
        let tw2: Complex<f64> = twiddles::compute_twiddle(2, 17, direction);
        let tw3: Complex<f64> = twiddles::compute_twiddle(3, 17, direction);
        let tw4: Complex<f64> = twiddles::compute_twiddle(4, 17, direction);
        let tw5: Complex<f64> = twiddles::compute_twiddle(5, 17, direction);
        let tw6: Complex<f64> = twiddles::compute_twiddle(6, 17, direction);
        let tw7: Complex<f64> = twiddles::compute_twiddle(7, 17, direction);
        let tw8: Complex<f64> = twiddles::compute_twiddle(8, 17, direction);
        let twiddle1re = unsafe { vmovq_n_f64(tw1.re) };
        let twiddle1im = unsafe { vmovq_n_f64(tw1.im) };
        let twiddle2re = unsafe { vmovq_n_f64(tw2.re) };
        let twiddle2im = unsafe { vmovq_n_f64(tw2.im) };
        let twiddle3re = unsafe { vmovq_n_f64(tw3.re) };
        let twiddle3im = unsafe { vmovq_n_f64(tw3.im) };
        let twiddle4re = unsafe { vmovq_n_f64(tw4.re) };
        let twiddle4im = unsafe { vmovq_n_f64(tw4.im) };
        let twiddle5re = unsafe { vmovq_n_f64(tw5.re) };
        let twiddle5im = unsafe { vmovq_n_f64(tw5.im) };
        let twiddle6re = unsafe { vmovq_n_f64(tw6.re) };
        let twiddle6im = unsafe { vmovq_n_f64(tw6.im) };
        let twiddle7re = unsafe { vmovq_n_f64(tw7.re) };
        let twiddle7im = unsafe { vmovq_n_f64(tw7.im) };
        let twiddle8re = unsafe { vmovq_n_f64(tw8.re) };
        let twiddle8im = unsafe { vmovq_n_f64(tw8.im) };
        Self {
            direction,
            _phantom: std::marker::PhantomData,
            rotate,
            twiddle1re,
            twiddle1im,
            twiddle2re,
            twiddle2im,
            twiddle3re,
            twiddle3im,
            twiddle4re,
            twiddle4im,
twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 17]) -> [float64x2_t; 17] { let [x1p16, x1m16] = solo_fft2_f64(values[1], values[16]); let [x2p15, x2m15] = solo_fft2_f64(values[2], values[15]); let [x3p14, x3m14] = solo_fft2_f64(values[3], values[14]); let [x4p13, x4m13] = solo_fft2_f64(values[4], values[13]); let [x5p12, x5m12] = solo_fft2_f64(values[5], values[12]); let [x6p11, x6m11] = solo_fft2_f64(values[6], values[11]); let [x7p10, x7m10] = solo_fft2_f64(values[7], values[10]); let [x8p9, x8m9] = solo_fft2_f64(values[8], values[9]); let t_a1_1 = vmulq_f64(self.twiddle1re, x1p16); let t_a1_2 = vmulq_f64(self.twiddle2re, x2p15); let t_a1_3 = vmulq_f64(self.twiddle3re, x3p14); let t_a1_4 = vmulq_f64(self.twiddle4re, x4p13); let t_a1_5 = vmulq_f64(self.twiddle5re, x5p12); let t_a1_6 = vmulq_f64(self.twiddle6re, x6p11); let t_a1_7 = vmulq_f64(self.twiddle7re, x7p10); let t_a1_8 = vmulq_f64(self.twiddle8re, x8p9); let t_a2_1 = vmulq_f64(self.twiddle2re, x1p16); let t_a2_2 = vmulq_f64(self.twiddle4re, x2p15); let t_a2_3 = vmulq_f64(self.twiddle6re, x3p14); let t_a2_4 = vmulq_f64(self.twiddle8re, x4p13); let t_a2_5 = vmulq_f64(self.twiddle7re, x5p12); let t_a2_6 = vmulq_f64(self.twiddle5re, x6p11); let t_a2_7 = vmulq_f64(self.twiddle3re, x7p10); let t_a2_8 = vmulq_f64(self.twiddle1re, x8p9); let t_a3_1 = vmulq_f64(self.twiddle3re, x1p16); let t_a3_2 = vmulq_f64(self.twiddle6re, x2p15); let t_a3_3 = vmulq_f64(self.twiddle8re, x3p14); let t_a3_4 = vmulq_f64(self.twiddle5re, x4p13); let t_a3_5 = vmulq_f64(self.twiddle2re, x5p12); let t_a3_6 = vmulq_f64(self.twiddle1re, x6p11); let t_a3_7 = vmulq_f64(self.twiddle4re, x7p10); let t_a3_8 = vmulq_f64(self.twiddle7re, x8p9); let t_a4_1 = vmulq_f64(self.twiddle4re, x1p16); let t_a4_2 = vmulq_f64(self.twiddle8re, x2p15); let t_a4_3 = vmulq_f64(self.twiddle5re, x3p14); let t_a4_4 = vmulq_f64(self.twiddle1re, x4p13); let t_a4_5 = vmulq_f64(self.twiddle3re, x5p12); let t_a4_6 = vmulq_f64(self.twiddle7re, x6p11); let t_a4_7 = vmulq_f64(self.twiddle6re, x7p10); let t_a4_8 = vmulq_f64(self.twiddle2re, x8p9); let t_a5_1 = vmulq_f64(self.twiddle5re, x1p16); let t_a5_2 = vmulq_f64(self.twiddle7re, x2p15); let t_a5_3 = vmulq_f64(self.twiddle2re, x3p14); let t_a5_4 = vmulq_f64(self.twiddle3re, x4p13); let t_a5_5 = vmulq_f64(self.twiddle8re, x5p12); let t_a5_6 = vmulq_f64(self.twiddle4re, x6p11); let t_a5_7 = vmulq_f64(self.twiddle1re, x7p10); let t_a5_8 = vmulq_f64(self.twiddle6re, x8p9); let t_a6_1 = vmulq_f64(self.twiddle6re, x1p16); let t_a6_2 = vmulq_f64(self.twiddle5re, x2p15); let t_a6_3 = vmulq_f64(self.twiddle1re, x3p14); let t_a6_4 = vmulq_f64(self.twiddle7re, x4p13); let t_a6_5 = vmulq_f64(self.twiddle4re, x5p12); let t_a6_6 = vmulq_f64(self.twiddle2re, x6p11); let t_a6_7 = vmulq_f64(self.twiddle8re, x7p10); let t_a6_8 = vmulq_f64(self.twiddle3re, x8p9); let t_a7_1 = vmulq_f64(self.twiddle7re, x1p16); let t_a7_2 = vmulq_f64(self.twiddle3re, x2p15); let t_a7_3 = vmulq_f64(self.twiddle4re, x3p14); let t_a7_4 = vmulq_f64(self.twiddle6re, x4p13); let t_a7_5 = 
vmulq_f64(self.twiddle1re, x5p12); let t_a7_6 = vmulq_f64(self.twiddle8re, x6p11); let t_a7_7 = vmulq_f64(self.twiddle2re, x7p10); let t_a7_8 = vmulq_f64(self.twiddle5re, x8p9); let t_a8_1 = vmulq_f64(self.twiddle8re, x1p16); let t_a8_2 = vmulq_f64(self.twiddle1re, x2p15); let t_a8_3 = vmulq_f64(self.twiddle7re, x3p14); let t_a8_4 = vmulq_f64(self.twiddle2re, x4p13); let t_a8_5 = vmulq_f64(self.twiddle6re, x5p12); let t_a8_6 = vmulq_f64(self.twiddle3re, x6p11); let t_a8_7 = vmulq_f64(self.twiddle5re, x7p10); let t_a8_8 = vmulq_f64(self.twiddle4re, x8p9); let t_b1_1 = vmulq_f64(self.twiddle1im, x1m16); let t_b1_2 = vmulq_f64(self.twiddle2im, x2m15); let t_b1_3 = vmulq_f64(self.twiddle3im, x3m14); let t_b1_4 = vmulq_f64(self.twiddle4im, x4m13); let t_b1_5 = vmulq_f64(self.twiddle5im, x5m12); let t_b1_6 = vmulq_f64(self.twiddle6im, x6m11); let t_b1_7 = vmulq_f64(self.twiddle7im, x7m10); let t_b1_8 = vmulq_f64(self.twiddle8im, x8m9); let t_b2_1 = vmulq_f64(self.twiddle2im, x1m16); let t_b2_2 = vmulq_f64(self.twiddle4im, x2m15); let t_b2_3 = vmulq_f64(self.twiddle6im, x3m14); let t_b2_4 = vmulq_f64(self.twiddle8im, x4m13); let t_b2_5 = vmulq_f64(self.twiddle7im, x5m12); let t_b2_6 = vmulq_f64(self.twiddle5im, x6m11); let t_b2_7 = vmulq_f64(self.twiddle3im, x7m10); let t_b2_8 = vmulq_f64(self.twiddle1im, x8m9); let t_b3_1 = vmulq_f64(self.twiddle3im, x1m16); let t_b3_2 = vmulq_f64(self.twiddle6im, x2m15); let t_b3_3 = vmulq_f64(self.twiddle8im, x3m14); let t_b3_4 = vmulq_f64(self.twiddle5im, x4m13); let t_b3_5 = vmulq_f64(self.twiddle2im, x5m12); let t_b3_6 = vmulq_f64(self.twiddle1im, x6m11); let t_b3_7 = vmulq_f64(self.twiddle4im, x7m10); let t_b3_8 = vmulq_f64(self.twiddle7im, x8m9); let t_b4_1 = vmulq_f64(self.twiddle4im, x1m16); let t_b4_2 = vmulq_f64(self.twiddle8im, x2m15); let t_b4_3 = vmulq_f64(self.twiddle5im, x3m14); let t_b4_4 = vmulq_f64(self.twiddle1im, x4m13); let t_b4_5 = vmulq_f64(self.twiddle3im, x5m12); let t_b4_6 = vmulq_f64(self.twiddle7im, x6m11); let t_b4_7 = vmulq_f64(self.twiddle6im, x7m10); let t_b4_8 = vmulq_f64(self.twiddle2im, x8m9); let t_b5_1 = vmulq_f64(self.twiddle5im, x1m16); let t_b5_2 = vmulq_f64(self.twiddle7im, x2m15); let t_b5_3 = vmulq_f64(self.twiddle2im, x3m14); let t_b5_4 = vmulq_f64(self.twiddle3im, x4m13); let t_b5_5 = vmulq_f64(self.twiddle8im, x5m12); let t_b5_6 = vmulq_f64(self.twiddle4im, x6m11); let t_b5_7 = vmulq_f64(self.twiddle1im, x7m10); let t_b5_8 = vmulq_f64(self.twiddle6im, x8m9); let t_b6_1 = vmulq_f64(self.twiddle6im, x1m16); let t_b6_2 = vmulq_f64(self.twiddle5im, x2m15); let t_b6_3 = vmulq_f64(self.twiddle1im, x3m14); let t_b6_4 = vmulq_f64(self.twiddle7im, x4m13); let t_b6_5 = vmulq_f64(self.twiddle4im, x5m12); let t_b6_6 = vmulq_f64(self.twiddle2im, x6m11); let t_b6_7 = vmulq_f64(self.twiddle8im, x7m10); let t_b6_8 = vmulq_f64(self.twiddle3im, x8m9); let t_b7_1 = vmulq_f64(self.twiddle7im, x1m16); let t_b7_2 = vmulq_f64(self.twiddle3im, x2m15); let t_b7_3 = vmulq_f64(self.twiddle4im, x3m14); let t_b7_4 = vmulq_f64(self.twiddle6im, x4m13); let t_b7_5 = vmulq_f64(self.twiddle1im, x5m12); let t_b7_6 = vmulq_f64(self.twiddle8im, x6m11); let t_b7_7 = vmulq_f64(self.twiddle2im, x7m10); let t_b7_8 = vmulq_f64(self.twiddle5im, x8m9); let t_b8_1 = vmulq_f64(self.twiddle8im, x1m16); let t_b8_2 = vmulq_f64(self.twiddle1im, x2m15); let t_b8_3 = vmulq_f64(self.twiddle7im, x3m14); let t_b8_4 = vmulq_f64(self.twiddle2im, x4m13); let t_b8_5 = vmulq_f64(self.twiddle6im, x5m12); let t_b8_6 = vmulq_f64(self.twiddle3im, x6m11); let t_b8_7 = 
vmulq_f64(self.twiddle5im, x7m10); let t_b8_8 = vmulq_f64(self.twiddle4im, x8m9); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8); let t_a7 = calc_f64!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8); let t_a8 = calc_f64!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 - t_b3_5 + t_b3_6 + t_b3_7 + t_b3_8); let t_b4 = calc_f64!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 - t_b4_7 - t_b4_8); let t_b5 = calc_f64!(t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8); let t_b6 = calc_f64!(t_b6_1 - t_b6_2 + t_b6_3 + t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8); let t_b7 = calc_f64!(t_b7_1 - t_b7_2 + t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 + t_b7_8); let t_b8 = calc_f64!(t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 + t_b8_7 - t_b8_8); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let y0 = calc_f64!(x0 + x1p16 + x2p15 + x3p14 + x4p13 + x5p12 + x6p11 + x7p10 + x8p9); let [y1, y16] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y15] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y14] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y13] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y12] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y11] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y10] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y9] = solo_fft2_f64(t_a8, t_b8_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16] } } // _ ___ _________ _ _ _ // / |/ _ \ |___ /___ \| |__ (_) |_ // | | (_) | _____ |_ \ __) | '_ \| | __| // | |\__, | |_____| ___) / __/| |_) | | |_ // |_| /_/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly19 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: float32x4_t, twiddle1im: float32x4_t, twiddle2re: float32x4_t, twiddle2im: float32x4_t, twiddle3re: float32x4_t, twiddle3im: float32x4_t, twiddle4re: float32x4_t, twiddle4im: float32x4_t, twiddle5re: float32x4_t, twiddle5im: float32x4_t, twiddle6re: float32x4_t, twiddle6im: float32x4_t, twiddle7re: float32x4_t, twiddle7im: float32x4_t, twiddle8re: float32x4_t, twiddle8im: float32x4_t, twiddle9re: float32x4_t, twiddle9im: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly19, 19, |this: &NeonF32Butterfly19<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly19, 19, |this: &NeonF32Butterfly19<_>| this .direction); impl NeonF32Butterfly19 { #[inline(always)] pub fn new(direction: FftDirection) 
-> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 19, direction); let tw2: Complex = twiddles::compute_twiddle(2, 19, direction); let tw3: Complex = twiddles::compute_twiddle(3, 19, direction); let tw4: Complex = twiddles::compute_twiddle(4, 19, direction); let tw5: Complex = twiddles::compute_twiddle(5, 19, direction); let tw6: Complex = twiddles::compute_twiddle(6, 19, direction); let tw7: Complex = twiddles::compute_twiddle(7, 19, direction); let tw8: Complex = twiddles::compute_twiddle(8, 19, direction); let tw9: Complex = twiddles::compute_twiddle(9, 19, direction); let twiddle1re = unsafe { vmovq_n_f32(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f32(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f32(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f32(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f32(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f32(tw3.im) }; let twiddle4re = unsafe { vmovq_n_f32(tw4.re) }; let twiddle4im = unsafe { vmovq_n_f32(tw4.im) }; let twiddle5re = unsafe { vmovq_n_f32(tw5.re) }; let twiddle5im = unsafe { vmovq_n_f32(tw5.im) }; let twiddle6re = unsafe { vmovq_n_f32(tw6.re) }; let twiddle6im = unsafe { vmovq_n_f32(tw6.im) }; let twiddle7re = unsafe { vmovq_n_f32(tw7.re) }; let twiddle7im = unsafe { vmovq_n_f32(tw7.im) }; let twiddle8re = unsafe { vmovq_n_f32(tw8.re) }; let twiddle8im = unsafe { vmovq_n_f32(tw8.im) }; let twiddle9re = unsafe { vmovq_n_f32(tw9.re) }; let twiddle9im = unsafe { vmovq_n_f32(tw9.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[9]), extract_hi_lo_f32(input_packed[0], input_packed[10]), extract_lo_hi_f32(input_packed[1], input_packed[10]), extract_hi_lo_f32(input_packed[1], input_packed[11]), extract_lo_hi_f32(input_packed[2], input_packed[11]), extract_hi_lo_f32(input_packed[2], input_packed[12]), extract_lo_hi_f32(input_packed[3], input_packed[12]), extract_hi_lo_f32(input_packed[3], input_packed[13]), extract_lo_hi_f32(input_packed[4], input_packed[13]), extract_hi_lo_f32(input_packed[4], input_packed[14]), extract_lo_hi_f32(input_packed[5], input_packed[14]), extract_hi_lo_f32(input_packed[5], input_packed[15]), extract_lo_hi_f32(input_packed[6], input_packed[15]), extract_hi_lo_f32(input_packed[6], input_packed[16]), extract_lo_hi_f32(input_packed[7], input_packed[16]), extract_hi_lo_f32(input_packed[7], input_packed[17]), extract_lo_hi_f32(input_packed[8], input_packed[17]), extract_hi_lo_f32(input_packed[8], input_packed[18]), extract_lo_hi_f32(input_packed[9], input_packed[18]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ 
extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_lo_f32(out[16], out[17]), extract_lo_hi_f32(out[18], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [float32x4_t; 19]) -> [float32x4_t; 19] { let [x1p18, x1m18] = parallel_fft2_interleaved_f32(values[1], values[18]); let [x2p17, x2m17] = parallel_fft2_interleaved_f32(values[2], values[17]); let [x3p16, x3m16] = parallel_fft2_interleaved_f32(values[3], values[16]); let [x4p15, x4m15] = parallel_fft2_interleaved_f32(values[4], values[15]); let [x5p14, x5m14] = parallel_fft2_interleaved_f32(values[5], values[14]); let [x6p13, x6m13] = parallel_fft2_interleaved_f32(values[6], values[13]); let [x7p12, x7m12] = parallel_fft2_interleaved_f32(values[7], values[12]); let [x8p11, x8m11] = parallel_fft2_interleaved_f32(values[8], values[11]); let [x9p10, x9m10] = parallel_fft2_interleaved_f32(values[9], values[10]); let t_a1_1 = vmulq_f32(self.twiddle1re, x1p18); let t_a1_2 = vmulq_f32(self.twiddle2re, x2p17); let t_a1_3 = vmulq_f32(self.twiddle3re, x3p16); let t_a1_4 = vmulq_f32(self.twiddle4re, x4p15); let t_a1_5 = vmulq_f32(self.twiddle5re, x5p14); let t_a1_6 = vmulq_f32(self.twiddle6re, x6p13); let t_a1_7 = vmulq_f32(self.twiddle7re, x7p12); let t_a1_8 = vmulq_f32(self.twiddle8re, x8p11); let t_a1_9 = vmulq_f32(self.twiddle9re, x9p10); let t_a2_1 = vmulq_f32(self.twiddle2re, x1p18); let t_a2_2 = vmulq_f32(self.twiddle4re, x2p17); let t_a2_3 = vmulq_f32(self.twiddle6re, x3p16); let t_a2_4 = vmulq_f32(self.twiddle8re, x4p15); let t_a2_5 = vmulq_f32(self.twiddle9re, x5p14); let t_a2_6 = vmulq_f32(self.twiddle7re, x6p13); let t_a2_7 = vmulq_f32(self.twiddle5re, x7p12); let t_a2_8 = vmulq_f32(self.twiddle3re, x8p11); let t_a2_9 = vmulq_f32(self.twiddle1re, x9p10); let t_a3_1 = vmulq_f32(self.twiddle3re, x1p18); let t_a3_2 = vmulq_f32(self.twiddle6re, x2p17); let t_a3_3 = vmulq_f32(self.twiddle9re, x3p16); let t_a3_4 = vmulq_f32(self.twiddle7re, x4p15); let t_a3_5 = vmulq_f32(self.twiddle4re, x5p14); let t_a3_6 = vmulq_f32(self.twiddle1re, x6p13); let t_a3_7 = vmulq_f32(self.twiddle2re, x7p12); let t_a3_8 = vmulq_f32(self.twiddle5re, x8p11); let t_a3_9 = vmulq_f32(self.twiddle8re, x9p10); let t_a4_1 = vmulq_f32(self.twiddle4re, x1p18); let t_a4_2 = vmulq_f32(self.twiddle8re, x2p17); let t_a4_3 = vmulq_f32(self.twiddle7re, x3p16); let t_a4_4 = vmulq_f32(self.twiddle3re, x4p15); let t_a4_5 = vmulq_f32(self.twiddle1re, x5p14); let t_a4_6 = vmulq_f32(self.twiddle5re, x6p13); let t_a4_7 = vmulq_f32(self.twiddle9re, x7p12); let t_a4_8 = vmulq_f32(self.twiddle6re, x8p11); let t_a4_9 = vmulq_f32(self.twiddle2re, x9p10); let t_a5_1 = vmulq_f32(self.twiddle5re, x1p18); let t_a5_2 = vmulq_f32(self.twiddle9re, x2p17); let t_a5_3 = vmulq_f32(self.twiddle4re, x3p16); let t_a5_4 = vmulq_f32(self.twiddle1re, x4p15); let t_a5_5 = 
vmulq_f32(self.twiddle6re, x5p14); let t_a5_6 = vmulq_f32(self.twiddle8re, x6p13); let t_a5_7 = vmulq_f32(self.twiddle3re, x7p12); let t_a5_8 = vmulq_f32(self.twiddle2re, x8p11); let t_a5_9 = vmulq_f32(self.twiddle7re, x9p10); let t_a6_1 = vmulq_f32(self.twiddle6re, x1p18); let t_a6_2 = vmulq_f32(self.twiddle7re, x2p17); let t_a6_3 = vmulq_f32(self.twiddle1re, x3p16); let t_a6_4 = vmulq_f32(self.twiddle5re, x4p15); let t_a6_5 = vmulq_f32(self.twiddle8re, x5p14); let t_a6_6 = vmulq_f32(self.twiddle2re, x6p13); let t_a6_7 = vmulq_f32(self.twiddle4re, x7p12); let t_a6_8 = vmulq_f32(self.twiddle9re, x8p11); let t_a6_9 = vmulq_f32(self.twiddle3re, x9p10); let t_a7_1 = vmulq_f32(self.twiddle7re, x1p18); let t_a7_2 = vmulq_f32(self.twiddle5re, x2p17); let t_a7_3 = vmulq_f32(self.twiddle2re, x3p16); let t_a7_4 = vmulq_f32(self.twiddle9re, x4p15); let t_a7_5 = vmulq_f32(self.twiddle3re, x5p14); let t_a7_6 = vmulq_f32(self.twiddle4re, x6p13); let t_a7_7 = vmulq_f32(self.twiddle8re, x7p12); let t_a7_8 = vmulq_f32(self.twiddle1re, x8p11); let t_a7_9 = vmulq_f32(self.twiddle6re, x9p10); let t_a8_1 = vmulq_f32(self.twiddle8re, x1p18); let t_a8_2 = vmulq_f32(self.twiddle3re, x2p17); let t_a8_3 = vmulq_f32(self.twiddle5re, x3p16); let t_a8_4 = vmulq_f32(self.twiddle6re, x4p15); let t_a8_5 = vmulq_f32(self.twiddle2re, x5p14); let t_a8_6 = vmulq_f32(self.twiddle9re, x6p13); let t_a8_7 = vmulq_f32(self.twiddle1re, x7p12); let t_a8_8 = vmulq_f32(self.twiddle7re, x8p11); let t_a8_9 = vmulq_f32(self.twiddle4re, x9p10); let t_a9_1 = vmulq_f32(self.twiddle9re, x1p18); let t_a9_2 = vmulq_f32(self.twiddle1re, x2p17); let t_a9_3 = vmulq_f32(self.twiddle8re, x3p16); let t_a9_4 = vmulq_f32(self.twiddle2re, x4p15); let t_a9_5 = vmulq_f32(self.twiddle7re, x5p14); let t_a9_6 = vmulq_f32(self.twiddle3re, x6p13); let t_a9_7 = vmulq_f32(self.twiddle6re, x7p12); let t_a9_8 = vmulq_f32(self.twiddle4re, x8p11); let t_a9_9 = vmulq_f32(self.twiddle5re, x9p10); let t_b1_1 = vmulq_f32(self.twiddle1im, x1m18); let t_b1_2 = vmulq_f32(self.twiddle2im, x2m17); let t_b1_3 = vmulq_f32(self.twiddle3im, x3m16); let t_b1_4 = vmulq_f32(self.twiddle4im, x4m15); let t_b1_5 = vmulq_f32(self.twiddle5im, x5m14); let t_b1_6 = vmulq_f32(self.twiddle6im, x6m13); let t_b1_7 = vmulq_f32(self.twiddle7im, x7m12); let t_b1_8 = vmulq_f32(self.twiddle8im, x8m11); let t_b1_9 = vmulq_f32(self.twiddle9im, x9m10); let t_b2_1 = vmulq_f32(self.twiddle2im, x1m18); let t_b2_2 = vmulq_f32(self.twiddle4im, x2m17); let t_b2_3 = vmulq_f32(self.twiddle6im, x3m16); let t_b2_4 = vmulq_f32(self.twiddle8im, x4m15); let t_b2_5 = vmulq_f32(self.twiddle9im, x5m14); let t_b2_6 = vmulq_f32(self.twiddle7im, x6m13); let t_b2_7 = vmulq_f32(self.twiddle5im, x7m12); let t_b2_8 = vmulq_f32(self.twiddle3im, x8m11); let t_b2_9 = vmulq_f32(self.twiddle1im, x9m10); let t_b3_1 = vmulq_f32(self.twiddle3im, x1m18); let t_b3_2 = vmulq_f32(self.twiddle6im, x2m17); let t_b3_3 = vmulq_f32(self.twiddle9im, x3m16); let t_b3_4 = vmulq_f32(self.twiddle7im, x4m15); let t_b3_5 = vmulq_f32(self.twiddle4im, x5m14); let t_b3_6 = vmulq_f32(self.twiddle1im, x6m13); let t_b3_7 = vmulq_f32(self.twiddle2im, x7m12); let t_b3_8 = vmulq_f32(self.twiddle5im, x8m11); let t_b3_9 = vmulq_f32(self.twiddle8im, x9m10); let t_b4_1 = vmulq_f32(self.twiddle4im, x1m18); let t_b4_2 = vmulq_f32(self.twiddle8im, x2m17); let t_b4_3 = vmulq_f32(self.twiddle7im, x3m16); let t_b4_4 = vmulq_f32(self.twiddle3im, x4m15); let t_b4_5 = vmulq_f32(self.twiddle1im, x5m14); let t_b4_6 = vmulq_f32(self.twiddle5im, x6m13); let t_b4_7 = 
vmulq_f32(self.twiddle9im, x7m12); let t_b4_8 = vmulq_f32(self.twiddle6im, x8m11); let t_b4_9 = vmulq_f32(self.twiddle2im, x9m10); let t_b5_1 = vmulq_f32(self.twiddle5im, x1m18); let t_b5_2 = vmulq_f32(self.twiddle9im, x2m17); let t_b5_3 = vmulq_f32(self.twiddle4im, x3m16); let t_b5_4 = vmulq_f32(self.twiddle1im, x4m15); let t_b5_5 = vmulq_f32(self.twiddle6im, x5m14); let t_b5_6 = vmulq_f32(self.twiddle8im, x6m13); let t_b5_7 = vmulq_f32(self.twiddle3im, x7m12); let t_b5_8 = vmulq_f32(self.twiddle2im, x8m11); let t_b5_9 = vmulq_f32(self.twiddle7im, x9m10); let t_b6_1 = vmulq_f32(self.twiddle6im, x1m18); let t_b6_2 = vmulq_f32(self.twiddle7im, x2m17); let t_b6_3 = vmulq_f32(self.twiddle1im, x3m16); let t_b6_4 = vmulq_f32(self.twiddle5im, x4m15); let t_b6_5 = vmulq_f32(self.twiddle8im, x5m14); let t_b6_6 = vmulq_f32(self.twiddle2im, x6m13); let t_b6_7 = vmulq_f32(self.twiddle4im, x7m12); let t_b6_8 = vmulq_f32(self.twiddle9im, x8m11); let t_b6_9 = vmulq_f32(self.twiddle3im, x9m10); let t_b7_1 = vmulq_f32(self.twiddle7im, x1m18); let t_b7_2 = vmulq_f32(self.twiddle5im, x2m17); let t_b7_3 = vmulq_f32(self.twiddle2im, x3m16); let t_b7_4 = vmulq_f32(self.twiddle9im, x4m15); let t_b7_5 = vmulq_f32(self.twiddle3im, x5m14); let t_b7_6 = vmulq_f32(self.twiddle4im, x6m13); let t_b7_7 = vmulq_f32(self.twiddle8im, x7m12); let t_b7_8 = vmulq_f32(self.twiddle1im, x8m11); let t_b7_9 = vmulq_f32(self.twiddle6im, x9m10); let t_b8_1 = vmulq_f32(self.twiddle8im, x1m18); let t_b8_2 = vmulq_f32(self.twiddle3im, x2m17); let t_b8_3 = vmulq_f32(self.twiddle5im, x3m16); let t_b8_4 = vmulq_f32(self.twiddle6im, x4m15); let t_b8_5 = vmulq_f32(self.twiddle2im, x5m14); let t_b8_6 = vmulq_f32(self.twiddle9im, x6m13); let t_b8_7 = vmulq_f32(self.twiddle1im, x7m12); let t_b8_8 = vmulq_f32(self.twiddle7im, x8m11); let t_b8_9 = vmulq_f32(self.twiddle4im, x9m10); let t_b9_1 = vmulq_f32(self.twiddle9im, x1m18); let t_b9_2 = vmulq_f32(self.twiddle1im, x2m17); let t_b9_3 = vmulq_f32(self.twiddle8im, x3m16); let t_b9_4 = vmulq_f32(self.twiddle2im, x4m15); let t_b9_5 = vmulq_f32(self.twiddle7im, x5m14); let t_b9_6 = vmulq_f32(self.twiddle3im, x6m13); let t_b9_7 = vmulq_f32(self.twiddle6im, x7m12); let t_b9_8 = vmulq_f32(self.twiddle4im, x8m11); let t_b9_9 = vmulq_f32(self.twiddle5im, x9m10); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9); let t_a7 = calc_f32!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9); let t_a8 = calc_f32!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9); let t_a9 = calc_f32!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 + 
t_b3_7 + t_b3_8 + t_b3_9);
        let t_b4 = calc_f32!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 + t_b4_7 - t_b4_8 - t_b4_9);
        let t_b5 = calc_f32!(t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 - t_b5_7 + t_b5_8 + t_b5_9);
        let t_b6 = calc_f32!(t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 - t_b6_5 - t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9);
        let t_b7 = calc_f32!(t_b7_1 - t_b7_2 + t_b7_3 + t_b7_4 - t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9);
        let t_b8 = calc_f32!(t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 - t_b8_9);
        let t_b9 = calc_f32!(t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 - t_b9_8 + t_b9_9);
        let t_b1_rot = self.rotate.rotate_both(t_b1);
        let t_b2_rot = self.rotate.rotate_both(t_b2);
        let t_b3_rot = self.rotate.rotate_both(t_b3);
        let t_b4_rot = self.rotate.rotate_both(t_b4);
        let t_b5_rot = self.rotate.rotate_both(t_b5);
        let t_b6_rot = self.rotate.rotate_both(t_b6);
        let t_b7_rot = self.rotate.rotate_both(t_b7);
        let t_b8_rot = self.rotate.rotate_both(t_b8);
        let t_b9_rot = self.rotate.rotate_both(t_b9);
        let y0 = calc_f32!(x0 + x1p18 + x2p17 + x3p16 + x4p15 + x5p14 + x6p13 + x7p12 + x8p11 + x9p10);
        let [y1, y18] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot);
        let [y2, y17] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot);
        let [y3, y16] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot);
        let [y4, y15] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot);
        let [y5, y14] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot);
        let [y6, y13] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot);
        let [y7, y12] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot);
        let [y8, y11] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot);
        let [y9, y10] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot);
        [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18]
    }
}

// _ ___ __ _ _ _ _ _
// / |/ _ \ / /_ | || | | |__ (_) |_
// | | (_) | _____ | '_ \| || |_| '_ \| | __|
// | |\__, | |_____| | (_) |__ _| |_) | | |_
// |_| /_/ \___/ |_| |_.__/|_|\__|
//
pub struct NeonF64Butterfly19<T> {
    direction: FftDirection,
    _phantom: std::marker::PhantomData<T>,
    rotate: Rotate90F64,
    twiddle1re: float64x2_t,
    twiddle1im: float64x2_t,
    twiddle2re: float64x2_t,
    twiddle2im: float64x2_t,
    twiddle3re: float64x2_t,
    twiddle3im: float64x2_t,
    twiddle4re: float64x2_t,
    twiddle4im: float64x2_t,
    twiddle5re: float64x2_t,
    twiddle5im: float64x2_t,
    twiddle6re: float64x2_t,
    twiddle6im: float64x2_t,
    twiddle7re: float64x2_t,
    twiddle7im: float64x2_t,
    twiddle8re: float64x2_t,
    twiddle8im: float64x2_t,
    twiddle9re: float64x2_t,
    twiddle9im: float64x2_t,
}
boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly19, 19, |this: &NeonF64Butterfly19<_>| this
    .direction);
boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly19, 19, |this: &NeonF64Butterfly19<_>| this
    .direction);
impl<T: FftNum> NeonF64Butterfly19<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        assert_f64::<T>();
        let rotate = Rotate90F64::new(true);
        let tw1: Complex<f64> = twiddles::compute_twiddle(1, 19, direction);
        let tw2: Complex<f64> = twiddles::compute_twiddle(2, 19, direction);
        let tw3: Complex<f64> = twiddles::compute_twiddle(3, 19, direction);
        let tw4: Complex<f64> = twiddles::compute_twiddle(4, 19, direction);
        let tw5: Complex<f64> = twiddles::compute_twiddle(5, 19, direction);
        let tw6: Complex<f64> = twiddles::compute_twiddle(6, 19, direction);
        let tw7: Complex<f64> = twiddles::compute_twiddle(7, 19, direction);
        let tw8: Complex<f64> = twiddles::compute_twiddle(8, 19, direction);
        let tw9: Complex<f64> = twiddles::compute_twiddle(9, 19, direction);
        let twiddle1re = unsafe { vmovq_n_f64(tw1.re) };
        let
twiddle1im = unsafe { vmovq_n_f64(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f64(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f64(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f64(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f64(tw3.im) }; let twiddle4re = unsafe { vmovq_n_f64(tw4.re) }; let twiddle4im = unsafe { vmovq_n_f64(tw4.im) }; let twiddle5re = unsafe { vmovq_n_f64(tw5.re) }; let twiddle5im = unsafe { vmovq_n_f64(tw5.im) }; let twiddle6re = unsafe { vmovq_n_f64(tw6.re) }; let twiddle6im = unsafe { vmovq_n_f64(tw6.im) }; let twiddle7re = unsafe { vmovq_n_f64(tw7.re) }; let twiddle7im = unsafe { vmovq_n_f64(tw7.im) }; let twiddle8re = unsafe { vmovq_n_f64(tw8.re) }; let twiddle8im = unsafe { vmovq_n_f64(tw8.im) }; let twiddle9re = unsafe { vmovq_n_f64(tw9.re) }; let twiddle9im = unsafe { vmovq_n_f64(tw9.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 19]) -> [float64x2_t; 19] { let [x1p18, x1m18] = solo_fft2_f64(values[1], values[18]); let [x2p17, x2m17] = solo_fft2_f64(values[2], values[17]); let [x3p16, x3m16] = solo_fft2_f64(values[3], values[16]); let [x4p15, x4m15] = solo_fft2_f64(values[4], values[15]); let [x5p14, x5m14] = solo_fft2_f64(values[5], values[14]); let [x6p13, x6m13] = solo_fft2_f64(values[6], values[13]); let [x7p12, x7m12] = solo_fft2_f64(values[7], values[12]); let [x8p11, x8m11] = solo_fft2_f64(values[8], values[11]); let [x9p10, x9m10] = solo_fft2_f64(values[9], values[10]); let t_a1_1 = vmulq_f64(self.twiddle1re, x1p18); let t_a1_2 = vmulq_f64(self.twiddle2re, x2p17); let t_a1_3 = vmulq_f64(self.twiddle3re, x3p16); let t_a1_4 = vmulq_f64(self.twiddle4re, x4p15); let t_a1_5 = vmulq_f64(self.twiddle5re, x5p14); let t_a1_6 = vmulq_f64(self.twiddle6re, x6p13); let t_a1_7 = vmulq_f64(self.twiddle7re, x7p12); let t_a1_8 = vmulq_f64(self.twiddle8re, x8p11); let t_a1_9 = vmulq_f64(self.twiddle9re, x9p10); let t_a2_1 = vmulq_f64(self.twiddle2re, x1p18); let t_a2_2 = vmulq_f64(self.twiddle4re, x2p17); let t_a2_3 = vmulq_f64(self.twiddle6re, x3p16); let t_a2_4 = vmulq_f64(self.twiddle8re, x4p15); let t_a2_5 = vmulq_f64(self.twiddle9re, x5p14); let t_a2_6 = vmulq_f64(self.twiddle7re, x6p13); let t_a2_7 = vmulq_f64(self.twiddle5re, x7p12); let t_a2_8 = vmulq_f64(self.twiddle3re, x8p11); let t_a2_9 = vmulq_f64(self.twiddle1re, x9p10); let t_a3_1 = vmulq_f64(self.twiddle3re, x1p18); let t_a3_2 = vmulq_f64(self.twiddle6re, x2p17); let t_a3_3 = vmulq_f64(self.twiddle9re, x3p16); let t_a3_4 = vmulq_f64(self.twiddle7re, x4p15); let t_a3_5 = vmulq_f64(self.twiddle4re, x5p14); let t_a3_6 = vmulq_f64(self.twiddle1re, x6p13); let t_a3_7 = vmulq_f64(self.twiddle2re, x7p12); let t_a3_8 = vmulq_f64(self.twiddle5re, x8p11); let t_a3_9 = vmulq_f64(self.twiddle8re, x9p10); let t_a4_1 = vmulq_f64(self.twiddle4re, x1p18); let t_a4_2 = vmulq_f64(self.twiddle8re, x2p17); let t_a4_3 = vmulq_f64(self.twiddle7re, 
x3p16); let t_a4_4 = vmulq_f64(self.twiddle3re, x4p15); let t_a4_5 = vmulq_f64(self.twiddle1re, x5p14); let t_a4_6 = vmulq_f64(self.twiddle5re, x6p13); let t_a4_7 = vmulq_f64(self.twiddle9re, x7p12); let t_a4_8 = vmulq_f64(self.twiddle6re, x8p11); let t_a4_9 = vmulq_f64(self.twiddle2re, x9p10); let t_a5_1 = vmulq_f64(self.twiddle5re, x1p18); let t_a5_2 = vmulq_f64(self.twiddle9re, x2p17); let t_a5_3 = vmulq_f64(self.twiddle4re, x3p16); let t_a5_4 = vmulq_f64(self.twiddle1re, x4p15); let t_a5_5 = vmulq_f64(self.twiddle6re, x5p14); let t_a5_6 = vmulq_f64(self.twiddle8re, x6p13); let t_a5_7 = vmulq_f64(self.twiddle3re, x7p12); let t_a5_8 = vmulq_f64(self.twiddle2re, x8p11); let t_a5_9 = vmulq_f64(self.twiddle7re, x9p10); let t_a6_1 = vmulq_f64(self.twiddle6re, x1p18); let t_a6_2 = vmulq_f64(self.twiddle7re, x2p17); let t_a6_3 = vmulq_f64(self.twiddle1re, x3p16); let t_a6_4 = vmulq_f64(self.twiddle5re, x4p15); let t_a6_5 = vmulq_f64(self.twiddle8re, x5p14); let t_a6_6 = vmulq_f64(self.twiddle2re, x6p13); let t_a6_7 = vmulq_f64(self.twiddle4re, x7p12); let t_a6_8 = vmulq_f64(self.twiddle9re, x8p11); let t_a6_9 = vmulq_f64(self.twiddle3re, x9p10); let t_a7_1 = vmulq_f64(self.twiddle7re, x1p18); let t_a7_2 = vmulq_f64(self.twiddle5re, x2p17); let t_a7_3 = vmulq_f64(self.twiddle2re, x3p16); let t_a7_4 = vmulq_f64(self.twiddle9re, x4p15); let t_a7_5 = vmulq_f64(self.twiddle3re, x5p14); let t_a7_6 = vmulq_f64(self.twiddle4re, x6p13); let t_a7_7 = vmulq_f64(self.twiddle8re, x7p12); let t_a7_8 = vmulq_f64(self.twiddle1re, x8p11); let t_a7_9 = vmulq_f64(self.twiddle6re, x9p10); let t_a8_1 = vmulq_f64(self.twiddle8re, x1p18); let t_a8_2 = vmulq_f64(self.twiddle3re, x2p17); let t_a8_3 = vmulq_f64(self.twiddle5re, x3p16); let t_a8_4 = vmulq_f64(self.twiddle6re, x4p15); let t_a8_5 = vmulq_f64(self.twiddle2re, x5p14); let t_a8_6 = vmulq_f64(self.twiddle9re, x6p13); let t_a8_7 = vmulq_f64(self.twiddle1re, x7p12); let t_a8_8 = vmulq_f64(self.twiddle7re, x8p11); let t_a8_9 = vmulq_f64(self.twiddle4re, x9p10); let t_a9_1 = vmulq_f64(self.twiddle9re, x1p18); let t_a9_2 = vmulq_f64(self.twiddle1re, x2p17); let t_a9_3 = vmulq_f64(self.twiddle8re, x3p16); let t_a9_4 = vmulq_f64(self.twiddle2re, x4p15); let t_a9_5 = vmulq_f64(self.twiddle7re, x5p14); let t_a9_6 = vmulq_f64(self.twiddle3re, x6p13); let t_a9_7 = vmulq_f64(self.twiddle6re, x7p12); let t_a9_8 = vmulq_f64(self.twiddle4re, x8p11); let t_a9_9 = vmulq_f64(self.twiddle5re, x9p10); let t_b1_1 = vmulq_f64(self.twiddle1im, x1m18); let t_b1_2 = vmulq_f64(self.twiddle2im, x2m17); let t_b1_3 = vmulq_f64(self.twiddle3im, x3m16); let t_b1_4 = vmulq_f64(self.twiddle4im, x4m15); let t_b1_5 = vmulq_f64(self.twiddle5im, x5m14); let t_b1_6 = vmulq_f64(self.twiddle6im, x6m13); let t_b1_7 = vmulq_f64(self.twiddle7im, x7m12); let t_b1_8 = vmulq_f64(self.twiddle8im, x8m11); let t_b1_9 = vmulq_f64(self.twiddle9im, x9m10); let t_b2_1 = vmulq_f64(self.twiddle2im, x1m18); let t_b2_2 = vmulq_f64(self.twiddle4im, x2m17); let t_b2_3 = vmulq_f64(self.twiddle6im, x3m16); let t_b2_4 = vmulq_f64(self.twiddle8im, x4m15); let t_b2_5 = vmulq_f64(self.twiddle9im, x5m14); let t_b2_6 = vmulq_f64(self.twiddle7im, x6m13); let t_b2_7 = vmulq_f64(self.twiddle5im, x7m12); let t_b2_8 = vmulq_f64(self.twiddle3im, x8m11); let t_b2_9 = vmulq_f64(self.twiddle1im, x9m10); let t_b3_1 = vmulq_f64(self.twiddle3im, x1m18); let t_b3_2 = vmulq_f64(self.twiddle6im, x2m17); let t_b3_3 = vmulq_f64(self.twiddle9im, x3m16); let t_b3_4 = vmulq_f64(self.twiddle7im, x4m15); let t_b3_5 = vmulq_f64(self.twiddle4im, 
x5m14); let t_b3_6 = vmulq_f64(self.twiddle1im, x6m13); let t_b3_7 = vmulq_f64(self.twiddle2im, x7m12); let t_b3_8 = vmulq_f64(self.twiddle5im, x8m11); let t_b3_9 = vmulq_f64(self.twiddle8im, x9m10); let t_b4_1 = vmulq_f64(self.twiddle4im, x1m18); let t_b4_2 = vmulq_f64(self.twiddle8im, x2m17); let t_b4_3 = vmulq_f64(self.twiddle7im, x3m16); let t_b4_4 = vmulq_f64(self.twiddle3im, x4m15); let t_b4_5 = vmulq_f64(self.twiddle1im, x5m14); let t_b4_6 = vmulq_f64(self.twiddle5im, x6m13); let t_b4_7 = vmulq_f64(self.twiddle9im, x7m12); let t_b4_8 = vmulq_f64(self.twiddle6im, x8m11); let t_b4_9 = vmulq_f64(self.twiddle2im, x9m10); let t_b5_1 = vmulq_f64(self.twiddle5im, x1m18); let t_b5_2 = vmulq_f64(self.twiddle9im, x2m17); let t_b5_3 = vmulq_f64(self.twiddle4im, x3m16); let t_b5_4 = vmulq_f64(self.twiddle1im, x4m15); let t_b5_5 = vmulq_f64(self.twiddle6im, x5m14); let t_b5_6 = vmulq_f64(self.twiddle8im, x6m13); let t_b5_7 = vmulq_f64(self.twiddle3im, x7m12); let t_b5_8 = vmulq_f64(self.twiddle2im, x8m11); let t_b5_9 = vmulq_f64(self.twiddle7im, x9m10); let t_b6_1 = vmulq_f64(self.twiddle6im, x1m18); let t_b6_2 = vmulq_f64(self.twiddle7im, x2m17); let t_b6_3 = vmulq_f64(self.twiddle1im, x3m16); let t_b6_4 = vmulq_f64(self.twiddle5im, x4m15); let t_b6_5 = vmulq_f64(self.twiddle8im, x5m14); let t_b6_6 = vmulq_f64(self.twiddle2im, x6m13); let t_b6_7 = vmulq_f64(self.twiddle4im, x7m12); let t_b6_8 = vmulq_f64(self.twiddle9im, x8m11); let t_b6_9 = vmulq_f64(self.twiddle3im, x9m10); let t_b7_1 = vmulq_f64(self.twiddle7im, x1m18); let t_b7_2 = vmulq_f64(self.twiddle5im, x2m17); let t_b7_3 = vmulq_f64(self.twiddle2im, x3m16); let t_b7_4 = vmulq_f64(self.twiddle9im, x4m15); let t_b7_5 = vmulq_f64(self.twiddle3im, x5m14); let t_b7_6 = vmulq_f64(self.twiddle4im, x6m13); let t_b7_7 = vmulq_f64(self.twiddle8im, x7m12); let t_b7_8 = vmulq_f64(self.twiddle1im, x8m11); let t_b7_9 = vmulq_f64(self.twiddle6im, x9m10); let t_b8_1 = vmulq_f64(self.twiddle8im, x1m18); let t_b8_2 = vmulq_f64(self.twiddle3im, x2m17); let t_b8_3 = vmulq_f64(self.twiddle5im, x3m16); let t_b8_4 = vmulq_f64(self.twiddle6im, x4m15); let t_b8_5 = vmulq_f64(self.twiddle2im, x5m14); let t_b8_6 = vmulq_f64(self.twiddle9im, x6m13); let t_b8_7 = vmulq_f64(self.twiddle1im, x7m12); let t_b8_8 = vmulq_f64(self.twiddle7im, x8m11); let t_b8_9 = vmulq_f64(self.twiddle4im, x9m10); let t_b9_1 = vmulq_f64(self.twiddle9im, x1m18); let t_b9_2 = vmulq_f64(self.twiddle1im, x2m17); let t_b9_3 = vmulq_f64(self.twiddle8im, x3m16); let t_b9_4 = vmulq_f64(self.twiddle2im, x4m15); let t_b9_5 = vmulq_f64(self.twiddle7im, x5m14); let t_b9_6 = vmulq_f64(self.twiddle3im, x6m13); let t_b9_7 = vmulq_f64(self.twiddle6im, x7m12); let t_b9_8 = vmulq_f64(self.twiddle4im, x8m11); let t_b9_9 = vmulq_f64(self.twiddle5im, x9m10); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9); let t_a7 = calc_f64!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + 
t_a7_9); let t_a8 = calc_f64!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9); let t_a9 = calc_f64!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 + t_b3_7 + t_b3_8 + t_b3_9); let t_b4 = calc_f64!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 + t_b4_7 - t_b4_8 - t_b4_9); let t_b5 = calc_f64!(t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 - t_b5_7 + t_b5_8 + t_b5_9); let t_b6 = calc_f64!(t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 - t_b6_5 - t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9); let t_b7 = calc_f64!(t_b7_1 - t_b7_2 + t_b7_3 + t_b7_4 - t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9); let t_b8 = calc_f64!(t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 - t_b8_9); let t_b9 = calc_f64!(t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 - t_b9_8 + t_b9_9); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let y0 = calc_f64!(x0 + x1p18 + x2p17 + x3p16 + x4p15 + x5p14 + x6p13 + x7p12 + x8p11 + x9p10); let [y1, y18] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y17] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y16] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y15] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y14] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y13] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y12] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y11] = solo_fft2_f64(t_a8, t_b8_rot); let [y9, y10] = solo_fft2_f64(t_a9, t_b9_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18] } } // ____ _____ _________ _ _ _ // |___ \|___ / |___ /___ \| |__ (_) |_ // __) | |_ \ _____ |_ \ __) | '_ \| | __| // / __/ ___) | |_____| ___) / __/| |_) | | |_ // |_____|____/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly23 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: float32x4_t, twiddle1im: float32x4_t, twiddle2re: float32x4_t, twiddle2im: float32x4_t, twiddle3re: float32x4_t, twiddle3im: float32x4_t, twiddle4re: float32x4_t, twiddle4im: float32x4_t, twiddle5re: float32x4_t, twiddle5im: float32x4_t, twiddle6re: float32x4_t, twiddle6im: float32x4_t, twiddle7re: float32x4_t, twiddle7im: float32x4_t, twiddle8re: float32x4_t, twiddle8im: float32x4_t, twiddle9re: float32x4_t, twiddle9im: float32x4_t, twiddle10re: float32x4_t, twiddle10im: float32x4_t, twiddle11re: float32x4_t, twiddle11im: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly23, 23, |this: &NeonF32Butterfly23<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly23, 23, |this: &NeonF32Butterfly23<_>| this .direction); impl NeonF32Butterfly23 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 23, direction); let tw2: Complex = twiddles::compute_twiddle(2, 23, direction); let tw3: Complex = twiddles::compute_twiddle(3, 23, direction); let 
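// A minimal scalar sketch of the decomposition used by the prime-size butterflies
// above, assuming only `num_complex`; `dft_odd_reference` is an illustrative name,
// not an item in this crate. For odd n, inputs are paired as x[k] +/- x[n-k]; the
// cosine parts of the twiddles weight the sums (the t_a* terms), the sine parts
// weight the differences (the t_b* terms), and each output pair y[j], y[n-j] is a
// 2-point combine after rotating the sine term by 90 degrees, mirroring the
// rotate + solo_fft2_f64 steps in the code above.
use num_complex::Complex;
use std::f64::consts::PI;

fn dft_odd_reference(x: &[Complex<f64>]) -> Vec<Complex<f64>> {
    let n = x.len();
    assert!(n % 2 == 1 && n > 1);
    let m = (n - 1) / 2;
    // Pair x[k] with x[n-k]: sums feed the cosine terms, differences the sine terms.
    let sums: Vec<Complex<f64>> = (1..=m).map(|k| x[k] + x[n - k]).collect();
    let diffs: Vec<Complex<f64>> = (1..=m).map(|k| x[k] - x[n - k]).collect();
    let mut out = vec![Complex::new(0.0, 0.0); n];
    // y[0] is the plain sum of all inputs.
    out[0] = sums.iter().fold(x[0], |acc, &s| acc + s);
    for j in 1..=m {
        let mut t_a = x[0];
        let mut t_b = Complex::new(0.0, 0.0);
        for k in 1..=m {
            // Forward-direction twiddle angle; cos feeds t_a, sin feeds t_b.
            let angle = -2.0 * PI * (j * k) as f64 / n as f64;
            t_a = t_a + sums[k - 1] * angle.cos();
            t_b = t_b + diffs[k - 1] * angle.sin();
        }
        // Multiply t_b by i (a 90 degree rotation), then combine as a 2-point FFT.
        let t_b_rot = Complex::new(-t_b.im, t_b.re);
        out[j] = t_a + t_b_rot;
        out[n - j] = t_a - t_b_rot;
    }
    out
}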
tw4: Complex = twiddles::compute_twiddle(4, 23, direction); let tw5: Complex = twiddles::compute_twiddle(5, 23, direction); let tw6: Complex = twiddles::compute_twiddle(6, 23, direction); let tw7: Complex = twiddles::compute_twiddle(7, 23, direction); let tw8: Complex = twiddles::compute_twiddle(8, 23, direction); let tw9: Complex = twiddles::compute_twiddle(9, 23, direction); let tw10: Complex = twiddles::compute_twiddle(10, 23, direction); let tw11: Complex = twiddles::compute_twiddle(11, 23, direction); let twiddle1re = unsafe { vmovq_n_f32(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f32(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f32(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f32(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f32(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f32(tw3.im) }; let twiddle4re = unsafe { vmovq_n_f32(tw4.re) }; let twiddle4im = unsafe { vmovq_n_f32(tw4.im) }; let twiddle5re = unsafe { vmovq_n_f32(tw5.re) }; let twiddle5im = unsafe { vmovq_n_f32(tw5.im) }; let twiddle6re = unsafe { vmovq_n_f32(tw6.re) }; let twiddle6im = unsafe { vmovq_n_f32(tw6.im) }; let twiddle7re = unsafe { vmovq_n_f32(tw7.re) }; let twiddle7im = unsafe { vmovq_n_f32(tw7.im) }; let twiddle8re = unsafe { vmovq_n_f32(tw8.re) }; let twiddle8im = unsafe { vmovq_n_f32(tw8.im) }; let twiddle9re = unsafe { vmovq_n_f32(tw9.re) }; let twiddle9im = unsafe { vmovq_n_f32(tw9.im) }; let twiddle10re = unsafe { vmovq_n_f32(tw10.re) }; let twiddle10im = unsafe { vmovq_n_f32(tw10.im) }; let twiddle11re = unsafe { vmovq_n_f32(tw11.re) }; let twiddle11im = unsafe { vmovq_n_f32(tw11.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[11]), extract_hi_lo_f32(input_packed[0], input_packed[12]), extract_lo_hi_f32(input_packed[1], input_packed[12]), extract_hi_lo_f32(input_packed[1], input_packed[13]), extract_lo_hi_f32(input_packed[2], input_packed[13]), extract_hi_lo_f32(input_packed[2], input_packed[14]), extract_lo_hi_f32(input_packed[3], input_packed[14]), extract_hi_lo_f32(input_packed[3], input_packed[15]), extract_lo_hi_f32(input_packed[4], input_packed[15]), extract_hi_lo_f32(input_packed[4], input_packed[16]), extract_lo_hi_f32(input_packed[5], input_packed[16]), extract_hi_lo_f32(input_packed[5], input_packed[17]), extract_lo_hi_f32(input_packed[6], input_packed[17]), extract_hi_lo_f32(input_packed[6], input_packed[18]), extract_lo_hi_f32(input_packed[7], input_packed[18]), extract_hi_lo_f32(input_packed[7], input_packed[19]), extract_lo_hi_f32(input_packed[8], input_packed[19]), 
extract_hi_lo_f32(input_packed[8], input_packed[20]), extract_lo_hi_f32(input_packed[9], input_packed[20]), extract_hi_lo_f32(input_packed[9], input_packed[21]), extract_lo_hi_f32(input_packed[10], input_packed[21]), extract_hi_lo_f32(input_packed[10], input_packed[22]), extract_lo_hi_f32(input_packed[11], input_packed[22]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_lo_f32(out[16], out[17]), extract_lo_lo_f32(out[18], out[19]), extract_lo_lo_f32(out[20], out[21]), extract_lo_hi_f32(out[22], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), extract_hi_hi_f32(out[19], out[20]), extract_hi_hi_f32(out[21], out[22]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [float32x4_t; 23]) -> [float32x4_t; 23] { let [x1p22, x1m22] = parallel_fft2_interleaved_f32(values[1], values[22]); let [x2p21, x2m21] = parallel_fft2_interleaved_f32(values[2], values[21]); let [x3p20, x3m20] = parallel_fft2_interleaved_f32(values[3], values[20]); let [x4p19, x4m19] = parallel_fft2_interleaved_f32(values[4], values[19]); let [x5p18, x5m18] = parallel_fft2_interleaved_f32(values[5], values[18]); let [x6p17, x6m17] = parallel_fft2_interleaved_f32(values[6], values[17]); let [x7p16, x7m16] = parallel_fft2_interleaved_f32(values[7], values[16]); let [x8p15, x8m15] = parallel_fft2_interleaved_f32(values[8], values[15]); let [x9p14, x9m14] = parallel_fft2_interleaved_f32(values[9], values[14]); let [x10p13, x10m13] = parallel_fft2_interleaved_f32(values[10], values[13]); let [x11p12, x11m12] = parallel_fft2_interleaved_f32(values[11], values[12]); let t_a1_1 = vmulq_f32(self.twiddle1re, x1p22); let t_a1_2 = vmulq_f32(self.twiddle2re, x2p21); let t_a1_3 = vmulq_f32(self.twiddle3re, x3p20); let t_a1_4 = vmulq_f32(self.twiddle4re, x4p19); let t_a1_5 = vmulq_f32(self.twiddle5re, x5p18); let t_a1_6 = vmulq_f32(self.twiddle6re, x6p17); let t_a1_7 = vmulq_f32(self.twiddle7re, x7p16); let t_a1_8 = vmulq_f32(self.twiddle8re, x8p15); let t_a1_9 = vmulq_f32(self.twiddle9re, x9p14); let t_a1_10 = vmulq_f32(self.twiddle10re, x10p13); let t_a1_11 = vmulq_f32(self.twiddle11re, x11p12); let t_a2_1 = vmulq_f32(self.twiddle2re, x1p22); let t_a2_2 = vmulq_f32(self.twiddle4re, x2p21); let t_a2_3 = vmulq_f32(self.twiddle6re, x3p20); let t_a2_4 = vmulq_f32(self.twiddle8re, x4p19); let t_a2_5 = vmulq_f32(self.twiddle10re, x5p18); let t_a2_6 = vmulq_f32(self.twiddle11re, x6p17); let t_a2_7 = vmulq_f32(self.twiddle9re, x7p16); let t_a2_8 = vmulq_f32(self.twiddle7re, x8p15); let t_a2_9 = vmulq_f32(self.twiddle5re, x9p14); let t_a2_10 = vmulq_f32(self.twiddle3re, x10p13); let t_a2_11 = vmulq_f32(self.twiddle1re, x11p12); let t_a3_1 = vmulq_f32(self.twiddle3re, x1p22); let t_a3_2 = vmulq_f32(self.twiddle6re, x2p21); let t_a3_3 = vmulq_f32(self.twiddle9re, x3p20); let t_a3_4 
= vmulq_f32(self.twiddle11re, x4p19); let t_a3_5 = vmulq_f32(self.twiddle8re, x5p18); let t_a3_6 = vmulq_f32(self.twiddle5re, x6p17); let t_a3_7 = vmulq_f32(self.twiddle2re, x7p16); let t_a3_8 = vmulq_f32(self.twiddle1re, x8p15); let t_a3_9 = vmulq_f32(self.twiddle4re, x9p14); let t_a3_10 = vmulq_f32(self.twiddle7re, x10p13); let t_a3_11 = vmulq_f32(self.twiddle10re, x11p12); let t_a4_1 = vmulq_f32(self.twiddle4re, x1p22); let t_a4_2 = vmulq_f32(self.twiddle8re, x2p21); let t_a4_3 = vmulq_f32(self.twiddle11re, x3p20); let t_a4_4 = vmulq_f32(self.twiddle7re, x4p19); let t_a4_5 = vmulq_f32(self.twiddle3re, x5p18); let t_a4_6 = vmulq_f32(self.twiddle1re, x6p17); let t_a4_7 = vmulq_f32(self.twiddle5re, x7p16); let t_a4_8 = vmulq_f32(self.twiddle9re, x8p15); let t_a4_9 = vmulq_f32(self.twiddle10re, x9p14); let t_a4_10 = vmulq_f32(self.twiddle6re, x10p13); let t_a4_11 = vmulq_f32(self.twiddle2re, x11p12); let t_a5_1 = vmulq_f32(self.twiddle5re, x1p22); let t_a5_2 = vmulq_f32(self.twiddle10re, x2p21); let t_a5_3 = vmulq_f32(self.twiddle8re, x3p20); let t_a5_4 = vmulq_f32(self.twiddle3re, x4p19); let t_a5_5 = vmulq_f32(self.twiddle2re, x5p18); let t_a5_6 = vmulq_f32(self.twiddle7re, x6p17); let t_a5_7 = vmulq_f32(self.twiddle11re, x7p16); let t_a5_8 = vmulq_f32(self.twiddle6re, x8p15); let t_a5_9 = vmulq_f32(self.twiddle1re, x9p14); let t_a5_10 = vmulq_f32(self.twiddle4re, x10p13); let t_a5_11 = vmulq_f32(self.twiddle9re, x11p12); let t_a6_1 = vmulq_f32(self.twiddle6re, x1p22); let t_a6_2 = vmulq_f32(self.twiddle11re, x2p21); let t_a6_3 = vmulq_f32(self.twiddle5re, x3p20); let t_a6_4 = vmulq_f32(self.twiddle1re, x4p19); let t_a6_5 = vmulq_f32(self.twiddle7re, x5p18); let t_a6_6 = vmulq_f32(self.twiddle10re, x6p17); let t_a6_7 = vmulq_f32(self.twiddle4re, x7p16); let t_a6_8 = vmulq_f32(self.twiddle2re, x8p15); let t_a6_9 = vmulq_f32(self.twiddle8re, x9p14); let t_a6_10 = vmulq_f32(self.twiddle9re, x10p13); let t_a6_11 = vmulq_f32(self.twiddle3re, x11p12); let t_a7_1 = vmulq_f32(self.twiddle7re, x1p22); let t_a7_2 = vmulq_f32(self.twiddle9re, x2p21); let t_a7_3 = vmulq_f32(self.twiddle2re, x3p20); let t_a7_4 = vmulq_f32(self.twiddle5re, x4p19); let t_a7_5 = vmulq_f32(self.twiddle11re, x5p18); let t_a7_6 = vmulq_f32(self.twiddle4re, x6p17); let t_a7_7 = vmulq_f32(self.twiddle3re, x7p16); let t_a7_8 = vmulq_f32(self.twiddle10re, x8p15); let t_a7_9 = vmulq_f32(self.twiddle6re, x9p14); let t_a7_10 = vmulq_f32(self.twiddle1re, x10p13); let t_a7_11 = vmulq_f32(self.twiddle8re, x11p12); let t_a8_1 = vmulq_f32(self.twiddle8re, x1p22); let t_a8_2 = vmulq_f32(self.twiddle7re, x2p21); let t_a8_3 = vmulq_f32(self.twiddle1re, x3p20); let t_a8_4 = vmulq_f32(self.twiddle9re, x4p19); let t_a8_5 = vmulq_f32(self.twiddle6re, x5p18); let t_a8_6 = vmulq_f32(self.twiddle2re, x6p17); let t_a8_7 = vmulq_f32(self.twiddle10re, x7p16); let t_a8_8 = vmulq_f32(self.twiddle5re, x8p15); let t_a8_9 = vmulq_f32(self.twiddle3re, x9p14); let t_a8_10 = vmulq_f32(self.twiddle11re, x10p13); let t_a8_11 = vmulq_f32(self.twiddle4re, x11p12); let t_a9_1 = vmulq_f32(self.twiddle9re, x1p22); let t_a9_2 = vmulq_f32(self.twiddle5re, x2p21); let t_a9_3 = vmulq_f32(self.twiddle4re, x3p20); let t_a9_4 = vmulq_f32(self.twiddle10re, x4p19); let t_a9_5 = vmulq_f32(self.twiddle1re, x5p18); let t_a9_6 = vmulq_f32(self.twiddle8re, x6p17); let t_a9_7 = vmulq_f32(self.twiddle6re, x7p16); let t_a9_8 = vmulq_f32(self.twiddle3re, x8p15); let t_a9_9 = vmulq_f32(self.twiddle11re, x9p14); let t_a9_10 = vmulq_f32(self.twiddle2re, x10p13); let t_a9_11 = 
vmulq_f32(self.twiddle7re, x11p12); let t_a10_1 = vmulq_f32(self.twiddle10re, x1p22); let t_a10_2 = vmulq_f32(self.twiddle3re, x2p21); let t_a10_3 = vmulq_f32(self.twiddle7re, x3p20); let t_a10_4 = vmulq_f32(self.twiddle6re, x4p19); let t_a10_5 = vmulq_f32(self.twiddle4re, x5p18); let t_a10_6 = vmulq_f32(self.twiddle9re, x6p17); let t_a10_7 = vmulq_f32(self.twiddle1re, x7p16); let t_a10_8 = vmulq_f32(self.twiddle11re, x8p15); let t_a10_9 = vmulq_f32(self.twiddle2re, x9p14); let t_a10_10 = vmulq_f32(self.twiddle8re, x10p13); let t_a10_11 = vmulq_f32(self.twiddle5re, x11p12); let t_a11_1 = vmulq_f32(self.twiddle11re, x1p22); let t_a11_2 = vmulq_f32(self.twiddle1re, x2p21); let t_a11_3 = vmulq_f32(self.twiddle10re, x3p20); let t_a11_4 = vmulq_f32(self.twiddle2re, x4p19); let t_a11_5 = vmulq_f32(self.twiddle9re, x5p18); let t_a11_6 = vmulq_f32(self.twiddle3re, x6p17); let t_a11_7 = vmulq_f32(self.twiddle8re, x7p16); let t_a11_8 = vmulq_f32(self.twiddle4re, x8p15); let t_a11_9 = vmulq_f32(self.twiddle7re, x9p14); let t_a11_10 = vmulq_f32(self.twiddle5re, x10p13); let t_a11_11 = vmulq_f32(self.twiddle6re, x11p12); let t_b1_1 = vmulq_f32(self.twiddle1im, x1m22); let t_b1_2 = vmulq_f32(self.twiddle2im, x2m21); let t_b1_3 = vmulq_f32(self.twiddle3im, x3m20); let t_b1_4 = vmulq_f32(self.twiddle4im, x4m19); let t_b1_5 = vmulq_f32(self.twiddle5im, x5m18); let t_b1_6 = vmulq_f32(self.twiddle6im, x6m17); let t_b1_7 = vmulq_f32(self.twiddle7im, x7m16); let t_b1_8 = vmulq_f32(self.twiddle8im, x8m15); let t_b1_9 = vmulq_f32(self.twiddle9im, x9m14); let t_b1_10 = vmulq_f32(self.twiddle10im, x10m13); let t_b1_11 = vmulq_f32(self.twiddle11im, x11m12); let t_b2_1 = vmulq_f32(self.twiddle2im, x1m22); let t_b2_2 = vmulq_f32(self.twiddle4im, x2m21); let t_b2_3 = vmulq_f32(self.twiddle6im, x3m20); let t_b2_4 = vmulq_f32(self.twiddle8im, x4m19); let t_b2_5 = vmulq_f32(self.twiddle10im, x5m18); let t_b2_6 = vmulq_f32(self.twiddle11im, x6m17); let t_b2_7 = vmulq_f32(self.twiddle9im, x7m16); let t_b2_8 = vmulq_f32(self.twiddle7im, x8m15); let t_b2_9 = vmulq_f32(self.twiddle5im, x9m14); let t_b2_10 = vmulq_f32(self.twiddle3im, x10m13); let t_b2_11 = vmulq_f32(self.twiddle1im, x11m12); let t_b3_1 = vmulq_f32(self.twiddle3im, x1m22); let t_b3_2 = vmulq_f32(self.twiddle6im, x2m21); let t_b3_3 = vmulq_f32(self.twiddle9im, x3m20); let t_b3_4 = vmulq_f32(self.twiddle11im, x4m19); let t_b3_5 = vmulq_f32(self.twiddle8im, x5m18); let t_b3_6 = vmulq_f32(self.twiddle5im, x6m17); let t_b3_7 = vmulq_f32(self.twiddle2im, x7m16); let t_b3_8 = vmulq_f32(self.twiddle1im, x8m15); let t_b3_9 = vmulq_f32(self.twiddle4im, x9m14); let t_b3_10 = vmulq_f32(self.twiddle7im, x10m13); let t_b3_11 = vmulq_f32(self.twiddle10im, x11m12); let t_b4_1 = vmulq_f32(self.twiddle4im, x1m22); let t_b4_2 = vmulq_f32(self.twiddle8im, x2m21); let t_b4_3 = vmulq_f32(self.twiddle11im, x3m20); let t_b4_4 = vmulq_f32(self.twiddle7im, x4m19); let t_b4_5 = vmulq_f32(self.twiddle3im, x5m18); let t_b4_6 = vmulq_f32(self.twiddle1im, x6m17); let t_b4_7 = vmulq_f32(self.twiddle5im, x7m16); let t_b4_8 = vmulq_f32(self.twiddle9im, x8m15); let t_b4_9 = vmulq_f32(self.twiddle10im, x9m14); let t_b4_10 = vmulq_f32(self.twiddle6im, x10m13); let t_b4_11 = vmulq_f32(self.twiddle2im, x11m12); let t_b5_1 = vmulq_f32(self.twiddle5im, x1m22); let t_b5_2 = vmulq_f32(self.twiddle10im, x2m21); let t_b5_3 = vmulq_f32(self.twiddle8im, x3m20); let t_b5_4 = vmulq_f32(self.twiddle3im, x4m19); let t_b5_5 = vmulq_f32(self.twiddle2im, x5m18); let t_b5_6 = vmulq_f32(self.twiddle7im, x6m17); let 
t_b5_7 = vmulq_f32(self.twiddle11im, x7m16); let t_b5_8 = vmulq_f32(self.twiddle6im, x8m15); let t_b5_9 = vmulq_f32(self.twiddle1im, x9m14); let t_b5_10 = vmulq_f32(self.twiddle4im, x10m13); let t_b5_11 = vmulq_f32(self.twiddle9im, x11m12); let t_b6_1 = vmulq_f32(self.twiddle6im, x1m22); let t_b6_2 = vmulq_f32(self.twiddle11im, x2m21); let t_b6_3 = vmulq_f32(self.twiddle5im, x3m20); let t_b6_4 = vmulq_f32(self.twiddle1im, x4m19); let t_b6_5 = vmulq_f32(self.twiddle7im, x5m18); let t_b6_6 = vmulq_f32(self.twiddle10im, x6m17); let t_b6_7 = vmulq_f32(self.twiddle4im, x7m16); let t_b6_8 = vmulq_f32(self.twiddle2im, x8m15); let t_b6_9 = vmulq_f32(self.twiddle8im, x9m14); let t_b6_10 = vmulq_f32(self.twiddle9im, x10m13); let t_b6_11 = vmulq_f32(self.twiddle3im, x11m12); let t_b7_1 = vmulq_f32(self.twiddle7im, x1m22); let t_b7_2 = vmulq_f32(self.twiddle9im, x2m21); let t_b7_3 = vmulq_f32(self.twiddle2im, x3m20); let t_b7_4 = vmulq_f32(self.twiddle5im, x4m19); let t_b7_5 = vmulq_f32(self.twiddle11im, x5m18); let t_b7_6 = vmulq_f32(self.twiddle4im, x6m17); let t_b7_7 = vmulq_f32(self.twiddle3im, x7m16); let t_b7_8 = vmulq_f32(self.twiddle10im, x8m15); let t_b7_9 = vmulq_f32(self.twiddle6im, x9m14); let t_b7_10 = vmulq_f32(self.twiddle1im, x10m13); let t_b7_11 = vmulq_f32(self.twiddle8im, x11m12); let t_b8_1 = vmulq_f32(self.twiddle8im, x1m22); let t_b8_2 = vmulq_f32(self.twiddle7im, x2m21); let t_b8_3 = vmulq_f32(self.twiddle1im, x3m20); let t_b8_4 = vmulq_f32(self.twiddle9im, x4m19); let t_b8_5 = vmulq_f32(self.twiddle6im, x5m18); let t_b8_6 = vmulq_f32(self.twiddle2im, x6m17); let t_b8_7 = vmulq_f32(self.twiddle10im, x7m16); let t_b8_8 = vmulq_f32(self.twiddle5im, x8m15); let t_b8_9 = vmulq_f32(self.twiddle3im, x9m14); let t_b8_10 = vmulq_f32(self.twiddle11im, x10m13); let t_b8_11 = vmulq_f32(self.twiddle4im, x11m12); let t_b9_1 = vmulq_f32(self.twiddle9im, x1m22); let t_b9_2 = vmulq_f32(self.twiddle5im, x2m21); let t_b9_3 = vmulq_f32(self.twiddle4im, x3m20); let t_b9_4 = vmulq_f32(self.twiddle10im, x4m19); let t_b9_5 = vmulq_f32(self.twiddle1im, x5m18); let t_b9_6 = vmulq_f32(self.twiddle8im, x6m17); let t_b9_7 = vmulq_f32(self.twiddle6im, x7m16); let t_b9_8 = vmulq_f32(self.twiddle3im, x8m15); let t_b9_9 = vmulq_f32(self.twiddle11im, x9m14); let t_b9_10 = vmulq_f32(self.twiddle2im, x10m13); let t_b9_11 = vmulq_f32(self.twiddle7im, x11m12); let t_b10_1 = vmulq_f32(self.twiddle10im, x1m22); let t_b10_2 = vmulq_f32(self.twiddle3im, x2m21); let t_b10_3 = vmulq_f32(self.twiddle7im, x3m20); let t_b10_4 = vmulq_f32(self.twiddle6im, x4m19); let t_b10_5 = vmulq_f32(self.twiddle4im, x5m18); let t_b10_6 = vmulq_f32(self.twiddle9im, x6m17); let t_b10_7 = vmulq_f32(self.twiddle1im, x7m16); let t_b10_8 = vmulq_f32(self.twiddle11im, x8m15); let t_b10_9 = vmulq_f32(self.twiddle2im, x9m14); let t_b10_10 = vmulq_f32(self.twiddle8im, x10m13); let t_b10_11 = vmulq_f32(self.twiddle5im, x11m12); let t_b11_1 = vmulq_f32(self.twiddle11im, x1m22); let t_b11_2 = vmulq_f32(self.twiddle1im, x2m21); let t_b11_3 = vmulq_f32(self.twiddle10im, x3m20); let t_b11_4 = vmulq_f32(self.twiddle2im, x4m19); let t_b11_5 = vmulq_f32(self.twiddle9im, x5m18); let t_b11_6 = vmulq_f32(self.twiddle3im, x6m17); let t_b11_7 = vmulq_f32(self.twiddle8im, x7m16); let t_b11_8 = vmulq_f32(self.twiddle4im, x8m15); let t_b11_9 = vmulq_f32(self.twiddle7im, x9m14); let t_b11_10 = vmulq_f32(self.twiddle5im, x10m13); let t_b11_11 = vmulq_f32(self.twiddle6im, x11m12); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + 
t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11); let t_a7 = calc_f32!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11); let t_a8 = calc_f32!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11); let t_a9 = calc_f32!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11); let t_a10 = calc_f32!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11); let t_a11 = calc_f32!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 + t_b3_8 + t_b3_9 + t_b3_10 + t_b3_11); let t_b4 = calc_f32!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 - t_b4_5 + t_b4_6 + t_b4_7 + t_b4_8 - t_b4_9 - t_b4_10 - t_b4_11); let t_b5 = calc_f32!(t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 + t_b5_5 + t_b5_6 - t_b5_7 - t_b5_8 - t_b5_9 + t_b5_10 + t_b5_11); let t_b6 = calc_f32!(t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 + t_b6_5 - t_b6_6 - t_b6_7 + t_b6_8 + t_b6_9 - t_b6_10 - t_b6_11); let t_b7 = calc_f32!(t_b7_1 - t_b7_2 - t_b7_3 + t_b7_4 - t_b7_5 - t_b7_6 + t_b7_7 + t_b7_8 - t_b7_9 + t_b7_10 + t_b7_11); let t_b8 = calc_f32!(t_b8_1 - t_b8_2 + t_b8_3 + t_b8_4 - t_b8_5 + t_b8_6 + t_b8_7 - t_b8_8 + t_b8_9 + t_b8_10 - t_b8_11); let t_b9 = calc_f32!(t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 - t_b9_5 + t_b9_6 - t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11); let t_b10 = calc_f32!(t_b10_1 - t_b10_2 + t_b10_3 - t_b10_4 + t_b10_5 - t_b10_6 + t_b10_7 + t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11); let t_b11 = calc_f32!(t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 + t_b11_5 - t_b11_6 + t_b11_7 - t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let t_b9_rot = self.rotate.rotate_both(t_b9); let t_b10_rot = self.rotate.rotate_both(t_b10); let t_b11_rot = self.rotate.rotate_both(t_b11); let y0 = calc_f32!(x0 + x1p22 + x2p21 + x3p20 + x4p19 + x5p18 + x6p17 + x7p16 + x8p15 + x9p14 + x10p13 + x11p12); let [y1, y22] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y21] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y20] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y19] = 
parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y18] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y17] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); let [y7, y16] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot); let [y8, y15] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot); let [y9, y14] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot); let [y10, y13] = parallel_fft2_interleaved_f32(t_a10, t_b10_rot); let [y11, y12] = parallel_fft2_interleaved_f32(t_a11, t_b11_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22] } } // ____ _____ __ _ _ _ _ _ // |___ \|___ / / /_ | || | | |__ (_) |_ // __) | |_ \ _____ | '_ \| || |_| '_ \| | __| // / __/ ___) | |_____| | (_) |__ _| |_) | | |_ // |_____|____/ \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly23 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: float64x2_t, twiddle1im: float64x2_t, twiddle2re: float64x2_t, twiddle2im: float64x2_t, twiddle3re: float64x2_t, twiddle3im: float64x2_t, twiddle4re: float64x2_t, twiddle4im: float64x2_t, twiddle5re: float64x2_t, twiddle5im: float64x2_t, twiddle6re: float64x2_t, twiddle6im: float64x2_t, twiddle7re: float64x2_t, twiddle7im: float64x2_t, twiddle8re: float64x2_t, twiddle8im: float64x2_t, twiddle9re: float64x2_t, twiddle9im: float64x2_t, twiddle10re: float64x2_t, twiddle10im: float64x2_t, twiddle11re: float64x2_t, twiddle11im: float64x2_t, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly23, 23, |this: &NeonF64Butterfly23<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly23, 23, |this: &NeonF64Butterfly23<_>| this .direction); impl NeonF64Butterfly23 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 23, direction); let tw2: Complex = twiddles::compute_twiddle(2, 23, direction); let tw3: Complex = twiddles::compute_twiddle(3, 23, direction); let tw4: Complex = twiddles::compute_twiddle(4, 23, direction); let tw5: Complex = twiddles::compute_twiddle(5, 23, direction); let tw6: Complex = twiddles::compute_twiddle(6, 23, direction); let tw7: Complex = twiddles::compute_twiddle(7, 23, direction); let tw8: Complex = twiddles::compute_twiddle(8, 23, direction); let tw9: Complex = twiddles::compute_twiddle(9, 23, direction); let tw10: Complex = twiddles::compute_twiddle(10, 23, direction); let tw11: Complex = twiddles::compute_twiddle(11, 23, direction); let twiddle1re = unsafe { vmovq_n_f64(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f64(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f64(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f64(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f64(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f64(tw3.im) }; let twiddle4re = unsafe { vmovq_n_f64(tw4.re) }; let twiddle4im = unsafe { vmovq_n_f64(tw4.im) }; let twiddle5re = unsafe { vmovq_n_f64(tw5.re) }; let twiddle5im = unsafe { vmovq_n_f64(tw5.im) }; let twiddle6re = unsafe { vmovq_n_f64(tw6.re) }; let twiddle6im = unsafe { vmovq_n_f64(tw6.im) }; let twiddle7re = unsafe { vmovq_n_f64(tw7.re) }; let twiddle7im = unsafe { vmovq_n_f64(tw7.im) }; let twiddle8re = unsafe { vmovq_n_f64(tw8.re) }; let twiddle8im = unsafe { vmovq_n_f64(tw8.im) }; let twiddle9re = unsafe { vmovq_n_f64(tw9.re) }; let twiddle9im = unsafe { vmovq_n_f64(tw9.im) }; let twiddle10re = unsafe { vmovq_n_f64(tw10.re) }; let twiddle10im = unsafe { vmovq_n_f64(tw10.im) }; let twiddle11re = 
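// Each float32x4_t in the f32 butterflies above carries two interleaved complex
// values, [re_a, im_a, re_b, im_b], so one call to perform_parallel_fft_direct
// evaluates two independent FFTs; the extract_lo_hi_f32 / extract_hi_lo_f32
// shuffles in perform_parallel_fft_contiguous only regroup adjacent loads into
// that pairing. A lane-level sketch with a plain [f32; 4] stand-in; `PackedPair`,
// `parallel_fft2` and `extract_lo_hi` are illustrative names, not items in this crate.
#[derive(Clone, Copy)]
struct PackedPair([f32; 4]); // [re_a, im_a, re_b, im_b]

fn parallel_fft2(a: PackedPair, b: PackedPair) -> [PackedPair; 2] {
    // Complex add/sub is lane-wise, so one 4-lane add and one 4-lane sub perform
    // the 2-point FFT (y0 = a + b, y1 = a - b) on both packed values at once.
    let mut sum = [0.0f32; 4];
    let mut diff = [0.0f32; 4];
    for i in 0..4 {
        sum[i] = a.0[i] + b.0[i];
        diff[i] = a.0[i] - b.0[i];
    }
    [PackedPair(sum), PackedPair(diff)]
}

fn extract_lo_hi(a: PackedPair, b: PackedPair) -> PackedPair {
    // First complex value of `a` next to the second complex value of `b`,
    // modelling the low-half/high-half regrouping done by extract_lo_hi_f32.
    PackedPair([a.0[0], a.0[1], b.0[2], b.0[3]])
}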
unsafe { vmovq_n_f64(tw11.re) }; let twiddle11im = unsafe { vmovq_n_f64(tw11.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 23]) -> [float64x2_t; 23] { let [x1p22, x1m22] = solo_fft2_f64(values[1], values[22]); let [x2p21, x2m21] = solo_fft2_f64(values[2], values[21]); let [x3p20, x3m20] = solo_fft2_f64(values[3], values[20]); let [x4p19, x4m19] = solo_fft2_f64(values[4], values[19]); let [x5p18, x5m18] = solo_fft2_f64(values[5], values[18]); let [x6p17, x6m17] = solo_fft2_f64(values[6], values[17]); let [x7p16, x7m16] = solo_fft2_f64(values[7], values[16]); let [x8p15, x8m15] = solo_fft2_f64(values[8], values[15]); let [x9p14, x9m14] = solo_fft2_f64(values[9], values[14]); let [x10p13, x10m13] = solo_fft2_f64(values[10], values[13]); let [x11p12, x11m12] = solo_fft2_f64(values[11], values[12]); let t_a1_1 = vmulq_f64(self.twiddle1re, x1p22); let t_a1_2 = vmulq_f64(self.twiddle2re, x2p21); let t_a1_3 = vmulq_f64(self.twiddle3re, x3p20); let t_a1_4 = vmulq_f64(self.twiddle4re, x4p19); let t_a1_5 = vmulq_f64(self.twiddle5re, x5p18); let t_a1_6 = vmulq_f64(self.twiddle6re, x6p17); let t_a1_7 = vmulq_f64(self.twiddle7re, x7p16); let t_a1_8 = vmulq_f64(self.twiddle8re, x8p15); let t_a1_9 = vmulq_f64(self.twiddle9re, x9p14); let t_a1_10 = vmulq_f64(self.twiddle10re, x10p13); let t_a1_11 = vmulq_f64(self.twiddle11re, x11p12); let t_a2_1 = vmulq_f64(self.twiddle2re, x1p22); let t_a2_2 = vmulq_f64(self.twiddle4re, x2p21); let t_a2_3 = vmulq_f64(self.twiddle6re, x3p20); let t_a2_4 = vmulq_f64(self.twiddle8re, x4p19); let t_a2_5 = vmulq_f64(self.twiddle10re, x5p18); let t_a2_6 = vmulq_f64(self.twiddle11re, x6p17); let t_a2_7 = vmulq_f64(self.twiddle9re, x7p16); let t_a2_8 = vmulq_f64(self.twiddle7re, x8p15); let t_a2_9 = vmulq_f64(self.twiddle5re, x9p14); let t_a2_10 = vmulq_f64(self.twiddle3re, x10p13); let t_a2_11 = vmulq_f64(self.twiddle1re, x11p12); let t_a3_1 = vmulq_f64(self.twiddle3re, x1p22); let t_a3_2 = vmulq_f64(self.twiddle6re, x2p21); let t_a3_3 = vmulq_f64(self.twiddle9re, x3p20); let t_a3_4 = vmulq_f64(self.twiddle11re, x4p19); let t_a3_5 = vmulq_f64(self.twiddle8re, x5p18); let t_a3_6 = vmulq_f64(self.twiddle5re, x6p17); let t_a3_7 = vmulq_f64(self.twiddle2re, x7p16); let t_a3_8 = vmulq_f64(self.twiddle1re, x8p15); let t_a3_9 = vmulq_f64(self.twiddle4re, x9p14); let t_a3_10 = vmulq_f64(self.twiddle7re, x10p13); let t_a3_11 = vmulq_f64(self.twiddle10re, x11p12); let t_a4_1 = vmulq_f64(self.twiddle4re, x1p22); let t_a4_2 = vmulq_f64(self.twiddle8re, x2p21); let t_a4_3 = vmulq_f64(self.twiddle11re, x3p20); let t_a4_4 = vmulq_f64(self.twiddle7re, x4p19); let t_a4_5 = vmulq_f64(self.twiddle3re, x5p18); let t_a4_6 = vmulq_f64(self.twiddle1re, x6p17); let t_a4_7 = vmulq_f64(self.twiddle5re, x7p16); let t_a4_8 = 
vmulq_f64(self.twiddle9re, x8p15); let t_a4_9 = vmulq_f64(self.twiddle10re, x9p14); let t_a4_10 = vmulq_f64(self.twiddle6re, x10p13); let t_a4_11 = vmulq_f64(self.twiddle2re, x11p12); let t_a5_1 = vmulq_f64(self.twiddle5re, x1p22); let t_a5_2 = vmulq_f64(self.twiddle10re, x2p21); let t_a5_3 = vmulq_f64(self.twiddle8re, x3p20); let t_a5_4 = vmulq_f64(self.twiddle3re, x4p19); let t_a5_5 = vmulq_f64(self.twiddle2re, x5p18); let t_a5_6 = vmulq_f64(self.twiddle7re, x6p17); let t_a5_7 = vmulq_f64(self.twiddle11re, x7p16); let t_a5_8 = vmulq_f64(self.twiddle6re, x8p15); let t_a5_9 = vmulq_f64(self.twiddle1re, x9p14); let t_a5_10 = vmulq_f64(self.twiddle4re, x10p13); let t_a5_11 = vmulq_f64(self.twiddle9re, x11p12); let t_a6_1 = vmulq_f64(self.twiddle6re, x1p22); let t_a6_2 = vmulq_f64(self.twiddle11re, x2p21); let t_a6_3 = vmulq_f64(self.twiddle5re, x3p20); let t_a6_4 = vmulq_f64(self.twiddle1re, x4p19); let t_a6_5 = vmulq_f64(self.twiddle7re, x5p18); let t_a6_6 = vmulq_f64(self.twiddle10re, x6p17); let t_a6_7 = vmulq_f64(self.twiddle4re, x7p16); let t_a6_8 = vmulq_f64(self.twiddle2re, x8p15); let t_a6_9 = vmulq_f64(self.twiddle8re, x9p14); let t_a6_10 = vmulq_f64(self.twiddle9re, x10p13); let t_a6_11 = vmulq_f64(self.twiddle3re, x11p12); let t_a7_1 = vmulq_f64(self.twiddle7re, x1p22); let t_a7_2 = vmulq_f64(self.twiddle9re, x2p21); let t_a7_3 = vmulq_f64(self.twiddle2re, x3p20); let t_a7_4 = vmulq_f64(self.twiddle5re, x4p19); let t_a7_5 = vmulq_f64(self.twiddle11re, x5p18); let t_a7_6 = vmulq_f64(self.twiddle4re, x6p17); let t_a7_7 = vmulq_f64(self.twiddle3re, x7p16); let t_a7_8 = vmulq_f64(self.twiddle10re, x8p15); let t_a7_9 = vmulq_f64(self.twiddle6re, x9p14); let t_a7_10 = vmulq_f64(self.twiddle1re, x10p13); let t_a7_11 = vmulq_f64(self.twiddle8re, x11p12); let t_a8_1 = vmulq_f64(self.twiddle8re, x1p22); let t_a8_2 = vmulq_f64(self.twiddle7re, x2p21); let t_a8_3 = vmulq_f64(self.twiddle1re, x3p20); let t_a8_4 = vmulq_f64(self.twiddle9re, x4p19); let t_a8_5 = vmulq_f64(self.twiddle6re, x5p18); let t_a8_6 = vmulq_f64(self.twiddle2re, x6p17); let t_a8_7 = vmulq_f64(self.twiddle10re, x7p16); let t_a8_8 = vmulq_f64(self.twiddle5re, x8p15); let t_a8_9 = vmulq_f64(self.twiddle3re, x9p14); let t_a8_10 = vmulq_f64(self.twiddle11re, x10p13); let t_a8_11 = vmulq_f64(self.twiddle4re, x11p12); let t_a9_1 = vmulq_f64(self.twiddle9re, x1p22); let t_a9_2 = vmulq_f64(self.twiddle5re, x2p21); let t_a9_3 = vmulq_f64(self.twiddle4re, x3p20); let t_a9_4 = vmulq_f64(self.twiddle10re, x4p19); let t_a9_5 = vmulq_f64(self.twiddle1re, x5p18); let t_a9_6 = vmulq_f64(self.twiddle8re, x6p17); let t_a9_7 = vmulq_f64(self.twiddle6re, x7p16); let t_a9_8 = vmulq_f64(self.twiddle3re, x8p15); let t_a9_9 = vmulq_f64(self.twiddle11re, x9p14); let t_a9_10 = vmulq_f64(self.twiddle2re, x10p13); let t_a9_11 = vmulq_f64(self.twiddle7re, x11p12); let t_a10_1 = vmulq_f64(self.twiddle10re, x1p22); let t_a10_2 = vmulq_f64(self.twiddle3re, x2p21); let t_a10_3 = vmulq_f64(self.twiddle7re, x3p20); let t_a10_4 = vmulq_f64(self.twiddle6re, x4p19); let t_a10_5 = vmulq_f64(self.twiddle4re, x5p18); let t_a10_6 = vmulq_f64(self.twiddle9re, x6p17); let t_a10_7 = vmulq_f64(self.twiddle1re, x7p16); let t_a10_8 = vmulq_f64(self.twiddle11re, x8p15); let t_a10_9 = vmulq_f64(self.twiddle2re, x9p14); let t_a10_10 = vmulq_f64(self.twiddle8re, x10p13); let t_a10_11 = vmulq_f64(self.twiddle5re, x11p12); let t_a11_1 = vmulq_f64(self.twiddle11re, x1p22); let t_a11_2 = vmulq_f64(self.twiddle1re, x2p21); let t_a11_3 = vmulq_f64(self.twiddle10re, x3p20); let 
t_a11_4 = vmulq_f64(self.twiddle2re, x4p19); let t_a11_5 = vmulq_f64(self.twiddle9re, x5p18); let t_a11_6 = vmulq_f64(self.twiddle3re, x6p17); let t_a11_7 = vmulq_f64(self.twiddle8re, x7p16); let t_a11_8 = vmulq_f64(self.twiddle4re, x8p15); let t_a11_9 = vmulq_f64(self.twiddle7re, x9p14); let t_a11_10 = vmulq_f64(self.twiddle5re, x10p13); let t_a11_11 = vmulq_f64(self.twiddle6re, x11p12); let t_b1_1 = vmulq_f64(self.twiddle1im, x1m22); let t_b1_2 = vmulq_f64(self.twiddle2im, x2m21); let t_b1_3 = vmulq_f64(self.twiddle3im, x3m20); let t_b1_4 = vmulq_f64(self.twiddle4im, x4m19); let t_b1_5 = vmulq_f64(self.twiddle5im, x5m18); let t_b1_6 = vmulq_f64(self.twiddle6im, x6m17); let t_b1_7 = vmulq_f64(self.twiddle7im, x7m16); let t_b1_8 = vmulq_f64(self.twiddle8im, x8m15); let t_b1_9 = vmulq_f64(self.twiddle9im, x9m14); let t_b1_10 = vmulq_f64(self.twiddle10im, x10m13); let t_b1_11 = vmulq_f64(self.twiddle11im, x11m12); let t_b2_1 = vmulq_f64(self.twiddle2im, x1m22); let t_b2_2 = vmulq_f64(self.twiddle4im, x2m21); let t_b2_3 = vmulq_f64(self.twiddle6im, x3m20); let t_b2_4 = vmulq_f64(self.twiddle8im, x4m19); let t_b2_5 = vmulq_f64(self.twiddle10im, x5m18); let t_b2_6 = vmulq_f64(self.twiddle11im, x6m17); let t_b2_7 = vmulq_f64(self.twiddle9im, x7m16); let t_b2_8 = vmulq_f64(self.twiddle7im, x8m15); let t_b2_9 = vmulq_f64(self.twiddle5im, x9m14); let t_b2_10 = vmulq_f64(self.twiddle3im, x10m13); let t_b2_11 = vmulq_f64(self.twiddle1im, x11m12); let t_b3_1 = vmulq_f64(self.twiddle3im, x1m22); let t_b3_2 = vmulq_f64(self.twiddle6im, x2m21); let t_b3_3 = vmulq_f64(self.twiddle9im, x3m20); let t_b3_4 = vmulq_f64(self.twiddle11im, x4m19); let t_b3_5 = vmulq_f64(self.twiddle8im, x5m18); let t_b3_6 = vmulq_f64(self.twiddle5im, x6m17); let t_b3_7 = vmulq_f64(self.twiddle2im, x7m16); let t_b3_8 = vmulq_f64(self.twiddle1im, x8m15); let t_b3_9 = vmulq_f64(self.twiddle4im, x9m14); let t_b3_10 = vmulq_f64(self.twiddle7im, x10m13); let t_b3_11 = vmulq_f64(self.twiddle10im, x11m12); let t_b4_1 = vmulq_f64(self.twiddle4im, x1m22); let t_b4_2 = vmulq_f64(self.twiddle8im, x2m21); let t_b4_3 = vmulq_f64(self.twiddle11im, x3m20); let t_b4_4 = vmulq_f64(self.twiddle7im, x4m19); let t_b4_5 = vmulq_f64(self.twiddle3im, x5m18); let t_b4_6 = vmulq_f64(self.twiddle1im, x6m17); let t_b4_7 = vmulq_f64(self.twiddle5im, x7m16); let t_b4_8 = vmulq_f64(self.twiddle9im, x8m15); let t_b4_9 = vmulq_f64(self.twiddle10im, x9m14); let t_b4_10 = vmulq_f64(self.twiddle6im, x10m13); let t_b4_11 = vmulq_f64(self.twiddle2im, x11m12); let t_b5_1 = vmulq_f64(self.twiddle5im, x1m22); let t_b5_2 = vmulq_f64(self.twiddle10im, x2m21); let t_b5_3 = vmulq_f64(self.twiddle8im, x3m20); let t_b5_4 = vmulq_f64(self.twiddle3im, x4m19); let t_b5_5 = vmulq_f64(self.twiddle2im, x5m18); let t_b5_6 = vmulq_f64(self.twiddle7im, x6m17); let t_b5_7 = vmulq_f64(self.twiddle11im, x7m16); let t_b5_8 = vmulq_f64(self.twiddle6im, x8m15); let t_b5_9 = vmulq_f64(self.twiddle1im, x9m14); let t_b5_10 = vmulq_f64(self.twiddle4im, x10m13); let t_b5_11 = vmulq_f64(self.twiddle9im, x11m12); let t_b6_1 = vmulq_f64(self.twiddle6im, x1m22); let t_b6_2 = vmulq_f64(self.twiddle11im, x2m21); let t_b6_3 = vmulq_f64(self.twiddle5im, x3m20); let t_b6_4 = vmulq_f64(self.twiddle1im, x4m19); let t_b6_5 = vmulq_f64(self.twiddle7im, x5m18); let t_b6_6 = vmulq_f64(self.twiddle10im, x6m17); let t_b6_7 = vmulq_f64(self.twiddle4im, x7m16); let t_b6_8 = vmulq_f64(self.twiddle2im, x8m15); let t_b6_9 = vmulq_f64(self.twiddle8im, x9m14); let t_b6_10 = vmulq_f64(self.twiddle9im, x10m13); let 
t_b6_11 = vmulq_f64(self.twiddle3im, x11m12); let t_b7_1 = vmulq_f64(self.twiddle7im, x1m22); let t_b7_2 = vmulq_f64(self.twiddle9im, x2m21); let t_b7_3 = vmulq_f64(self.twiddle2im, x3m20); let t_b7_4 = vmulq_f64(self.twiddle5im, x4m19); let t_b7_5 = vmulq_f64(self.twiddle11im, x5m18); let t_b7_6 = vmulq_f64(self.twiddle4im, x6m17); let t_b7_7 = vmulq_f64(self.twiddle3im, x7m16); let t_b7_8 = vmulq_f64(self.twiddle10im, x8m15); let t_b7_9 = vmulq_f64(self.twiddle6im, x9m14); let t_b7_10 = vmulq_f64(self.twiddle1im, x10m13); let t_b7_11 = vmulq_f64(self.twiddle8im, x11m12); let t_b8_1 = vmulq_f64(self.twiddle8im, x1m22); let t_b8_2 = vmulq_f64(self.twiddle7im, x2m21); let t_b8_3 = vmulq_f64(self.twiddle1im, x3m20); let t_b8_4 = vmulq_f64(self.twiddle9im, x4m19); let t_b8_5 = vmulq_f64(self.twiddle6im, x5m18); let t_b8_6 = vmulq_f64(self.twiddle2im, x6m17); let t_b8_7 = vmulq_f64(self.twiddle10im, x7m16); let t_b8_8 = vmulq_f64(self.twiddle5im, x8m15); let t_b8_9 = vmulq_f64(self.twiddle3im, x9m14); let t_b8_10 = vmulq_f64(self.twiddle11im, x10m13); let t_b8_11 = vmulq_f64(self.twiddle4im, x11m12); let t_b9_1 = vmulq_f64(self.twiddle9im, x1m22); let t_b9_2 = vmulq_f64(self.twiddle5im, x2m21); let t_b9_3 = vmulq_f64(self.twiddle4im, x3m20); let t_b9_4 = vmulq_f64(self.twiddle10im, x4m19); let t_b9_5 = vmulq_f64(self.twiddle1im, x5m18); let t_b9_6 = vmulq_f64(self.twiddle8im, x6m17); let t_b9_7 = vmulq_f64(self.twiddle6im, x7m16); let t_b9_8 = vmulq_f64(self.twiddle3im, x8m15); let t_b9_9 = vmulq_f64(self.twiddle11im, x9m14); let t_b9_10 = vmulq_f64(self.twiddle2im, x10m13); let t_b9_11 = vmulq_f64(self.twiddle7im, x11m12); let t_b10_1 = vmulq_f64(self.twiddle10im, x1m22); let t_b10_2 = vmulq_f64(self.twiddle3im, x2m21); let t_b10_3 = vmulq_f64(self.twiddle7im, x3m20); let t_b10_4 = vmulq_f64(self.twiddle6im, x4m19); let t_b10_5 = vmulq_f64(self.twiddle4im, x5m18); let t_b10_6 = vmulq_f64(self.twiddle9im, x6m17); let t_b10_7 = vmulq_f64(self.twiddle1im, x7m16); let t_b10_8 = vmulq_f64(self.twiddle11im, x8m15); let t_b10_9 = vmulq_f64(self.twiddle2im, x9m14); let t_b10_10 = vmulq_f64(self.twiddle8im, x10m13); let t_b10_11 = vmulq_f64(self.twiddle5im, x11m12); let t_b11_1 = vmulq_f64(self.twiddle11im, x1m22); let t_b11_2 = vmulq_f64(self.twiddle1im, x2m21); let t_b11_3 = vmulq_f64(self.twiddle10im, x3m20); let t_b11_4 = vmulq_f64(self.twiddle2im, x4m19); let t_b11_5 = vmulq_f64(self.twiddle9im, x5m18); let t_b11_6 = vmulq_f64(self.twiddle3im, x6m17); let t_b11_7 = vmulq_f64(self.twiddle8im, x7m16); let t_b11_8 = vmulq_f64(self.twiddle4im, x8m15); let t_b11_9 = vmulq_f64(self.twiddle7im, x9m14); let t_b11_10 = vmulq_f64(self.twiddle5im, x10m13); let t_b11_11 = vmulq_f64(self.twiddle6im, x11m12); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11); let t_a7 = calc_f64!(x0 + t_a7_1 + 
t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11); let t_a8 = calc_f64!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11); let t_a9 = calc_f64!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11); let t_a10 = calc_f64!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11); let t_a11 = calc_f64!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 + t_b3_8 + t_b3_9 + t_b3_10 + t_b3_11); let t_b4 = calc_f64!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 - t_b4_5 + t_b4_6 + t_b4_7 + t_b4_8 - t_b4_9 - t_b4_10 - t_b4_11); let t_b5 = calc_f64!(t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 + t_b5_5 + t_b5_6 - t_b5_7 - t_b5_8 - t_b5_9 + t_b5_10 + t_b5_11); let t_b6 = calc_f64!(t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 + t_b6_5 - t_b6_6 - t_b6_7 + t_b6_8 + t_b6_9 - t_b6_10 - t_b6_11); let t_b7 = calc_f64!(t_b7_1 - t_b7_2 - t_b7_3 + t_b7_4 - t_b7_5 - t_b7_6 + t_b7_7 + t_b7_8 - t_b7_9 + t_b7_10 + t_b7_11); let t_b8 = calc_f64!(t_b8_1 - t_b8_2 + t_b8_3 + t_b8_4 - t_b8_5 + t_b8_6 + t_b8_7 - t_b8_8 + t_b8_9 + t_b8_10 - t_b8_11); let t_b9 = calc_f64!(t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 - t_b9_5 + t_b9_6 - t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11); let t_b10 = calc_f64!(t_b10_1 - t_b10_2 + t_b10_3 - t_b10_4 + t_b10_5 - t_b10_6 + t_b10_7 + t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11); let t_b11 = calc_f64!(t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 + t_b11_5 - t_b11_6 + t_b11_7 - t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let t_b10_rot = self.rotate.rotate(t_b10); let t_b11_rot = self.rotate.rotate(t_b11); let y0 = calc_f64!(x0 + x1p22 + x2p21 + x3p20 + x4p19 + x5p18 + x6p17 + x7p16 + x8p15 + x9p14 + x10p13 + x11p12); let [y1, y22] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y21] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y20] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y19] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y18] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y17] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y16] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y15] = solo_fft2_f64(t_a8, t_b8_rot); let [y9, y14] = solo_fft2_f64(t_a9, t_b9_rot); let [y10, y13] = solo_fft2_f64(t_a10, t_b10_rot); let [y11, y12] = solo_fft2_f64(t_a11, t_b11_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22] } } // ____ ___ _________ _ _ _ // |___ \ / _ \ |___ /___ \| |__ (_) |_ // __) | (_) | _____ |_ \ __) | '_ \| | __| // / __/ \__, | |_____| ___) / __/| |_) | | |_ // |_____| /_/ |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly29 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: float32x4_t, twiddle1im: 
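// The +/- patterns in the t_b sums above encode a fold of the twiddle index
// j*k mod n back into the range 1..=(n-1)/2: cosine is symmetric, so the t_a
// sums always add, while sine is antisymmetric, so every fold past the midpoint
// flips the sign. A small sketch of that fold; `fold_twiddle_index` is an
// illustrative helper, not an item in this crate (n is prime here, so the index
// is never zero for 1 <= j, k <= (n-1)/2).
fn fold_twiddle_index(j: usize, k: usize, n: usize) -> (usize, f64) {
    let idx = (j * k) % n;
    if idx <= (n - 1) / 2 {
        (idx, 1.0) // sin(2*pi*idx/n) keeps its sign
    } else {
        (n - idx, -1.0) // sin(2*pi*(n-idx)/n) = -sin(2*pi*idx/n)
    }
}
// For example, with n = 23, j = 2, k = 6: 12 folds to 11 with a sign flip, which
// is why t_b2_6 above uses twiddle11im and enters t_b2 with a minus sign.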
float32x4_t, twiddle2re: float32x4_t, twiddle2im: float32x4_t, twiddle3re: float32x4_t, twiddle3im: float32x4_t, twiddle4re: float32x4_t, twiddle4im: float32x4_t, twiddle5re: float32x4_t, twiddle5im: float32x4_t, twiddle6re: float32x4_t, twiddle6im: float32x4_t, twiddle7re: float32x4_t, twiddle7im: float32x4_t, twiddle8re: float32x4_t, twiddle8im: float32x4_t, twiddle9re: float32x4_t, twiddle9im: float32x4_t, twiddle10re: float32x4_t, twiddle10im: float32x4_t, twiddle11re: float32x4_t, twiddle11im: float32x4_t, twiddle12re: float32x4_t, twiddle12im: float32x4_t, twiddle13re: float32x4_t, twiddle13im: float32x4_t, twiddle14re: float32x4_t, twiddle14im: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly29, 29, |this: &NeonF32Butterfly29<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly29, 29, |this: &NeonF32Butterfly29<_>| this .direction); impl NeonF32Butterfly29 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 29, direction); let tw2: Complex = twiddles::compute_twiddle(2, 29, direction); let tw3: Complex = twiddles::compute_twiddle(3, 29, direction); let tw4: Complex = twiddles::compute_twiddle(4, 29, direction); let tw5: Complex = twiddles::compute_twiddle(5, 29, direction); let tw6: Complex = twiddles::compute_twiddle(6, 29, direction); let tw7: Complex = twiddles::compute_twiddle(7, 29, direction); let tw8: Complex = twiddles::compute_twiddle(8, 29, direction); let tw9: Complex = twiddles::compute_twiddle(9, 29, direction); let tw10: Complex = twiddles::compute_twiddle(10, 29, direction); let tw11: Complex = twiddles::compute_twiddle(11, 29, direction); let tw12: Complex = twiddles::compute_twiddle(12, 29, direction); let tw13: Complex = twiddles::compute_twiddle(13, 29, direction); let tw14: Complex = twiddles::compute_twiddle(14, 29, direction); let twiddle1re = unsafe { vmovq_n_f32(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f32(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f32(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f32(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f32(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f32(tw3.im) }; let twiddle4re = unsafe { vmovq_n_f32(tw4.re) }; let twiddle4im = unsafe { vmovq_n_f32(tw4.im) }; let twiddle5re = unsafe { vmovq_n_f32(tw5.re) }; let twiddle5im = unsafe { vmovq_n_f32(tw5.im) }; let twiddle6re = unsafe { vmovq_n_f32(tw6.re) }; let twiddle6im = unsafe { vmovq_n_f32(tw6.im) }; let twiddle7re = unsafe { vmovq_n_f32(tw7.re) }; let twiddle7im = unsafe { vmovq_n_f32(tw7.im) }; let twiddle8re = unsafe { vmovq_n_f32(tw8.re) }; let twiddle8im = unsafe { vmovq_n_f32(tw8.im) }; let twiddle9re = unsafe { vmovq_n_f32(tw9.re) }; let twiddle9im = unsafe { vmovq_n_f32(tw9.im) }; let twiddle10re = unsafe { vmovq_n_f32(tw10.re) }; let twiddle10im = unsafe { vmovq_n_f32(tw10.im) }; let twiddle11re = unsafe { vmovq_n_f32(tw11.re) }; let twiddle11im = unsafe { vmovq_n_f32(tw11.im) }; let twiddle12re = unsafe { vmovq_n_f32(tw12.re) }; let twiddle12im = unsafe { vmovq_n_f32(tw12.im) }; let twiddle13re = unsafe { vmovq_n_f32(tw13.re) }; let twiddle13im = unsafe { vmovq_n_f32(tw13.im) }; let twiddle14re = unsafe { vmovq_n_f32(tw14.re) }; let twiddle14im = unsafe { vmovq_n_f32(tw14.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, 
twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, twiddle12re, twiddle12im, twiddle13re, twiddle13im, twiddle14re, twiddle14im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[14]), extract_hi_lo_f32(input_packed[0], input_packed[15]), extract_lo_hi_f32(input_packed[1], input_packed[15]), extract_hi_lo_f32(input_packed[1], input_packed[16]), extract_lo_hi_f32(input_packed[2], input_packed[16]), extract_hi_lo_f32(input_packed[2], input_packed[17]), extract_lo_hi_f32(input_packed[3], input_packed[17]), extract_hi_lo_f32(input_packed[3], input_packed[18]), extract_lo_hi_f32(input_packed[4], input_packed[18]), extract_hi_lo_f32(input_packed[4], input_packed[19]), extract_lo_hi_f32(input_packed[5], input_packed[19]), extract_hi_lo_f32(input_packed[5], input_packed[20]), extract_lo_hi_f32(input_packed[6], input_packed[20]), extract_hi_lo_f32(input_packed[6], input_packed[21]), extract_lo_hi_f32(input_packed[7], input_packed[21]), extract_hi_lo_f32(input_packed[7], input_packed[22]), extract_lo_hi_f32(input_packed[8], input_packed[22]), extract_hi_lo_f32(input_packed[8], input_packed[23]), extract_lo_hi_f32(input_packed[9], input_packed[23]), extract_hi_lo_f32(input_packed[9], input_packed[24]), extract_lo_hi_f32(input_packed[10], input_packed[24]), extract_hi_lo_f32(input_packed[10], input_packed[25]), extract_lo_hi_f32(input_packed[11], input_packed[25]), extract_hi_lo_f32(input_packed[11], input_packed[26]), extract_lo_hi_f32(input_packed[12], input_packed[26]), extract_hi_lo_f32(input_packed[12], input_packed[27]), extract_lo_hi_f32(input_packed[13], input_packed[27]), extract_hi_lo_f32(input_packed[13], input_packed[28]), extract_lo_hi_f32(input_packed[14], input_packed[28]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_lo_f32(out[16], out[17]), extract_lo_lo_f32(out[18], out[19]), extract_lo_lo_f32(out[20], out[21]), extract_lo_lo_f32(out[22], out[23]), extract_lo_lo_f32(out[24], out[25]), extract_lo_lo_f32(out[26], out[27]), extract_lo_hi_f32(out[28], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), extract_hi_hi_f32(out[19], out[20]), extract_hi_hi_f32(out[21], out[22]), 
extract_hi_hi_f32(out[23], out[24]), extract_hi_hi_f32(out[25], out[26]), extract_hi_hi_f32(out[27], out[28]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [float32x4_t; 29]) -> [float32x4_t; 29] { let [x1p28, x1m28] = parallel_fft2_interleaved_f32(values[1], values[28]); let [x2p27, x2m27] = parallel_fft2_interleaved_f32(values[2], values[27]); let [x3p26, x3m26] = parallel_fft2_interleaved_f32(values[3], values[26]); let [x4p25, x4m25] = parallel_fft2_interleaved_f32(values[4], values[25]); let [x5p24, x5m24] = parallel_fft2_interleaved_f32(values[5], values[24]); let [x6p23, x6m23] = parallel_fft2_interleaved_f32(values[6], values[23]); let [x7p22, x7m22] = parallel_fft2_interleaved_f32(values[7], values[22]); let [x8p21, x8m21] = parallel_fft2_interleaved_f32(values[8], values[21]); let [x9p20, x9m20] = parallel_fft2_interleaved_f32(values[9], values[20]); let [x10p19, x10m19] = parallel_fft2_interleaved_f32(values[10], values[19]); let [x11p18, x11m18] = parallel_fft2_interleaved_f32(values[11], values[18]); let [x12p17, x12m17] = parallel_fft2_interleaved_f32(values[12], values[17]); let [x13p16, x13m16] = parallel_fft2_interleaved_f32(values[13], values[16]); let [x14p15, x14m15] = parallel_fft2_interleaved_f32(values[14], values[15]); let t_a1_1 = vmulq_f32(self.twiddle1re, x1p28); let t_a1_2 = vmulq_f32(self.twiddle2re, x2p27); let t_a1_3 = vmulq_f32(self.twiddle3re, x3p26); let t_a1_4 = vmulq_f32(self.twiddle4re, x4p25); let t_a1_5 = vmulq_f32(self.twiddle5re, x5p24); let t_a1_6 = vmulq_f32(self.twiddle6re, x6p23); let t_a1_7 = vmulq_f32(self.twiddle7re, x7p22); let t_a1_8 = vmulq_f32(self.twiddle8re, x8p21); let t_a1_9 = vmulq_f32(self.twiddle9re, x9p20); let t_a1_10 = vmulq_f32(self.twiddle10re, x10p19); let t_a1_11 = vmulq_f32(self.twiddle11re, x11p18); let t_a1_12 = vmulq_f32(self.twiddle12re, x12p17); let t_a1_13 = vmulq_f32(self.twiddle13re, x13p16); let t_a1_14 = vmulq_f32(self.twiddle14re, x14p15); let t_a2_1 = vmulq_f32(self.twiddle2re, x1p28); let t_a2_2 = vmulq_f32(self.twiddle4re, x2p27); let t_a2_3 = vmulq_f32(self.twiddle6re, x3p26); let t_a2_4 = vmulq_f32(self.twiddle8re, x4p25); let t_a2_5 = vmulq_f32(self.twiddle10re, x5p24); let t_a2_6 = vmulq_f32(self.twiddle12re, x6p23); let t_a2_7 = vmulq_f32(self.twiddle14re, x7p22); let t_a2_8 = vmulq_f32(self.twiddle13re, x8p21); let t_a2_9 = vmulq_f32(self.twiddle11re, x9p20); let t_a2_10 = vmulq_f32(self.twiddle9re, x10p19); let t_a2_11 = vmulq_f32(self.twiddle7re, x11p18); let t_a2_12 = vmulq_f32(self.twiddle5re, x12p17); let t_a2_13 = vmulq_f32(self.twiddle3re, x13p16); let t_a2_14 = vmulq_f32(self.twiddle1re, x14p15); let t_a3_1 = vmulq_f32(self.twiddle3re, x1p28); let t_a3_2 = vmulq_f32(self.twiddle6re, x2p27); let t_a3_3 = vmulq_f32(self.twiddle9re, x3p26); let t_a3_4 = vmulq_f32(self.twiddle12re, x4p25); let t_a3_5 = vmulq_f32(self.twiddle14re, x5p24); let t_a3_6 = vmulq_f32(self.twiddle11re, x6p23); let t_a3_7 = vmulq_f32(self.twiddle8re, x7p22); let t_a3_8 = vmulq_f32(self.twiddle5re, x8p21); let t_a3_9 = vmulq_f32(self.twiddle2re, x9p20); let t_a3_10 = vmulq_f32(self.twiddle1re, x10p19); let t_a3_11 = vmulq_f32(self.twiddle4re, x11p18); let t_a3_12 = vmulq_f32(self.twiddle7re, x12p17); let t_a3_13 = vmulq_f32(self.twiddle10re, x13p16); let t_a3_14 = vmulq_f32(self.twiddle13re, x14p15); let t_a4_1 = 
vmulq_f32(self.twiddle4re, x1p28); let t_a4_2 = vmulq_f32(self.twiddle8re, x2p27); let t_a4_3 = vmulq_f32(self.twiddle12re, x3p26); let t_a4_4 = vmulq_f32(self.twiddle13re, x4p25); let t_a4_5 = vmulq_f32(self.twiddle9re, x5p24); let t_a4_6 = vmulq_f32(self.twiddle5re, x6p23); let t_a4_7 = vmulq_f32(self.twiddle1re, x7p22); let t_a4_8 = vmulq_f32(self.twiddle3re, x8p21); let t_a4_9 = vmulq_f32(self.twiddle7re, x9p20); let t_a4_10 = vmulq_f32(self.twiddle11re, x10p19); let t_a4_11 = vmulq_f32(self.twiddle14re, x11p18); let t_a4_12 = vmulq_f32(self.twiddle10re, x12p17); let t_a4_13 = vmulq_f32(self.twiddle6re, x13p16); let t_a4_14 = vmulq_f32(self.twiddle2re, x14p15); let t_a5_1 = vmulq_f32(self.twiddle5re, x1p28); let t_a5_2 = vmulq_f32(self.twiddle10re, x2p27); let t_a5_3 = vmulq_f32(self.twiddle14re, x3p26); let t_a5_4 = vmulq_f32(self.twiddle9re, x4p25); let t_a5_5 = vmulq_f32(self.twiddle4re, x5p24); let t_a5_6 = vmulq_f32(self.twiddle1re, x6p23); let t_a5_7 = vmulq_f32(self.twiddle6re, x7p22); let t_a5_8 = vmulq_f32(self.twiddle11re, x8p21); let t_a5_9 = vmulq_f32(self.twiddle13re, x9p20); let t_a5_10 = vmulq_f32(self.twiddle8re, x10p19); let t_a5_11 = vmulq_f32(self.twiddle3re, x11p18); let t_a5_12 = vmulq_f32(self.twiddle2re, x12p17); let t_a5_13 = vmulq_f32(self.twiddle7re, x13p16); let t_a5_14 = vmulq_f32(self.twiddle12re, x14p15); let t_a6_1 = vmulq_f32(self.twiddle6re, x1p28); let t_a6_2 = vmulq_f32(self.twiddle12re, x2p27); let t_a6_3 = vmulq_f32(self.twiddle11re, x3p26); let t_a6_4 = vmulq_f32(self.twiddle5re, x4p25); let t_a6_5 = vmulq_f32(self.twiddle1re, x5p24); let t_a6_6 = vmulq_f32(self.twiddle7re, x6p23); let t_a6_7 = vmulq_f32(self.twiddle13re, x7p22); let t_a6_8 = vmulq_f32(self.twiddle10re, x8p21); let t_a6_9 = vmulq_f32(self.twiddle4re, x9p20); let t_a6_10 = vmulq_f32(self.twiddle2re, x10p19); let t_a6_11 = vmulq_f32(self.twiddle8re, x11p18); let t_a6_12 = vmulq_f32(self.twiddle14re, x12p17); let t_a6_13 = vmulq_f32(self.twiddle9re, x13p16); let t_a6_14 = vmulq_f32(self.twiddle3re, x14p15); let t_a7_1 = vmulq_f32(self.twiddle7re, x1p28); let t_a7_2 = vmulq_f32(self.twiddle14re, x2p27); let t_a7_3 = vmulq_f32(self.twiddle8re, x3p26); let t_a7_4 = vmulq_f32(self.twiddle1re, x4p25); let t_a7_5 = vmulq_f32(self.twiddle6re, x5p24); let t_a7_6 = vmulq_f32(self.twiddle13re, x6p23); let t_a7_7 = vmulq_f32(self.twiddle9re, x7p22); let t_a7_8 = vmulq_f32(self.twiddle2re, x8p21); let t_a7_9 = vmulq_f32(self.twiddle5re, x9p20); let t_a7_10 = vmulq_f32(self.twiddle12re, x10p19); let t_a7_11 = vmulq_f32(self.twiddle10re, x11p18); let t_a7_12 = vmulq_f32(self.twiddle3re, x12p17); let t_a7_13 = vmulq_f32(self.twiddle4re, x13p16); let t_a7_14 = vmulq_f32(self.twiddle11re, x14p15); let t_a8_1 = vmulq_f32(self.twiddle8re, x1p28); let t_a8_2 = vmulq_f32(self.twiddle13re, x2p27); let t_a8_3 = vmulq_f32(self.twiddle5re, x3p26); let t_a8_4 = vmulq_f32(self.twiddle3re, x4p25); let t_a8_5 = vmulq_f32(self.twiddle11re, x5p24); let t_a8_6 = vmulq_f32(self.twiddle10re, x6p23); let t_a8_7 = vmulq_f32(self.twiddle2re, x7p22); let t_a8_8 = vmulq_f32(self.twiddle6re, x8p21); let t_a8_9 = vmulq_f32(self.twiddle14re, x9p20); let t_a8_10 = vmulq_f32(self.twiddle7re, x10p19); let t_a8_11 = vmulq_f32(self.twiddle1re, x11p18); let t_a8_12 = vmulq_f32(self.twiddle9re, x12p17); let t_a8_13 = vmulq_f32(self.twiddle12re, x13p16); let t_a8_14 = vmulq_f32(self.twiddle4re, x14p15); let t_a9_1 = vmulq_f32(self.twiddle9re, x1p28); let t_a9_2 = vmulq_f32(self.twiddle11re, x2p27); let t_a9_3 = 
vmulq_f32(self.twiddle2re, x3p26); let t_a9_4 = vmulq_f32(self.twiddle7re, x4p25); let t_a9_5 = vmulq_f32(self.twiddle13re, x5p24); let t_a9_6 = vmulq_f32(self.twiddle4re, x6p23); let t_a9_7 = vmulq_f32(self.twiddle5re, x7p22); let t_a9_8 = vmulq_f32(self.twiddle14re, x8p21); let t_a9_9 = vmulq_f32(self.twiddle6re, x9p20); let t_a9_10 = vmulq_f32(self.twiddle3re, x10p19); let t_a9_11 = vmulq_f32(self.twiddle12re, x11p18); let t_a9_12 = vmulq_f32(self.twiddle8re, x12p17); let t_a9_13 = vmulq_f32(self.twiddle1re, x13p16); let t_a9_14 = vmulq_f32(self.twiddle10re, x14p15); let t_a10_1 = vmulq_f32(self.twiddle10re, x1p28); let t_a10_2 = vmulq_f32(self.twiddle9re, x2p27); let t_a10_3 = vmulq_f32(self.twiddle1re, x3p26); let t_a10_4 = vmulq_f32(self.twiddle11re, x4p25); let t_a10_5 = vmulq_f32(self.twiddle8re, x5p24); let t_a10_6 = vmulq_f32(self.twiddle2re, x6p23); let t_a10_7 = vmulq_f32(self.twiddle12re, x7p22); let t_a10_8 = vmulq_f32(self.twiddle7re, x8p21); let t_a10_9 = vmulq_f32(self.twiddle3re, x9p20); let t_a10_10 = vmulq_f32(self.twiddle13re, x10p19); let t_a10_11 = vmulq_f32(self.twiddle6re, x11p18); let t_a10_12 = vmulq_f32(self.twiddle4re, x12p17); let t_a10_13 = vmulq_f32(self.twiddle14re, x13p16); let t_a10_14 = vmulq_f32(self.twiddle5re, x14p15); let t_a11_1 = vmulq_f32(self.twiddle11re, x1p28); let t_a11_2 = vmulq_f32(self.twiddle7re, x2p27); let t_a11_3 = vmulq_f32(self.twiddle4re, x3p26); let t_a11_4 = vmulq_f32(self.twiddle14re, x4p25); let t_a11_5 = vmulq_f32(self.twiddle3re, x5p24); let t_a11_6 = vmulq_f32(self.twiddle8re, x6p23); let t_a11_7 = vmulq_f32(self.twiddle10re, x7p22); let t_a11_8 = vmulq_f32(self.twiddle1re, x8p21); let t_a11_9 = vmulq_f32(self.twiddle12re, x9p20); let t_a11_10 = vmulq_f32(self.twiddle6re, x10p19); let t_a11_11 = vmulq_f32(self.twiddle5re, x11p18); let t_a11_12 = vmulq_f32(self.twiddle13re, x12p17); let t_a11_13 = vmulq_f32(self.twiddle2re, x13p16); let t_a11_14 = vmulq_f32(self.twiddle9re, x14p15); let t_a12_1 = vmulq_f32(self.twiddle12re, x1p28); let t_a12_2 = vmulq_f32(self.twiddle5re, x2p27); let t_a12_3 = vmulq_f32(self.twiddle7re, x3p26); let t_a12_4 = vmulq_f32(self.twiddle10re, x4p25); let t_a12_5 = vmulq_f32(self.twiddle2re, x5p24); let t_a12_6 = vmulq_f32(self.twiddle14re, x6p23); let t_a12_7 = vmulq_f32(self.twiddle3re, x7p22); let t_a12_8 = vmulq_f32(self.twiddle9re, x8p21); let t_a12_9 = vmulq_f32(self.twiddle8re, x9p20); let t_a12_10 = vmulq_f32(self.twiddle4re, x10p19); let t_a12_11 = vmulq_f32(self.twiddle13re, x11p18); let t_a12_12 = vmulq_f32(self.twiddle1re, x12p17); let t_a12_13 = vmulq_f32(self.twiddle11re, x13p16); let t_a12_14 = vmulq_f32(self.twiddle6re, x14p15); let t_a13_1 = vmulq_f32(self.twiddle13re, x1p28); let t_a13_2 = vmulq_f32(self.twiddle3re, x2p27); let t_a13_3 = vmulq_f32(self.twiddle10re, x3p26); let t_a13_4 = vmulq_f32(self.twiddle6re, x4p25); let t_a13_5 = vmulq_f32(self.twiddle7re, x5p24); let t_a13_6 = vmulq_f32(self.twiddle9re, x6p23); let t_a13_7 = vmulq_f32(self.twiddle4re, x7p22); let t_a13_8 = vmulq_f32(self.twiddle12re, x8p21); let t_a13_9 = vmulq_f32(self.twiddle1re, x9p20); let t_a13_10 = vmulq_f32(self.twiddle14re, x10p19); let t_a13_11 = vmulq_f32(self.twiddle2re, x11p18); let t_a13_12 = vmulq_f32(self.twiddle11re, x12p17); let t_a13_13 = vmulq_f32(self.twiddle5re, x13p16); let t_a13_14 = vmulq_f32(self.twiddle8re, x14p15); let t_a14_1 = vmulq_f32(self.twiddle14re, x1p28); let t_a14_2 = vmulq_f32(self.twiddle1re, x2p27); let t_a14_3 = vmulq_f32(self.twiddle13re, x3p26); let t_a14_4 = 
vmulq_f32(self.twiddle2re, x4p25); let t_a14_5 = vmulq_f32(self.twiddle12re, x5p24); let t_a14_6 = vmulq_f32(self.twiddle3re, x6p23); let t_a14_7 = vmulq_f32(self.twiddle11re, x7p22); let t_a14_8 = vmulq_f32(self.twiddle4re, x8p21); let t_a14_9 = vmulq_f32(self.twiddle10re, x9p20); let t_a14_10 = vmulq_f32(self.twiddle5re, x10p19); let t_a14_11 = vmulq_f32(self.twiddle9re, x11p18); let t_a14_12 = vmulq_f32(self.twiddle6re, x12p17); let t_a14_13 = vmulq_f32(self.twiddle8re, x13p16); let t_a14_14 = vmulq_f32(self.twiddle7re, x14p15); let t_b1_1 = vmulq_f32(self.twiddle1im, x1m28); let t_b1_2 = vmulq_f32(self.twiddle2im, x2m27); let t_b1_3 = vmulq_f32(self.twiddle3im, x3m26); let t_b1_4 = vmulq_f32(self.twiddle4im, x4m25); let t_b1_5 = vmulq_f32(self.twiddle5im, x5m24); let t_b1_6 = vmulq_f32(self.twiddle6im, x6m23); let t_b1_7 = vmulq_f32(self.twiddle7im, x7m22); let t_b1_8 = vmulq_f32(self.twiddle8im, x8m21); let t_b1_9 = vmulq_f32(self.twiddle9im, x9m20); let t_b1_10 = vmulq_f32(self.twiddle10im, x10m19); let t_b1_11 = vmulq_f32(self.twiddle11im, x11m18); let t_b1_12 = vmulq_f32(self.twiddle12im, x12m17); let t_b1_13 = vmulq_f32(self.twiddle13im, x13m16); let t_b1_14 = vmulq_f32(self.twiddle14im, x14m15); let t_b2_1 = vmulq_f32(self.twiddle2im, x1m28); let t_b2_2 = vmulq_f32(self.twiddle4im, x2m27); let t_b2_3 = vmulq_f32(self.twiddle6im, x3m26); let t_b2_4 = vmulq_f32(self.twiddle8im, x4m25); let t_b2_5 = vmulq_f32(self.twiddle10im, x5m24); let t_b2_6 = vmulq_f32(self.twiddle12im, x6m23); let t_b2_7 = vmulq_f32(self.twiddle14im, x7m22); let t_b2_8 = vmulq_f32(self.twiddle13im, x8m21); let t_b2_9 = vmulq_f32(self.twiddle11im, x9m20); let t_b2_10 = vmulq_f32(self.twiddle9im, x10m19); let t_b2_11 = vmulq_f32(self.twiddle7im, x11m18); let t_b2_12 = vmulq_f32(self.twiddle5im, x12m17); let t_b2_13 = vmulq_f32(self.twiddle3im, x13m16); let t_b2_14 = vmulq_f32(self.twiddle1im, x14m15); let t_b3_1 = vmulq_f32(self.twiddle3im, x1m28); let t_b3_2 = vmulq_f32(self.twiddle6im, x2m27); let t_b3_3 = vmulq_f32(self.twiddle9im, x3m26); let t_b3_4 = vmulq_f32(self.twiddle12im, x4m25); let t_b3_5 = vmulq_f32(self.twiddle14im, x5m24); let t_b3_6 = vmulq_f32(self.twiddle11im, x6m23); let t_b3_7 = vmulq_f32(self.twiddle8im, x7m22); let t_b3_8 = vmulq_f32(self.twiddle5im, x8m21); let t_b3_9 = vmulq_f32(self.twiddle2im, x9m20); let t_b3_10 = vmulq_f32(self.twiddle1im, x10m19); let t_b3_11 = vmulq_f32(self.twiddle4im, x11m18); let t_b3_12 = vmulq_f32(self.twiddle7im, x12m17); let t_b3_13 = vmulq_f32(self.twiddle10im, x13m16); let t_b3_14 = vmulq_f32(self.twiddle13im, x14m15); let t_b4_1 = vmulq_f32(self.twiddle4im, x1m28); let t_b4_2 = vmulq_f32(self.twiddle8im, x2m27); let t_b4_3 = vmulq_f32(self.twiddle12im, x3m26); let t_b4_4 = vmulq_f32(self.twiddle13im, x4m25); let t_b4_5 = vmulq_f32(self.twiddle9im, x5m24); let t_b4_6 = vmulq_f32(self.twiddle5im, x6m23); let t_b4_7 = vmulq_f32(self.twiddle1im, x7m22); let t_b4_8 = vmulq_f32(self.twiddle3im, x8m21); let t_b4_9 = vmulq_f32(self.twiddle7im, x9m20); let t_b4_10 = vmulq_f32(self.twiddle11im, x10m19); let t_b4_11 = vmulq_f32(self.twiddle14im, x11m18); let t_b4_12 = vmulq_f32(self.twiddle10im, x12m17); let t_b4_13 = vmulq_f32(self.twiddle6im, x13m16); let t_b4_14 = vmulq_f32(self.twiddle2im, x14m15); let t_b5_1 = vmulq_f32(self.twiddle5im, x1m28); let t_b5_2 = vmulq_f32(self.twiddle10im, x2m27); let t_b5_3 = vmulq_f32(self.twiddle14im, x3m26); let t_b5_4 = vmulq_f32(self.twiddle9im, x4m25); let t_b5_5 = vmulq_f32(self.twiddle4im, x5m24); let t_b5_6 = 
vmulq_f32(self.twiddle1im, x6m23); let t_b5_7 = vmulq_f32(self.twiddle6im, x7m22); let t_b5_8 = vmulq_f32(self.twiddle11im, x8m21); let t_b5_9 = vmulq_f32(self.twiddle13im, x9m20); let t_b5_10 = vmulq_f32(self.twiddle8im, x10m19); let t_b5_11 = vmulq_f32(self.twiddle3im, x11m18); let t_b5_12 = vmulq_f32(self.twiddle2im, x12m17); let t_b5_13 = vmulq_f32(self.twiddle7im, x13m16); let t_b5_14 = vmulq_f32(self.twiddle12im, x14m15); let t_b6_1 = vmulq_f32(self.twiddle6im, x1m28); let t_b6_2 = vmulq_f32(self.twiddle12im, x2m27); let t_b6_3 = vmulq_f32(self.twiddle11im, x3m26); let t_b6_4 = vmulq_f32(self.twiddle5im, x4m25); let t_b6_5 = vmulq_f32(self.twiddle1im, x5m24); let t_b6_6 = vmulq_f32(self.twiddle7im, x6m23); let t_b6_7 = vmulq_f32(self.twiddle13im, x7m22); let t_b6_8 = vmulq_f32(self.twiddle10im, x8m21); let t_b6_9 = vmulq_f32(self.twiddle4im, x9m20); let t_b6_10 = vmulq_f32(self.twiddle2im, x10m19); let t_b6_11 = vmulq_f32(self.twiddle8im, x11m18); let t_b6_12 = vmulq_f32(self.twiddle14im, x12m17); let t_b6_13 = vmulq_f32(self.twiddle9im, x13m16); let t_b6_14 = vmulq_f32(self.twiddle3im, x14m15); let t_b7_1 = vmulq_f32(self.twiddle7im, x1m28); let t_b7_2 = vmulq_f32(self.twiddle14im, x2m27); let t_b7_3 = vmulq_f32(self.twiddle8im, x3m26); let t_b7_4 = vmulq_f32(self.twiddle1im, x4m25); let t_b7_5 = vmulq_f32(self.twiddle6im, x5m24); let t_b7_6 = vmulq_f32(self.twiddle13im, x6m23); let t_b7_7 = vmulq_f32(self.twiddle9im, x7m22); let t_b7_8 = vmulq_f32(self.twiddle2im, x8m21); let t_b7_9 = vmulq_f32(self.twiddle5im, x9m20); let t_b7_10 = vmulq_f32(self.twiddle12im, x10m19); let t_b7_11 = vmulq_f32(self.twiddle10im, x11m18); let t_b7_12 = vmulq_f32(self.twiddle3im, x12m17); let t_b7_13 = vmulq_f32(self.twiddle4im, x13m16); let t_b7_14 = vmulq_f32(self.twiddle11im, x14m15); let t_b8_1 = vmulq_f32(self.twiddle8im, x1m28); let t_b8_2 = vmulq_f32(self.twiddle13im, x2m27); let t_b8_3 = vmulq_f32(self.twiddle5im, x3m26); let t_b8_4 = vmulq_f32(self.twiddle3im, x4m25); let t_b8_5 = vmulq_f32(self.twiddle11im, x5m24); let t_b8_6 = vmulq_f32(self.twiddle10im, x6m23); let t_b8_7 = vmulq_f32(self.twiddle2im, x7m22); let t_b8_8 = vmulq_f32(self.twiddle6im, x8m21); let t_b8_9 = vmulq_f32(self.twiddle14im, x9m20); let t_b8_10 = vmulq_f32(self.twiddle7im, x10m19); let t_b8_11 = vmulq_f32(self.twiddle1im, x11m18); let t_b8_12 = vmulq_f32(self.twiddle9im, x12m17); let t_b8_13 = vmulq_f32(self.twiddle12im, x13m16); let t_b8_14 = vmulq_f32(self.twiddle4im, x14m15); let t_b9_1 = vmulq_f32(self.twiddle9im, x1m28); let t_b9_2 = vmulq_f32(self.twiddle11im, x2m27); let t_b9_3 = vmulq_f32(self.twiddle2im, x3m26); let t_b9_4 = vmulq_f32(self.twiddle7im, x4m25); let t_b9_5 = vmulq_f32(self.twiddle13im, x5m24); let t_b9_6 = vmulq_f32(self.twiddle4im, x6m23); let t_b9_7 = vmulq_f32(self.twiddle5im, x7m22); let t_b9_8 = vmulq_f32(self.twiddle14im, x8m21); let t_b9_9 = vmulq_f32(self.twiddle6im, x9m20); let t_b9_10 = vmulq_f32(self.twiddle3im, x10m19); let t_b9_11 = vmulq_f32(self.twiddle12im, x11m18); let t_b9_12 = vmulq_f32(self.twiddle8im, x12m17); let t_b9_13 = vmulq_f32(self.twiddle1im, x13m16); let t_b9_14 = vmulq_f32(self.twiddle10im, x14m15); let t_b10_1 = vmulq_f32(self.twiddle10im, x1m28); let t_b10_2 = vmulq_f32(self.twiddle9im, x2m27); let t_b10_3 = vmulq_f32(self.twiddle1im, x3m26); let t_b10_4 = vmulq_f32(self.twiddle11im, x4m25); let t_b10_5 = vmulq_f32(self.twiddle8im, x5m24); let t_b10_6 = vmulq_f32(self.twiddle2im, x6m23); let t_b10_7 = vmulq_f32(self.twiddle12im, x7m22); let t_b10_8 = 
vmulq_f32(self.twiddle7im, x8m21); let t_b10_9 = vmulq_f32(self.twiddle3im, x9m20); let t_b10_10 = vmulq_f32(self.twiddle13im, x10m19); let t_b10_11 = vmulq_f32(self.twiddle6im, x11m18); let t_b10_12 = vmulq_f32(self.twiddle4im, x12m17); let t_b10_13 = vmulq_f32(self.twiddle14im, x13m16); let t_b10_14 = vmulq_f32(self.twiddle5im, x14m15); let t_b11_1 = vmulq_f32(self.twiddle11im, x1m28); let t_b11_2 = vmulq_f32(self.twiddle7im, x2m27); let t_b11_3 = vmulq_f32(self.twiddle4im, x3m26); let t_b11_4 = vmulq_f32(self.twiddle14im, x4m25); let t_b11_5 = vmulq_f32(self.twiddle3im, x5m24); let t_b11_6 = vmulq_f32(self.twiddle8im, x6m23); let t_b11_7 = vmulq_f32(self.twiddle10im, x7m22); let t_b11_8 = vmulq_f32(self.twiddle1im, x8m21); let t_b11_9 = vmulq_f32(self.twiddle12im, x9m20); let t_b11_10 = vmulq_f32(self.twiddle6im, x10m19); let t_b11_11 = vmulq_f32(self.twiddle5im, x11m18); let t_b11_12 = vmulq_f32(self.twiddle13im, x12m17); let t_b11_13 = vmulq_f32(self.twiddle2im, x13m16); let t_b11_14 = vmulq_f32(self.twiddle9im, x14m15); let t_b12_1 = vmulq_f32(self.twiddle12im, x1m28); let t_b12_2 = vmulq_f32(self.twiddle5im, x2m27); let t_b12_3 = vmulq_f32(self.twiddle7im, x3m26); let t_b12_4 = vmulq_f32(self.twiddle10im, x4m25); let t_b12_5 = vmulq_f32(self.twiddle2im, x5m24); let t_b12_6 = vmulq_f32(self.twiddle14im, x6m23); let t_b12_7 = vmulq_f32(self.twiddle3im, x7m22); let t_b12_8 = vmulq_f32(self.twiddle9im, x8m21); let t_b12_9 = vmulq_f32(self.twiddle8im, x9m20); let t_b12_10 = vmulq_f32(self.twiddle4im, x10m19); let t_b12_11 = vmulq_f32(self.twiddle13im, x11m18); let t_b12_12 = vmulq_f32(self.twiddle1im, x12m17); let t_b12_13 = vmulq_f32(self.twiddle11im, x13m16); let t_b12_14 = vmulq_f32(self.twiddle6im, x14m15); let t_b13_1 = vmulq_f32(self.twiddle13im, x1m28); let t_b13_2 = vmulq_f32(self.twiddle3im, x2m27); let t_b13_3 = vmulq_f32(self.twiddle10im, x3m26); let t_b13_4 = vmulq_f32(self.twiddle6im, x4m25); let t_b13_5 = vmulq_f32(self.twiddle7im, x5m24); let t_b13_6 = vmulq_f32(self.twiddle9im, x6m23); let t_b13_7 = vmulq_f32(self.twiddle4im, x7m22); let t_b13_8 = vmulq_f32(self.twiddle12im, x8m21); let t_b13_9 = vmulq_f32(self.twiddle1im, x9m20); let t_b13_10 = vmulq_f32(self.twiddle14im, x10m19); let t_b13_11 = vmulq_f32(self.twiddle2im, x11m18); let t_b13_12 = vmulq_f32(self.twiddle11im, x12m17); let t_b13_13 = vmulq_f32(self.twiddle5im, x13m16); let t_b13_14 = vmulq_f32(self.twiddle8im, x14m15); let t_b14_1 = vmulq_f32(self.twiddle14im, x1m28); let t_b14_2 = vmulq_f32(self.twiddle1im, x2m27); let t_b14_3 = vmulq_f32(self.twiddle13im, x3m26); let t_b14_4 = vmulq_f32(self.twiddle2im, x4m25); let t_b14_5 = vmulq_f32(self.twiddle12im, x5m24); let t_b14_6 = vmulq_f32(self.twiddle3im, x6m23); let t_b14_7 = vmulq_f32(self.twiddle11im, x7m22); let t_b14_8 = vmulq_f32(self.twiddle4im, x8m21); let t_b14_9 = vmulq_f32(self.twiddle10im, x9m20); let t_b14_10 = vmulq_f32(self.twiddle5im, x10m19); let t_b14_11 = vmulq_f32(self.twiddle9im, x11m18); let t_b14_12 = vmulq_f32(self.twiddle6im, x12m17); let t_b14_13 = vmulq_f32(self.twiddle8im, x13m16); let t_b14_14 = vmulq_f32(self.twiddle7im, x14m15); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + 
t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14); let t_a7 = calc_f32!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14); let t_a8 = calc_f32!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14); let t_a9 = calc_f32!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14); let t_a10 = calc_f32!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14); let t_a11 = calc_f32!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14); let t_a12 = calc_f32!(x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14); let t_a13 = calc_f32!(x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14); let t_a14 = calc_f32!(x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 + t_b3_3 + t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 + t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14); let t_b4 = calc_f32!(t_b4_1 + t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + t_b4_9 + t_b4_10 - t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14); let t_b5 = calc_f32!(t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6 + t_b5_7 + t_b5_8 - t_b5_9 - t_b5_10 - t_b5_11 + t_b5_12 + t_b5_13 + t_b5_14); let t_b6 = calc_f32!(t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 + t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 + t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14); let t_b7 = calc_f32!(t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 - t_b7_11 - t_b7_12 + t_b7_13 + t_b7_14); let t_b8 = calc_f32!(t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 + t_b8_11 + t_b8_12 - t_b8_13 - t_b8_14); let t_b9 = calc_f32!(t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 - t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 + t_b9_10 + t_b9_11 - t_b9_12 + t_b9_13 + t_b9_14); let t_b10 = calc_f32!(t_b10_1 - t_b10_2 + t_b10_3 + t_b10_4 - t_b10_5 + t_b10_6 + t_b10_7 - t_b10_8 + t_b10_9 + t_b10_10 - t_b10_11 + t_b10_12 + t_b10_13 - t_b10_14); let t_b11 = calc_f32!(t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 - t_b11_5 + t_b11_6 - t_b11_7 + t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11 
- t_b11_12 - t_b11_13 + t_b11_14); let t_b12 = calc_f32!(t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 + t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 - t_b12_9 + t_b12_10 - t_b12_11 - t_b12_12 + t_b12_13 - t_b12_14); let t_b13 = calc_f32!(t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 + t_b13_7 - t_b13_8 + t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 - t_b13_13 + t_b13_14); let t_b14 = calc_f32!(t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 + t_b14_11 - t_b14_12 + t_b14_13 - t_b14_14); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let t_b9_rot = self.rotate.rotate_both(t_b9); let t_b10_rot = self.rotate.rotate_both(t_b10); let t_b11_rot = self.rotate.rotate_both(t_b11); let t_b12_rot = self.rotate.rotate_both(t_b12); let t_b13_rot = self.rotate.rotate_both(t_b13); let t_b14_rot = self.rotate.rotate_both(t_b14); let y0 = calc_f32!(x0 + x1p28 + x2p27 + x3p26 + x4p25 + x5p24 + x6p23 + x7p22 + x8p21 + x9p20 + x10p19 + x11p18 + x12p17 + x13p16 + x14p15); let [y1, y28] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y27] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y26] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y25] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y24] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y23] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); let [y7, y22] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot); let [y8, y21] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot); let [y9, y20] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot); let [y10, y19] = parallel_fft2_interleaved_f32(t_a10, t_b10_rot); let [y11, y18] = parallel_fft2_interleaved_f32(t_a11, t_b11_rot); let [y12, y17] = parallel_fft2_interleaved_f32(t_a12, t_b12_rot); let [y13, y16] = parallel_fft2_interleaved_f32(t_a13, t_b13_rot); let [y14, y15] = parallel_fft2_interleaved_f32(t_a14, t_b14_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, y23, y24, y25, y26, y27, y28] } } // ____ ___ __ _ _ _ _ _ // |___ \ / _ \ / /_ | || | | |__ (_) |_ // __) | (_) | _____ | '_ \| || |_| '_ \| | __| // / __/ \__, | |_____| | (_) |__ _| |_) | | |_ // |_____| /_/ \___/ |_| |_.__/|_|\__| // pub struct NeonF64Butterfly29 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: float64x2_t, twiddle1im: float64x2_t, twiddle2re: float64x2_t, twiddle2im: float64x2_t, twiddle3re: float64x2_t, twiddle3im: float64x2_t, twiddle4re: float64x2_t, twiddle4im: float64x2_t, twiddle5re: float64x2_t, twiddle5im: float64x2_t, twiddle6re: float64x2_t, twiddle6im: float64x2_t, twiddle7re: float64x2_t, twiddle7im: float64x2_t, twiddle8re: float64x2_t, twiddle8im: float64x2_t, twiddle9re: float64x2_t, twiddle9im: float64x2_t, twiddle10re: float64x2_t, twiddle10im: float64x2_t, twiddle11re: float64x2_t, twiddle11im: float64x2_t, twiddle12re: float64x2_t, twiddle12im: float64x2_t, twiddle13re: float64x2_t, twiddle13im: float64x2_t, twiddle14re: float64x2_t, twiddle14im: float64x2_t, } boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly29, 29, |this: &NeonF64Butterfly29<_>| this .direction); 
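// Structure shared by these prime-size NEON butterflies:
//
// * For an odd prime length N (29 here, 31 below), the DFT is evaluated directly from its
//   definition using the conjugate symmetry of the twiddle factors. Each input pair
//   x[k], x[N-k] is first combined with a length-2 butterfly, producing the `xKpM` sums
//   and `xKmM` differences.
// * For every output index j in 1..=(N-1)/2, two accumulators are built:
//       t_a_j = x[0] + sum_k Re(w^(j*k)) * (x[k] + x[N-k])    // the t_aJ_K terms
//       t_b_j =        sum_k Im(w^(j*k)) * (x[k] - x[N-k])    // the t_bJ_K terms
//   with w = exp(-2*pi*i/N) (conjugated for the inverse direction). Only the twiddles for
//   exponents 1..=(N-1)/2 are stored; the permuted `twiddleXre`/`twiddleXim` indices and
//   the +/- signs in the t_b sums fold j*k mod N back into that range.
// * Because the twiddle real/imaginary parts are applied as broadcast scalars, t_b_j still
//   lacks its factor of i; `rotate`/`rotate_both` supplies it as a 90-degree rotation, and
//   a final length-2 butterfly yields y[j] = t_a_j + i*t_b_j and y[N-j] = t_a_j - i*t_b_j.
// * In the f32 butterflies each float32x4_t packs two complex values, so the `parallel`
//   code paths compute two interleaved FFTs at once; the extract_lo_hi/extract_hi_lo
//   shuffles handle the (de)interleaving. The f64 butterflies hold one complex value per
//   float64x2_t.
//
// A rough scalar sketch of the same recurrence for a single output pair, using
// hypothetical names (x, y, j, k, n) and plain f64 arithmetic instead of NEON intrinsics:
//
//     // use num_complex::Complex;
//     // use std::f64::consts::PI;
//     // let w = |m: usize| Complex::from_polar(1.0, -2.0 * PI * ((m % n) as f64) / n as f64);
//     // let mut t_a = x[0];
//     // let mut t_b = Complex::new(0.0, 0.0);
//     // for k in 1..=(n - 1) / 2 {
//     //     t_a += (x[k] + x[n - k]) * w(j * k).re;
//     //     t_b += (x[k] - x[n - k]) * w(j * k).im;
//     // }
//     // let y_j = t_a + t_b * Complex::new(0.0, 1.0);  // y[j]
//     // let y_nj = t_a - t_b * Complex::new(0.0, 1.0); // y[N - j]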
boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly29, 29, |this: &NeonF64Butterfly29<_>| this .direction); impl NeonF64Butterfly29 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 29, direction); let tw2: Complex = twiddles::compute_twiddle(2, 29, direction); let tw3: Complex = twiddles::compute_twiddle(3, 29, direction); let tw4: Complex = twiddles::compute_twiddle(4, 29, direction); let tw5: Complex = twiddles::compute_twiddle(5, 29, direction); let tw6: Complex = twiddles::compute_twiddle(6, 29, direction); let tw7: Complex = twiddles::compute_twiddle(7, 29, direction); let tw8: Complex = twiddles::compute_twiddle(8, 29, direction); let tw9: Complex = twiddles::compute_twiddle(9, 29, direction); let tw10: Complex = twiddles::compute_twiddle(10, 29, direction); let tw11: Complex = twiddles::compute_twiddle(11, 29, direction); let tw12: Complex = twiddles::compute_twiddle(12, 29, direction); let tw13: Complex = twiddles::compute_twiddle(13, 29, direction); let tw14: Complex = twiddles::compute_twiddle(14, 29, direction); let twiddle1re = unsafe { vmovq_n_f64(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f64(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f64(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f64(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f64(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f64(tw3.im) }; let twiddle4re = unsafe { vmovq_n_f64(tw4.re) }; let twiddle4im = unsafe { vmovq_n_f64(tw4.im) }; let twiddle5re = unsafe { vmovq_n_f64(tw5.re) }; let twiddle5im = unsafe { vmovq_n_f64(tw5.im) }; let twiddle6re = unsafe { vmovq_n_f64(tw6.re) }; let twiddle6im = unsafe { vmovq_n_f64(tw6.im) }; let twiddle7re = unsafe { vmovq_n_f64(tw7.re) }; let twiddle7im = unsafe { vmovq_n_f64(tw7.im) }; let twiddle8re = unsafe { vmovq_n_f64(tw8.re) }; let twiddle8im = unsafe { vmovq_n_f64(tw8.im) }; let twiddle9re = unsafe { vmovq_n_f64(tw9.re) }; let twiddle9im = unsafe { vmovq_n_f64(tw9.im) }; let twiddle10re = unsafe { vmovq_n_f64(tw10.re) }; let twiddle10im = unsafe { vmovq_n_f64(tw10.im) }; let twiddle11re = unsafe { vmovq_n_f64(tw11.re) }; let twiddle11im = unsafe { vmovq_n_f64(tw11.im) }; let twiddle12re = unsafe { vmovq_n_f64(tw12.re) }; let twiddle12im = unsafe { vmovq_n_f64(tw12.im) }; let twiddle13re = unsafe { vmovq_n_f64(tw13.re) }; let twiddle13im = unsafe { vmovq_n_f64(tw13.im) }; let twiddle14re = unsafe { vmovq_n_f64(tw14.re) }; let twiddle14im = unsafe { vmovq_n_f64(tw14.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, twiddle12re, twiddle12im, twiddle13re, twiddle13im, twiddle14re, twiddle14im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 29]) -> [float64x2_t; 29] { let [x1p28, x1m28] = 
solo_fft2_f64(values[1], values[28]); let [x2p27, x2m27] = solo_fft2_f64(values[2], values[27]); let [x3p26, x3m26] = solo_fft2_f64(values[3], values[26]); let [x4p25, x4m25] = solo_fft2_f64(values[4], values[25]); let [x5p24, x5m24] = solo_fft2_f64(values[5], values[24]); let [x6p23, x6m23] = solo_fft2_f64(values[6], values[23]); let [x7p22, x7m22] = solo_fft2_f64(values[7], values[22]); let [x8p21, x8m21] = solo_fft2_f64(values[8], values[21]); let [x9p20, x9m20] = solo_fft2_f64(values[9], values[20]); let [x10p19, x10m19] = solo_fft2_f64(values[10], values[19]); let [x11p18, x11m18] = solo_fft2_f64(values[11], values[18]); let [x12p17, x12m17] = solo_fft2_f64(values[12], values[17]); let [x13p16, x13m16] = solo_fft2_f64(values[13], values[16]); let [x14p15, x14m15] = solo_fft2_f64(values[14], values[15]); let t_a1_1 = vmulq_f64(self.twiddle1re, x1p28); let t_a1_2 = vmulq_f64(self.twiddle2re, x2p27); let t_a1_3 = vmulq_f64(self.twiddle3re, x3p26); let t_a1_4 = vmulq_f64(self.twiddle4re, x4p25); let t_a1_5 = vmulq_f64(self.twiddle5re, x5p24); let t_a1_6 = vmulq_f64(self.twiddle6re, x6p23); let t_a1_7 = vmulq_f64(self.twiddle7re, x7p22); let t_a1_8 = vmulq_f64(self.twiddle8re, x8p21); let t_a1_9 = vmulq_f64(self.twiddle9re, x9p20); let t_a1_10 = vmulq_f64(self.twiddle10re, x10p19); let t_a1_11 = vmulq_f64(self.twiddle11re, x11p18); let t_a1_12 = vmulq_f64(self.twiddle12re, x12p17); let t_a1_13 = vmulq_f64(self.twiddle13re, x13p16); let t_a1_14 = vmulq_f64(self.twiddle14re, x14p15); let t_a2_1 = vmulq_f64(self.twiddle2re, x1p28); let t_a2_2 = vmulq_f64(self.twiddle4re, x2p27); let t_a2_3 = vmulq_f64(self.twiddle6re, x3p26); let t_a2_4 = vmulq_f64(self.twiddle8re, x4p25); let t_a2_5 = vmulq_f64(self.twiddle10re, x5p24); let t_a2_6 = vmulq_f64(self.twiddle12re, x6p23); let t_a2_7 = vmulq_f64(self.twiddle14re, x7p22); let t_a2_8 = vmulq_f64(self.twiddle13re, x8p21); let t_a2_9 = vmulq_f64(self.twiddle11re, x9p20); let t_a2_10 = vmulq_f64(self.twiddle9re, x10p19); let t_a2_11 = vmulq_f64(self.twiddle7re, x11p18); let t_a2_12 = vmulq_f64(self.twiddle5re, x12p17); let t_a2_13 = vmulq_f64(self.twiddle3re, x13p16); let t_a2_14 = vmulq_f64(self.twiddle1re, x14p15); let t_a3_1 = vmulq_f64(self.twiddle3re, x1p28); let t_a3_2 = vmulq_f64(self.twiddle6re, x2p27); let t_a3_3 = vmulq_f64(self.twiddle9re, x3p26); let t_a3_4 = vmulq_f64(self.twiddle12re, x4p25); let t_a3_5 = vmulq_f64(self.twiddle14re, x5p24); let t_a3_6 = vmulq_f64(self.twiddle11re, x6p23); let t_a3_7 = vmulq_f64(self.twiddle8re, x7p22); let t_a3_8 = vmulq_f64(self.twiddle5re, x8p21); let t_a3_9 = vmulq_f64(self.twiddle2re, x9p20); let t_a3_10 = vmulq_f64(self.twiddle1re, x10p19); let t_a3_11 = vmulq_f64(self.twiddle4re, x11p18); let t_a3_12 = vmulq_f64(self.twiddle7re, x12p17); let t_a3_13 = vmulq_f64(self.twiddle10re, x13p16); let t_a3_14 = vmulq_f64(self.twiddle13re, x14p15); let t_a4_1 = vmulq_f64(self.twiddle4re, x1p28); let t_a4_2 = vmulq_f64(self.twiddle8re, x2p27); let t_a4_3 = vmulq_f64(self.twiddle12re, x3p26); let t_a4_4 = vmulq_f64(self.twiddle13re, x4p25); let t_a4_5 = vmulq_f64(self.twiddle9re, x5p24); let t_a4_6 = vmulq_f64(self.twiddle5re, x6p23); let t_a4_7 = vmulq_f64(self.twiddle1re, x7p22); let t_a4_8 = vmulq_f64(self.twiddle3re, x8p21); let t_a4_9 = vmulq_f64(self.twiddle7re, x9p20); let t_a4_10 = vmulq_f64(self.twiddle11re, x10p19); let t_a4_11 = vmulq_f64(self.twiddle14re, x11p18); let t_a4_12 = vmulq_f64(self.twiddle10re, x12p17); let t_a4_13 = vmulq_f64(self.twiddle6re, x13p16); let t_a4_14 = 
vmulq_f64(self.twiddle2re, x14p15); let t_a5_1 = vmulq_f64(self.twiddle5re, x1p28); let t_a5_2 = vmulq_f64(self.twiddle10re, x2p27); let t_a5_3 = vmulq_f64(self.twiddle14re, x3p26); let t_a5_4 = vmulq_f64(self.twiddle9re, x4p25); let t_a5_5 = vmulq_f64(self.twiddle4re, x5p24); let t_a5_6 = vmulq_f64(self.twiddle1re, x6p23); let t_a5_7 = vmulq_f64(self.twiddle6re, x7p22); let t_a5_8 = vmulq_f64(self.twiddle11re, x8p21); let t_a5_9 = vmulq_f64(self.twiddle13re, x9p20); let t_a5_10 = vmulq_f64(self.twiddle8re, x10p19); let t_a5_11 = vmulq_f64(self.twiddle3re, x11p18); let t_a5_12 = vmulq_f64(self.twiddle2re, x12p17); let t_a5_13 = vmulq_f64(self.twiddle7re, x13p16); let t_a5_14 = vmulq_f64(self.twiddle12re, x14p15); let t_a6_1 = vmulq_f64(self.twiddle6re, x1p28); let t_a6_2 = vmulq_f64(self.twiddle12re, x2p27); let t_a6_3 = vmulq_f64(self.twiddle11re, x3p26); let t_a6_4 = vmulq_f64(self.twiddle5re, x4p25); let t_a6_5 = vmulq_f64(self.twiddle1re, x5p24); let t_a6_6 = vmulq_f64(self.twiddle7re, x6p23); let t_a6_7 = vmulq_f64(self.twiddle13re, x7p22); let t_a6_8 = vmulq_f64(self.twiddle10re, x8p21); let t_a6_9 = vmulq_f64(self.twiddle4re, x9p20); let t_a6_10 = vmulq_f64(self.twiddle2re, x10p19); let t_a6_11 = vmulq_f64(self.twiddle8re, x11p18); let t_a6_12 = vmulq_f64(self.twiddle14re, x12p17); let t_a6_13 = vmulq_f64(self.twiddle9re, x13p16); let t_a6_14 = vmulq_f64(self.twiddle3re, x14p15); let t_a7_1 = vmulq_f64(self.twiddle7re, x1p28); let t_a7_2 = vmulq_f64(self.twiddle14re, x2p27); let t_a7_3 = vmulq_f64(self.twiddle8re, x3p26); let t_a7_4 = vmulq_f64(self.twiddle1re, x4p25); let t_a7_5 = vmulq_f64(self.twiddle6re, x5p24); let t_a7_6 = vmulq_f64(self.twiddle13re, x6p23); let t_a7_7 = vmulq_f64(self.twiddle9re, x7p22); let t_a7_8 = vmulq_f64(self.twiddle2re, x8p21); let t_a7_9 = vmulq_f64(self.twiddle5re, x9p20); let t_a7_10 = vmulq_f64(self.twiddle12re, x10p19); let t_a7_11 = vmulq_f64(self.twiddle10re, x11p18); let t_a7_12 = vmulq_f64(self.twiddle3re, x12p17); let t_a7_13 = vmulq_f64(self.twiddle4re, x13p16); let t_a7_14 = vmulq_f64(self.twiddle11re, x14p15); let t_a8_1 = vmulq_f64(self.twiddle8re, x1p28); let t_a8_2 = vmulq_f64(self.twiddle13re, x2p27); let t_a8_3 = vmulq_f64(self.twiddle5re, x3p26); let t_a8_4 = vmulq_f64(self.twiddle3re, x4p25); let t_a8_5 = vmulq_f64(self.twiddle11re, x5p24); let t_a8_6 = vmulq_f64(self.twiddle10re, x6p23); let t_a8_7 = vmulq_f64(self.twiddle2re, x7p22); let t_a8_8 = vmulq_f64(self.twiddle6re, x8p21); let t_a8_9 = vmulq_f64(self.twiddle14re, x9p20); let t_a8_10 = vmulq_f64(self.twiddle7re, x10p19); let t_a8_11 = vmulq_f64(self.twiddle1re, x11p18); let t_a8_12 = vmulq_f64(self.twiddle9re, x12p17); let t_a8_13 = vmulq_f64(self.twiddle12re, x13p16); let t_a8_14 = vmulq_f64(self.twiddle4re, x14p15); let t_a9_1 = vmulq_f64(self.twiddle9re, x1p28); let t_a9_2 = vmulq_f64(self.twiddle11re, x2p27); let t_a9_3 = vmulq_f64(self.twiddle2re, x3p26); let t_a9_4 = vmulq_f64(self.twiddle7re, x4p25); let t_a9_5 = vmulq_f64(self.twiddle13re, x5p24); let t_a9_6 = vmulq_f64(self.twiddle4re, x6p23); let t_a9_7 = vmulq_f64(self.twiddle5re, x7p22); let t_a9_8 = vmulq_f64(self.twiddle14re, x8p21); let t_a9_9 = vmulq_f64(self.twiddle6re, x9p20); let t_a9_10 = vmulq_f64(self.twiddle3re, x10p19); let t_a9_11 = vmulq_f64(self.twiddle12re, x11p18); let t_a9_12 = vmulq_f64(self.twiddle8re, x12p17); let t_a9_13 = vmulq_f64(self.twiddle1re, x13p16); let t_a9_14 = vmulq_f64(self.twiddle10re, x14p15); let t_a10_1 = vmulq_f64(self.twiddle10re, x1p28); let t_a10_2 = 
vmulq_f64(self.twiddle9re, x2p27); let t_a10_3 = vmulq_f64(self.twiddle1re, x3p26); let t_a10_4 = vmulq_f64(self.twiddle11re, x4p25); let t_a10_5 = vmulq_f64(self.twiddle8re, x5p24); let t_a10_6 = vmulq_f64(self.twiddle2re, x6p23); let t_a10_7 = vmulq_f64(self.twiddle12re, x7p22); let t_a10_8 = vmulq_f64(self.twiddle7re, x8p21); let t_a10_9 = vmulq_f64(self.twiddle3re, x9p20); let t_a10_10 = vmulq_f64(self.twiddle13re, x10p19); let t_a10_11 = vmulq_f64(self.twiddle6re, x11p18); let t_a10_12 = vmulq_f64(self.twiddle4re, x12p17); let t_a10_13 = vmulq_f64(self.twiddle14re, x13p16); let t_a10_14 = vmulq_f64(self.twiddle5re, x14p15); let t_a11_1 = vmulq_f64(self.twiddle11re, x1p28); let t_a11_2 = vmulq_f64(self.twiddle7re, x2p27); let t_a11_3 = vmulq_f64(self.twiddle4re, x3p26); let t_a11_4 = vmulq_f64(self.twiddle14re, x4p25); let t_a11_5 = vmulq_f64(self.twiddle3re, x5p24); let t_a11_6 = vmulq_f64(self.twiddle8re, x6p23); let t_a11_7 = vmulq_f64(self.twiddle10re, x7p22); let t_a11_8 = vmulq_f64(self.twiddle1re, x8p21); let t_a11_9 = vmulq_f64(self.twiddle12re, x9p20); let t_a11_10 = vmulq_f64(self.twiddle6re, x10p19); let t_a11_11 = vmulq_f64(self.twiddle5re, x11p18); let t_a11_12 = vmulq_f64(self.twiddle13re, x12p17); let t_a11_13 = vmulq_f64(self.twiddle2re, x13p16); let t_a11_14 = vmulq_f64(self.twiddle9re, x14p15); let t_a12_1 = vmulq_f64(self.twiddle12re, x1p28); let t_a12_2 = vmulq_f64(self.twiddle5re, x2p27); let t_a12_3 = vmulq_f64(self.twiddle7re, x3p26); let t_a12_4 = vmulq_f64(self.twiddle10re, x4p25); let t_a12_5 = vmulq_f64(self.twiddle2re, x5p24); let t_a12_6 = vmulq_f64(self.twiddle14re, x6p23); let t_a12_7 = vmulq_f64(self.twiddle3re, x7p22); let t_a12_8 = vmulq_f64(self.twiddle9re, x8p21); let t_a12_9 = vmulq_f64(self.twiddle8re, x9p20); let t_a12_10 = vmulq_f64(self.twiddle4re, x10p19); let t_a12_11 = vmulq_f64(self.twiddle13re, x11p18); let t_a12_12 = vmulq_f64(self.twiddle1re, x12p17); let t_a12_13 = vmulq_f64(self.twiddle11re, x13p16); let t_a12_14 = vmulq_f64(self.twiddle6re, x14p15); let t_a13_1 = vmulq_f64(self.twiddle13re, x1p28); let t_a13_2 = vmulq_f64(self.twiddle3re, x2p27); let t_a13_3 = vmulq_f64(self.twiddle10re, x3p26); let t_a13_4 = vmulq_f64(self.twiddle6re, x4p25); let t_a13_5 = vmulq_f64(self.twiddle7re, x5p24); let t_a13_6 = vmulq_f64(self.twiddle9re, x6p23); let t_a13_7 = vmulq_f64(self.twiddle4re, x7p22); let t_a13_8 = vmulq_f64(self.twiddle12re, x8p21); let t_a13_9 = vmulq_f64(self.twiddle1re, x9p20); let t_a13_10 = vmulq_f64(self.twiddle14re, x10p19); let t_a13_11 = vmulq_f64(self.twiddle2re, x11p18); let t_a13_12 = vmulq_f64(self.twiddle11re, x12p17); let t_a13_13 = vmulq_f64(self.twiddle5re, x13p16); let t_a13_14 = vmulq_f64(self.twiddle8re, x14p15); let t_a14_1 = vmulq_f64(self.twiddle14re, x1p28); let t_a14_2 = vmulq_f64(self.twiddle1re, x2p27); let t_a14_3 = vmulq_f64(self.twiddle13re, x3p26); let t_a14_4 = vmulq_f64(self.twiddle2re, x4p25); let t_a14_5 = vmulq_f64(self.twiddle12re, x5p24); let t_a14_6 = vmulq_f64(self.twiddle3re, x6p23); let t_a14_7 = vmulq_f64(self.twiddle11re, x7p22); let t_a14_8 = vmulq_f64(self.twiddle4re, x8p21); let t_a14_9 = vmulq_f64(self.twiddle10re, x9p20); let t_a14_10 = vmulq_f64(self.twiddle5re, x10p19); let t_a14_11 = vmulq_f64(self.twiddle9re, x11p18); let t_a14_12 = vmulq_f64(self.twiddle6re, x12p17); let t_a14_13 = vmulq_f64(self.twiddle8re, x13p16); let t_a14_14 = vmulq_f64(self.twiddle7re, x14p15); let t_b1_1 = vmulq_f64(self.twiddle1im, x1m28); let t_b1_2 = vmulq_f64(self.twiddle2im, x2m27); let t_b1_3 = 
vmulq_f64(self.twiddle3im, x3m26); let t_b1_4 = vmulq_f64(self.twiddle4im, x4m25); let t_b1_5 = vmulq_f64(self.twiddle5im, x5m24); let t_b1_6 = vmulq_f64(self.twiddle6im, x6m23); let t_b1_7 = vmulq_f64(self.twiddle7im, x7m22); let t_b1_8 = vmulq_f64(self.twiddle8im, x8m21); let t_b1_9 = vmulq_f64(self.twiddle9im, x9m20); let t_b1_10 = vmulq_f64(self.twiddle10im, x10m19); let t_b1_11 = vmulq_f64(self.twiddle11im, x11m18); let t_b1_12 = vmulq_f64(self.twiddle12im, x12m17); let t_b1_13 = vmulq_f64(self.twiddle13im, x13m16); let t_b1_14 = vmulq_f64(self.twiddle14im, x14m15); let t_b2_1 = vmulq_f64(self.twiddle2im, x1m28); let t_b2_2 = vmulq_f64(self.twiddle4im, x2m27); let t_b2_3 = vmulq_f64(self.twiddle6im, x3m26); let t_b2_4 = vmulq_f64(self.twiddle8im, x4m25); let t_b2_5 = vmulq_f64(self.twiddle10im, x5m24); let t_b2_6 = vmulq_f64(self.twiddle12im, x6m23); let t_b2_7 = vmulq_f64(self.twiddle14im, x7m22); let t_b2_8 = vmulq_f64(self.twiddle13im, x8m21); let t_b2_9 = vmulq_f64(self.twiddle11im, x9m20); let t_b2_10 = vmulq_f64(self.twiddle9im, x10m19); let t_b2_11 = vmulq_f64(self.twiddle7im, x11m18); let t_b2_12 = vmulq_f64(self.twiddle5im, x12m17); let t_b2_13 = vmulq_f64(self.twiddle3im, x13m16); let t_b2_14 = vmulq_f64(self.twiddle1im, x14m15); let t_b3_1 = vmulq_f64(self.twiddle3im, x1m28); let t_b3_2 = vmulq_f64(self.twiddle6im, x2m27); let t_b3_3 = vmulq_f64(self.twiddle9im, x3m26); let t_b3_4 = vmulq_f64(self.twiddle12im, x4m25); let t_b3_5 = vmulq_f64(self.twiddle14im, x5m24); let t_b3_6 = vmulq_f64(self.twiddle11im, x6m23); let t_b3_7 = vmulq_f64(self.twiddle8im, x7m22); let t_b3_8 = vmulq_f64(self.twiddle5im, x8m21); let t_b3_9 = vmulq_f64(self.twiddle2im, x9m20); let t_b3_10 = vmulq_f64(self.twiddle1im, x10m19); let t_b3_11 = vmulq_f64(self.twiddle4im, x11m18); let t_b3_12 = vmulq_f64(self.twiddle7im, x12m17); let t_b3_13 = vmulq_f64(self.twiddle10im, x13m16); let t_b3_14 = vmulq_f64(self.twiddle13im, x14m15); let t_b4_1 = vmulq_f64(self.twiddle4im, x1m28); let t_b4_2 = vmulq_f64(self.twiddle8im, x2m27); let t_b4_3 = vmulq_f64(self.twiddle12im, x3m26); let t_b4_4 = vmulq_f64(self.twiddle13im, x4m25); let t_b4_5 = vmulq_f64(self.twiddle9im, x5m24); let t_b4_6 = vmulq_f64(self.twiddle5im, x6m23); let t_b4_7 = vmulq_f64(self.twiddle1im, x7m22); let t_b4_8 = vmulq_f64(self.twiddle3im, x8m21); let t_b4_9 = vmulq_f64(self.twiddle7im, x9m20); let t_b4_10 = vmulq_f64(self.twiddle11im, x10m19); let t_b4_11 = vmulq_f64(self.twiddle14im, x11m18); let t_b4_12 = vmulq_f64(self.twiddle10im, x12m17); let t_b4_13 = vmulq_f64(self.twiddle6im, x13m16); let t_b4_14 = vmulq_f64(self.twiddle2im, x14m15); let t_b5_1 = vmulq_f64(self.twiddle5im, x1m28); let t_b5_2 = vmulq_f64(self.twiddle10im, x2m27); let t_b5_3 = vmulq_f64(self.twiddle14im, x3m26); let t_b5_4 = vmulq_f64(self.twiddle9im, x4m25); let t_b5_5 = vmulq_f64(self.twiddle4im, x5m24); let t_b5_6 = vmulq_f64(self.twiddle1im, x6m23); let t_b5_7 = vmulq_f64(self.twiddle6im, x7m22); let t_b5_8 = vmulq_f64(self.twiddle11im, x8m21); let t_b5_9 = vmulq_f64(self.twiddle13im, x9m20); let t_b5_10 = vmulq_f64(self.twiddle8im, x10m19); let t_b5_11 = vmulq_f64(self.twiddle3im, x11m18); let t_b5_12 = vmulq_f64(self.twiddle2im, x12m17); let t_b5_13 = vmulq_f64(self.twiddle7im, x13m16); let t_b5_14 = vmulq_f64(self.twiddle12im, x14m15); let t_b6_1 = vmulq_f64(self.twiddle6im, x1m28); let t_b6_2 = vmulq_f64(self.twiddle12im, x2m27); let t_b6_3 = vmulq_f64(self.twiddle11im, x3m26); let t_b6_4 = vmulq_f64(self.twiddle5im, x4m25); let t_b6_5 = 
vmulq_f64(self.twiddle1im, x5m24); let t_b6_6 = vmulq_f64(self.twiddle7im, x6m23); let t_b6_7 = vmulq_f64(self.twiddle13im, x7m22); let t_b6_8 = vmulq_f64(self.twiddle10im, x8m21); let t_b6_9 = vmulq_f64(self.twiddle4im, x9m20); let t_b6_10 = vmulq_f64(self.twiddle2im, x10m19); let t_b6_11 = vmulq_f64(self.twiddle8im, x11m18); let t_b6_12 = vmulq_f64(self.twiddle14im, x12m17); let t_b6_13 = vmulq_f64(self.twiddle9im, x13m16); let t_b6_14 = vmulq_f64(self.twiddle3im, x14m15); let t_b7_1 = vmulq_f64(self.twiddle7im, x1m28); let t_b7_2 = vmulq_f64(self.twiddle14im, x2m27); let t_b7_3 = vmulq_f64(self.twiddle8im, x3m26); let t_b7_4 = vmulq_f64(self.twiddle1im, x4m25); let t_b7_5 = vmulq_f64(self.twiddle6im, x5m24); let t_b7_6 = vmulq_f64(self.twiddle13im, x6m23); let t_b7_7 = vmulq_f64(self.twiddle9im, x7m22); let t_b7_8 = vmulq_f64(self.twiddle2im, x8m21); let t_b7_9 = vmulq_f64(self.twiddle5im, x9m20); let t_b7_10 = vmulq_f64(self.twiddle12im, x10m19); let t_b7_11 = vmulq_f64(self.twiddle10im, x11m18); let t_b7_12 = vmulq_f64(self.twiddle3im, x12m17); let t_b7_13 = vmulq_f64(self.twiddle4im, x13m16); let t_b7_14 = vmulq_f64(self.twiddle11im, x14m15); let t_b8_1 = vmulq_f64(self.twiddle8im, x1m28); let t_b8_2 = vmulq_f64(self.twiddle13im, x2m27); let t_b8_3 = vmulq_f64(self.twiddle5im, x3m26); let t_b8_4 = vmulq_f64(self.twiddle3im, x4m25); let t_b8_5 = vmulq_f64(self.twiddle11im, x5m24); let t_b8_6 = vmulq_f64(self.twiddle10im, x6m23); let t_b8_7 = vmulq_f64(self.twiddle2im, x7m22); let t_b8_8 = vmulq_f64(self.twiddle6im, x8m21); let t_b8_9 = vmulq_f64(self.twiddle14im, x9m20); let t_b8_10 = vmulq_f64(self.twiddle7im, x10m19); let t_b8_11 = vmulq_f64(self.twiddle1im, x11m18); let t_b8_12 = vmulq_f64(self.twiddle9im, x12m17); let t_b8_13 = vmulq_f64(self.twiddle12im, x13m16); let t_b8_14 = vmulq_f64(self.twiddle4im, x14m15); let t_b9_1 = vmulq_f64(self.twiddle9im, x1m28); let t_b9_2 = vmulq_f64(self.twiddle11im, x2m27); let t_b9_3 = vmulq_f64(self.twiddle2im, x3m26); let t_b9_4 = vmulq_f64(self.twiddle7im, x4m25); let t_b9_5 = vmulq_f64(self.twiddle13im, x5m24); let t_b9_6 = vmulq_f64(self.twiddle4im, x6m23); let t_b9_7 = vmulq_f64(self.twiddle5im, x7m22); let t_b9_8 = vmulq_f64(self.twiddle14im, x8m21); let t_b9_9 = vmulq_f64(self.twiddle6im, x9m20); let t_b9_10 = vmulq_f64(self.twiddle3im, x10m19); let t_b9_11 = vmulq_f64(self.twiddle12im, x11m18); let t_b9_12 = vmulq_f64(self.twiddle8im, x12m17); let t_b9_13 = vmulq_f64(self.twiddle1im, x13m16); let t_b9_14 = vmulq_f64(self.twiddle10im, x14m15); let t_b10_1 = vmulq_f64(self.twiddle10im, x1m28); let t_b10_2 = vmulq_f64(self.twiddle9im, x2m27); let t_b10_3 = vmulq_f64(self.twiddle1im, x3m26); let t_b10_4 = vmulq_f64(self.twiddle11im, x4m25); let t_b10_5 = vmulq_f64(self.twiddle8im, x5m24); let t_b10_6 = vmulq_f64(self.twiddle2im, x6m23); let t_b10_7 = vmulq_f64(self.twiddle12im, x7m22); let t_b10_8 = vmulq_f64(self.twiddle7im, x8m21); let t_b10_9 = vmulq_f64(self.twiddle3im, x9m20); let t_b10_10 = vmulq_f64(self.twiddle13im, x10m19); let t_b10_11 = vmulq_f64(self.twiddle6im, x11m18); let t_b10_12 = vmulq_f64(self.twiddle4im, x12m17); let t_b10_13 = vmulq_f64(self.twiddle14im, x13m16); let t_b10_14 = vmulq_f64(self.twiddle5im, x14m15); let t_b11_1 = vmulq_f64(self.twiddle11im, x1m28); let t_b11_2 = vmulq_f64(self.twiddle7im, x2m27); let t_b11_3 = vmulq_f64(self.twiddle4im, x3m26); let t_b11_4 = vmulq_f64(self.twiddle14im, x4m25); let t_b11_5 = vmulq_f64(self.twiddle3im, x5m24); let t_b11_6 = vmulq_f64(self.twiddle8im, x6m23); let t_b11_7 = 
vmulq_f64(self.twiddle10im, x7m22); let t_b11_8 = vmulq_f64(self.twiddle1im, x8m21); let t_b11_9 = vmulq_f64(self.twiddle12im, x9m20); let t_b11_10 = vmulq_f64(self.twiddle6im, x10m19); let t_b11_11 = vmulq_f64(self.twiddle5im, x11m18); let t_b11_12 = vmulq_f64(self.twiddle13im, x12m17); let t_b11_13 = vmulq_f64(self.twiddle2im, x13m16); let t_b11_14 = vmulq_f64(self.twiddle9im, x14m15); let t_b12_1 = vmulq_f64(self.twiddle12im, x1m28); let t_b12_2 = vmulq_f64(self.twiddle5im, x2m27); let t_b12_3 = vmulq_f64(self.twiddle7im, x3m26); let t_b12_4 = vmulq_f64(self.twiddle10im, x4m25); let t_b12_5 = vmulq_f64(self.twiddle2im, x5m24); let t_b12_6 = vmulq_f64(self.twiddle14im, x6m23); let t_b12_7 = vmulq_f64(self.twiddle3im, x7m22); let t_b12_8 = vmulq_f64(self.twiddle9im, x8m21); let t_b12_9 = vmulq_f64(self.twiddle8im, x9m20); let t_b12_10 = vmulq_f64(self.twiddle4im, x10m19); let t_b12_11 = vmulq_f64(self.twiddle13im, x11m18); let t_b12_12 = vmulq_f64(self.twiddle1im, x12m17); let t_b12_13 = vmulq_f64(self.twiddle11im, x13m16); let t_b12_14 = vmulq_f64(self.twiddle6im, x14m15); let t_b13_1 = vmulq_f64(self.twiddle13im, x1m28); let t_b13_2 = vmulq_f64(self.twiddle3im, x2m27); let t_b13_3 = vmulq_f64(self.twiddle10im, x3m26); let t_b13_4 = vmulq_f64(self.twiddle6im, x4m25); let t_b13_5 = vmulq_f64(self.twiddle7im, x5m24); let t_b13_6 = vmulq_f64(self.twiddle9im, x6m23); let t_b13_7 = vmulq_f64(self.twiddle4im, x7m22); let t_b13_8 = vmulq_f64(self.twiddle12im, x8m21); let t_b13_9 = vmulq_f64(self.twiddle1im, x9m20); let t_b13_10 = vmulq_f64(self.twiddle14im, x10m19); let t_b13_11 = vmulq_f64(self.twiddle2im, x11m18); let t_b13_12 = vmulq_f64(self.twiddle11im, x12m17); let t_b13_13 = vmulq_f64(self.twiddle5im, x13m16); let t_b13_14 = vmulq_f64(self.twiddle8im, x14m15); let t_b14_1 = vmulq_f64(self.twiddle14im, x1m28); let t_b14_2 = vmulq_f64(self.twiddle1im, x2m27); let t_b14_3 = vmulq_f64(self.twiddle13im, x3m26); let t_b14_4 = vmulq_f64(self.twiddle2im, x4m25); let t_b14_5 = vmulq_f64(self.twiddle12im, x5m24); let t_b14_6 = vmulq_f64(self.twiddle3im, x6m23); let t_b14_7 = vmulq_f64(self.twiddle11im, x7m22); let t_b14_8 = vmulq_f64(self.twiddle4im, x8m21); let t_b14_9 = vmulq_f64(self.twiddle10im, x9m20); let t_b14_10 = vmulq_f64(self.twiddle5im, x10m19); let t_b14_11 = vmulq_f64(self.twiddle9im, x11m18); let t_b14_12 = vmulq_f64(self.twiddle6im, x12m17); let t_b14_13 = vmulq_f64(self.twiddle8im, x13m16); let t_b14_14 = vmulq_f64(self.twiddle7im, x14m15); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14); let t_a7 = calc_f64!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + 
t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14); let t_a8 = calc_f64!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14); let t_a9 = calc_f64!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14); let t_a10 = calc_f64!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14); let t_a11 = calc_f64!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14); let t_a12 = calc_f64!(x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14); let t_a13 = calc_f64!(x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14); let t_a14 = calc_f64!(x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 + t_b3_3 + t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 + t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14); let t_b4 = calc_f64!(t_b4_1 + t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + t_b4_9 + t_b4_10 - t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14); let t_b5 = calc_f64!(t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6 + t_b5_7 + t_b5_8 - t_b5_9 - t_b5_10 - t_b5_11 + t_b5_12 + t_b5_13 + t_b5_14); let t_b6 = calc_f64!(t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 + t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 + t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14); let t_b7 = calc_f64!(t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 - t_b7_11 - t_b7_12 + t_b7_13 + t_b7_14); let t_b8 = calc_f64!(t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 + t_b8_11 + t_b8_12 - t_b8_13 - t_b8_14); let t_b9 = calc_f64!(t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 - t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 + t_b9_10 + t_b9_11 - t_b9_12 + t_b9_13 + t_b9_14); let t_b10 = calc_f64!(t_b10_1 - t_b10_2 + t_b10_3 + t_b10_4 - t_b10_5 + t_b10_6 + t_b10_7 - t_b10_8 + t_b10_9 + t_b10_10 - t_b10_11 + t_b10_12 + t_b10_13 - t_b10_14); let t_b11 = calc_f64!(t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 - t_b11_5 + t_b11_6 - t_b11_7 + t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11 - t_b11_12 - t_b11_13 + t_b11_14); let t_b12 = calc_f64!(t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 + t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 - t_b12_9 + t_b12_10 - t_b12_11 - t_b12_12 + t_b12_13 - t_b12_14); let t_b13 = calc_f64!(t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 + t_b13_7 - t_b13_8 + t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 - t_b13_13 + t_b13_14); let t_b14 = calc_f64!(t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 + t_b14_11 - t_b14_12 + t_b14_13 - t_b14_14); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = 
self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let t_b10_rot = self.rotate.rotate(t_b10); let t_b11_rot = self.rotate.rotate(t_b11); let t_b12_rot = self.rotate.rotate(t_b12); let t_b13_rot = self.rotate.rotate(t_b13); let t_b14_rot = self.rotate.rotate(t_b14); let y0 = calc_f64!(x0 + x1p28 + x2p27 + x3p26 + x4p25 + x5p24 + x6p23 + x7p22 + x8p21 + x9p20 + x10p19 + x11p18 + x12p17 + x13p16 + x14p15); let [y1, y28] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y27] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y26] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y25] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y24] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y23] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y22] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y21] = solo_fft2_f64(t_a8, t_b8_rot); let [y9, y20] = solo_fft2_f64(t_a9, t_b9_rot); let [y10, y19] = solo_fft2_f64(t_a10, t_b10_rot); let [y11, y18] = solo_fft2_f64(t_a11, t_b11_rot); let [y12, y17] = solo_fft2_f64(t_a12, t_b12_rot); let [y13, y16] = solo_fft2_f64(t_a13, t_b13_rot); let [y14, y15] = solo_fft2_f64(t_a14, t_b14_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, y23, y24, y25, y26, y27, y28] } } // _____ _ _________ _ _ _ // |___ // | |___ /___ \| |__ (_) |_ // |_ \| | _____ |_ \ __) | '_ \| | __| // ___) | | |_____| ___) / __/| |_) | | |_ // |____/|_| |____/_____|_.__/|_|\__| // pub struct NeonF32Butterfly31 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: float32x4_t, twiddle1im: float32x4_t, twiddle2re: float32x4_t, twiddle2im: float32x4_t, twiddle3re: float32x4_t, twiddle3im: float32x4_t, twiddle4re: float32x4_t, twiddle4im: float32x4_t, twiddle5re: float32x4_t, twiddle5im: float32x4_t, twiddle6re: float32x4_t, twiddle6im: float32x4_t, twiddle7re: float32x4_t, twiddle7im: float32x4_t, twiddle8re: float32x4_t, twiddle8im: float32x4_t, twiddle9re: float32x4_t, twiddle9im: float32x4_t, twiddle10re: float32x4_t, twiddle10im: float32x4_t, twiddle11re: float32x4_t, twiddle11im: float32x4_t, twiddle12re: float32x4_t, twiddle12im: float32x4_t, twiddle13re: float32x4_t, twiddle13im: float32x4_t, twiddle14re: float32x4_t, twiddle14im: float32x4_t, twiddle15re: float32x4_t, twiddle15im: float32x4_t, } boilerplate_fft_neon_f32_butterfly!(NeonF32Butterfly31, 31, |this: &NeonF32Butterfly31<_>| this .direction); boilerplate_fft_neon_common_butterfly!(NeonF32Butterfly31, 31, |this: &NeonF32Butterfly31<_>| this .direction); impl NeonF32Butterfly31 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 31, direction); let tw2: Complex = twiddles::compute_twiddle(2, 31, direction); let tw3: Complex = twiddles::compute_twiddle(3, 31, direction); let tw4: Complex = twiddles::compute_twiddle(4, 31, direction); let tw5: Complex = twiddles::compute_twiddle(5, 31, direction); let tw6: Complex = twiddles::compute_twiddle(6, 31, direction); let tw7: Complex = twiddles::compute_twiddle(7, 31, direction); let tw8: Complex = twiddles::compute_twiddle(8, 31, direction); let tw9: Complex = twiddles::compute_twiddle(9, 31, direction); let tw10: Complex = twiddles::compute_twiddle(10, 31, direction); let tw11: Complex = 
twiddles::compute_twiddle(11, 31, direction); let tw12: Complex = twiddles::compute_twiddle(12, 31, direction); let tw13: Complex = twiddles::compute_twiddle(13, 31, direction); let tw14: Complex = twiddles::compute_twiddle(14, 31, direction); let tw15: Complex = twiddles::compute_twiddle(15, 31, direction); let twiddle1re = unsafe { vmovq_n_f32(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f32(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f32(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f32(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f32(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f32(tw3.im) }; let twiddle4re = unsafe { vmovq_n_f32(tw4.re) }; let twiddle4im = unsafe { vmovq_n_f32(tw4.im) }; let twiddle5re = unsafe { vmovq_n_f32(tw5.re) }; let twiddle5im = unsafe { vmovq_n_f32(tw5.im) }; let twiddle6re = unsafe { vmovq_n_f32(tw6.re) }; let twiddle6im = unsafe { vmovq_n_f32(tw6.im) }; let twiddle7re = unsafe { vmovq_n_f32(tw7.re) }; let twiddle7im = unsafe { vmovq_n_f32(tw7.im) }; let twiddle8re = unsafe { vmovq_n_f32(tw8.re) }; let twiddle8im = unsafe { vmovq_n_f32(tw8.im) }; let twiddle9re = unsafe { vmovq_n_f32(tw9.re) }; let twiddle9im = unsafe { vmovq_n_f32(tw9.im) }; let twiddle10re = unsafe { vmovq_n_f32(tw10.re) }; let twiddle10im = unsafe { vmovq_n_f32(tw10.im) }; let twiddle11re = unsafe { vmovq_n_f32(tw11.re) }; let twiddle11im = unsafe { vmovq_n_f32(tw11.im) }; let twiddle12re = unsafe { vmovq_n_f32(tw12.re) }; let twiddle12im = unsafe { vmovq_n_f32(tw12.im) }; let twiddle13re = unsafe { vmovq_n_f32(tw13.re) }; let twiddle13im = unsafe { vmovq_n_f32(tw13.im) }; let twiddle14re = unsafe { vmovq_n_f32(tw14.re) }; let twiddle14im = unsafe { vmovq_n_f32(tw14.im) }; let twiddle15re = unsafe { vmovq_n_f32(tw15.re) }; let twiddle15im = unsafe { vmovq_n_f32(tw15.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, twiddle12re, twiddle12im, twiddle13re, twiddle13im, twiddle14re, twiddle14im, twiddle15re, twiddle15im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[15]), extract_hi_lo_f32(input_packed[0], input_packed[16]), extract_lo_hi_f32(input_packed[1], input_packed[16]), extract_hi_lo_f32(input_packed[1], input_packed[17]), extract_lo_hi_f32(input_packed[2], input_packed[17]), extract_hi_lo_f32(input_packed[2], input_packed[18]), extract_lo_hi_f32(input_packed[3], input_packed[18]), extract_hi_lo_f32(input_packed[3], input_packed[19]), extract_lo_hi_f32(input_packed[4], input_packed[19]), extract_hi_lo_f32(input_packed[4], 
input_packed[20]), extract_lo_hi_f32(input_packed[5], input_packed[20]), extract_hi_lo_f32(input_packed[5], input_packed[21]), extract_lo_hi_f32(input_packed[6], input_packed[21]), extract_hi_lo_f32(input_packed[6], input_packed[22]), extract_lo_hi_f32(input_packed[7], input_packed[22]), extract_hi_lo_f32(input_packed[7], input_packed[23]), extract_lo_hi_f32(input_packed[8], input_packed[23]), extract_hi_lo_f32(input_packed[8], input_packed[24]), extract_lo_hi_f32(input_packed[9], input_packed[24]), extract_hi_lo_f32(input_packed[9], input_packed[25]), extract_lo_hi_f32(input_packed[10], input_packed[25]), extract_hi_lo_f32(input_packed[10], input_packed[26]), extract_lo_hi_f32(input_packed[11], input_packed[26]), extract_hi_lo_f32(input_packed[11], input_packed[27]), extract_lo_hi_f32(input_packed[12], input_packed[27]), extract_hi_lo_f32(input_packed[12], input_packed[28]), extract_lo_hi_f32(input_packed[13], input_packed[28]), extract_hi_lo_f32(input_packed[13], input_packed[29]), extract_lo_hi_f32(input_packed[14], input_packed[29]), extract_hi_lo_f32(input_packed[14], input_packed[30]), extract_lo_hi_f32(input_packed[15], input_packed[30]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_lo_f32(out[16], out[17]), extract_lo_lo_f32(out[18], out[19]), extract_lo_lo_f32(out[20], out[21]), extract_lo_lo_f32(out[22], out[23]), extract_lo_lo_f32(out[24], out[25]), extract_lo_lo_f32(out[26], out[27]), extract_lo_lo_f32(out[28], out[29]), extract_lo_hi_f32(out[30], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), extract_hi_hi_f32(out[19], out[20]), extract_hi_hi_f32(out[21], out[22]), extract_hi_hi_f32(out[23], out[24]), extract_hi_hi_f32(out[25], out[26]), extract_hi_hi_f32(out[27], out[28]), extract_hi_hi_f32(out[29], out[30]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [float32x4_t; 31]) -> [float32x4_t; 31] { let [x1p30, x1m30] = parallel_fft2_interleaved_f32(values[1], values[30]); let [x2p29, x2m29] = parallel_fft2_interleaved_f32(values[2], values[29]); let [x3p28, x3m28] = parallel_fft2_interleaved_f32(values[3], values[28]); let [x4p27, x4m27] = parallel_fft2_interleaved_f32(values[4], values[27]); let [x5p26, x5m26] = parallel_fft2_interleaved_f32(values[5], values[26]); let [x6p25, x6m25] = parallel_fft2_interleaved_f32(values[6], values[25]); let [x7p24, x7m24] = parallel_fft2_interleaved_f32(values[7], values[24]); let [x8p23, x8m23] = parallel_fft2_interleaved_f32(values[8], values[23]); let [x9p22, x9m22] = parallel_fft2_interleaved_f32(values[9], values[22]); let [x10p21, x10m21] = parallel_fft2_interleaved_f32(values[10], values[21]); let [x11p20, x11m20] = parallel_fft2_interleaved_f32(values[11], values[20]); let [x12p19, x12m19] = 
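        // The statements around this point implement the usual prime-length butterfly:
        // each input j (j = 1..=15) is paired with input 31-j by a 2-point FFT, giving a
        // symmetric part x{j}p{31-j} = x[j] + x[31-j] and an antisymmetric part
        // x{j}m{31-j} = x[j] - x[31-j]. The t_a*_* products below weight the symmetric
        // parts with the real parts of the twiddles, and the t_b*_* products weight the
        // antisymmetric parts with the imaginary parts. In scalar form, with
        // W = exp(-2*pi*i/31) (conjugated for the inverse direction), for k = 1..=15:
        //
        //     y[k]      = x[0] + sum_j( re(W^(j*k)) * (x[j] + x[31-j]) )
        //                      + i * sum_j( im(W^(j*k)) * (x[j] - x[31-j]) )
        //     y[31 - k] = x[0] + sum_j( re(W^(j*k)) * (x[j] + x[31-j]) )
        //                      - i * sum_j( im(W^(j*k)) * (x[j] - x[31-j]) )
        //
        // which is what the t_a / t_b accumulation, the 90-degree rotation, and the final
        // 2-point FFTs compute with vector intrinsics.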
parallel_fft2_interleaved_f32(values[12], values[19]); let [x13p18, x13m18] = parallel_fft2_interleaved_f32(values[13], values[18]); let [x14p17, x14m17] = parallel_fft2_interleaved_f32(values[14], values[17]); let [x15p16, x15m16] = parallel_fft2_interleaved_f32(values[15], values[16]); let t_a1_1 = vmulq_f32(self.twiddle1re, x1p30); let t_a1_2 = vmulq_f32(self.twiddle2re, x2p29); let t_a1_3 = vmulq_f32(self.twiddle3re, x3p28); let t_a1_4 = vmulq_f32(self.twiddle4re, x4p27); let t_a1_5 = vmulq_f32(self.twiddle5re, x5p26); let t_a1_6 = vmulq_f32(self.twiddle6re, x6p25); let t_a1_7 = vmulq_f32(self.twiddle7re, x7p24); let t_a1_8 = vmulq_f32(self.twiddle8re, x8p23); let t_a1_9 = vmulq_f32(self.twiddle9re, x9p22); let t_a1_10 = vmulq_f32(self.twiddle10re, x10p21); let t_a1_11 = vmulq_f32(self.twiddle11re, x11p20); let t_a1_12 = vmulq_f32(self.twiddle12re, x12p19); let t_a1_13 = vmulq_f32(self.twiddle13re, x13p18); let t_a1_14 = vmulq_f32(self.twiddle14re, x14p17); let t_a1_15 = vmulq_f32(self.twiddle15re, x15p16); let t_a2_1 = vmulq_f32(self.twiddle2re, x1p30); let t_a2_2 = vmulq_f32(self.twiddle4re, x2p29); let t_a2_3 = vmulq_f32(self.twiddle6re, x3p28); let t_a2_4 = vmulq_f32(self.twiddle8re, x4p27); let t_a2_5 = vmulq_f32(self.twiddle10re, x5p26); let t_a2_6 = vmulq_f32(self.twiddle12re, x6p25); let t_a2_7 = vmulq_f32(self.twiddle14re, x7p24); let t_a2_8 = vmulq_f32(self.twiddle15re, x8p23); let t_a2_9 = vmulq_f32(self.twiddle13re, x9p22); let t_a2_10 = vmulq_f32(self.twiddle11re, x10p21); let t_a2_11 = vmulq_f32(self.twiddle9re, x11p20); let t_a2_12 = vmulq_f32(self.twiddle7re, x12p19); let t_a2_13 = vmulq_f32(self.twiddle5re, x13p18); let t_a2_14 = vmulq_f32(self.twiddle3re, x14p17); let t_a2_15 = vmulq_f32(self.twiddle1re, x15p16); let t_a3_1 = vmulq_f32(self.twiddle3re, x1p30); let t_a3_2 = vmulq_f32(self.twiddle6re, x2p29); let t_a3_3 = vmulq_f32(self.twiddle9re, x3p28); let t_a3_4 = vmulq_f32(self.twiddle12re, x4p27); let t_a3_5 = vmulq_f32(self.twiddle15re, x5p26); let t_a3_6 = vmulq_f32(self.twiddle13re, x6p25); let t_a3_7 = vmulq_f32(self.twiddle10re, x7p24); let t_a3_8 = vmulq_f32(self.twiddle7re, x8p23); let t_a3_9 = vmulq_f32(self.twiddle4re, x9p22); let t_a3_10 = vmulq_f32(self.twiddle1re, x10p21); let t_a3_11 = vmulq_f32(self.twiddle2re, x11p20); let t_a3_12 = vmulq_f32(self.twiddle5re, x12p19); let t_a3_13 = vmulq_f32(self.twiddle8re, x13p18); let t_a3_14 = vmulq_f32(self.twiddle11re, x14p17); let t_a3_15 = vmulq_f32(self.twiddle14re, x15p16); let t_a4_1 = vmulq_f32(self.twiddle4re, x1p30); let t_a4_2 = vmulq_f32(self.twiddle8re, x2p29); let t_a4_3 = vmulq_f32(self.twiddle12re, x3p28); let t_a4_4 = vmulq_f32(self.twiddle15re, x4p27); let t_a4_5 = vmulq_f32(self.twiddle11re, x5p26); let t_a4_6 = vmulq_f32(self.twiddle7re, x6p25); let t_a4_7 = vmulq_f32(self.twiddle3re, x7p24); let t_a4_8 = vmulq_f32(self.twiddle1re, x8p23); let t_a4_9 = vmulq_f32(self.twiddle5re, x9p22); let t_a4_10 = vmulq_f32(self.twiddle9re, x10p21); let t_a4_11 = vmulq_f32(self.twiddle13re, x11p20); let t_a4_12 = vmulq_f32(self.twiddle14re, x12p19); let t_a4_13 = vmulq_f32(self.twiddle10re, x13p18); let t_a4_14 = vmulq_f32(self.twiddle6re, x14p17); let t_a4_15 = vmulq_f32(self.twiddle2re, x15p16); let t_a5_1 = vmulq_f32(self.twiddle5re, x1p30); let t_a5_2 = vmulq_f32(self.twiddle10re, x2p29); let t_a5_3 = vmulq_f32(self.twiddle15re, x3p28); let t_a5_4 = vmulq_f32(self.twiddle11re, x4p27); let t_a5_5 = vmulq_f32(self.twiddle6re, x5p26); let t_a5_6 = vmulq_f32(self.twiddle1re, x6p25); let t_a5_7 = 
vmulq_f32(self.twiddle4re, x7p24); let t_a5_8 = vmulq_f32(self.twiddle9re, x8p23); let t_a5_9 = vmulq_f32(self.twiddle14re, x9p22); let t_a5_10 = vmulq_f32(self.twiddle12re, x10p21); let t_a5_11 = vmulq_f32(self.twiddle7re, x11p20); let t_a5_12 = vmulq_f32(self.twiddle2re, x12p19); let t_a5_13 = vmulq_f32(self.twiddle3re, x13p18); let t_a5_14 = vmulq_f32(self.twiddle8re, x14p17); let t_a5_15 = vmulq_f32(self.twiddle13re, x15p16); let t_a6_1 = vmulq_f32(self.twiddle6re, x1p30); let t_a6_2 = vmulq_f32(self.twiddle12re, x2p29); let t_a6_3 = vmulq_f32(self.twiddle13re, x3p28); let t_a6_4 = vmulq_f32(self.twiddle7re, x4p27); let t_a6_5 = vmulq_f32(self.twiddle1re, x5p26); let t_a6_6 = vmulq_f32(self.twiddle5re, x6p25); let t_a6_7 = vmulq_f32(self.twiddle11re, x7p24); let t_a6_8 = vmulq_f32(self.twiddle14re, x8p23); let t_a6_9 = vmulq_f32(self.twiddle8re, x9p22); let t_a6_10 = vmulq_f32(self.twiddle2re, x10p21); let t_a6_11 = vmulq_f32(self.twiddle4re, x11p20); let t_a6_12 = vmulq_f32(self.twiddle10re, x12p19); let t_a6_13 = vmulq_f32(self.twiddle15re, x13p18); let t_a6_14 = vmulq_f32(self.twiddle9re, x14p17); let t_a6_15 = vmulq_f32(self.twiddle3re, x15p16); let t_a7_1 = vmulq_f32(self.twiddle7re, x1p30); let t_a7_2 = vmulq_f32(self.twiddle14re, x2p29); let t_a7_3 = vmulq_f32(self.twiddle10re, x3p28); let t_a7_4 = vmulq_f32(self.twiddle3re, x4p27); let t_a7_5 = vmulq_f32(self.twiddle4re, x5p26); let t_a7_6 = vmulq_f32(self.twiddle11re, x6p25); let t_a7_7 = vmulq_f32(self.twiddle13re, x7p24); let t_a7_8 = vmulq_f32(self.twiddle6re, x8p23); let t_a7_9 = vmulq_f32(self.twiddle1re, x9p22); let t_a7_10 = vmulq_f32(self.twiddle8re, x10p21); let t_a7_11 = vmulq_f32(self.twiddle15re, x11p20); let t_a7_12 = vmulq_f32(self.twiddle9re, x12p19); let t_a7_13 = vmulq_f32(self.twiddle2re, x13p18); let t_a7_14 = vmulq_f32(self.twiddle5re, x14p17); let t_a7_15 = vmulq_f32(self.twiddle12re, x15p16); let t_a8_1 = vmulq_f32(self.twiddle8re, x1p30); let t_a8_2 = vmulq_f32(self.twiddle15re, x2p29); let t_a8_3 = vmulq_f32(self.twiddle7re, x3p28); let t_a8_4 = vmulq_f32(self.twiddle1re, x4p27); let t_a8_5 = vmulq_f32(self.twiddle9re, x5p26); let t_a8_6 = vmulq_f32(self.twiddle14re, x6p25); let t_a8_7 = vmulq_f32(self.twiddle6re, x7p24); let t_a8_8 = vmulq_f32(self.twiddle2re, x8p23); let t_a8_9 = vmulq_f32(self.twiddle10re, x9p22); let t_a8_10 = vmulq_f32(self.twiddle13re, x10p21); let t_a8_11 = vmulq_f32(self.twiddle5re, x11p20); let t_a8_12 = vmulq_f32(self.twiddle3re, x12p19); let t_a8_13 = vmulq_f32(self.twiddle11re, x13p18); let t_a8_14 = vmulq_f32(self.twiddle12re, x14p17); let t_a8_15 = vmulq_f32(self.twiddle4re, x15p16); let t_a9_1 = vmulq_f32(self.twiddle9re, x1p30); let t_a9_2 = vmulq_f32(self.twiddle13re, x2p29); let t_a9_3 = vmulq_f32(self.twiddle4re, x3p28); let t_a9_4 = vmulq_f32(self.twiddle5re, x4p27); let t_a9_5 = vmulq_f32(self.twiddle14re, x5p26); let t_a9_6 = vmulq_f32(self.twiddle8re, x6p25); let t_a9_7 = vmulq_f32(self.twiddle1re, x7p24); let t_a9_8 = vmulq_f32(self.twiddle10re, x8p23); let t_a9_9 = vmulq_f32(self.twiddle12re, x9p22); let t_a9_10 = vmulq_f32(self.twiddle3re, x10p21); let t_a9_11 = vmulq_f32(self.twiddle6re, x11p20); let t_a9_12 = vmulq_f32(self.twiddle15re, x12p19); let t_a9_13 = vmulq_f32(self.twiddle7re, x13p18); let t_a9_14 = vmulq_f32(self.twiddle2re, x14p17); let t_a9_15 = vmulq_f32(self.twiddle11re, x15p16); let t_a10_1 = vmulq_f32(self.twiddle10re, x1p30); let t_a10_2 = vmulq_f32(self.twiddle11re, x2p29); let t_a10_3 = vmulq_f32(self.twiddle1re, x3p28); let t_a10_4 = 
vmulq_f32(self.twiddle9re, x4p27); let t_a10_5 = vmulq_f32(self.twiddle12re, x5p26); let t_a10_6 = vmulq_f32(self.twiddle2re, x6p25); let t_a10_7 = vmulq_f32(self.twiddle8re, x7p24); let t_a10_8 = vmulq_f32(self.twiddle13re, x8p23); let t_a10_9 = vmulq_f32(self.twiddle3re, x9p22); let t_a10_10 = vmulq_f32(self.twiddle7re, x10p21); let t_a10_11 = vmulq_f32(self.twiddle14re, x11p20); let t_a10_12 = vmulq_f32(self.twiddle4re, x12p19); let t_a10_13 = vmulq_f32(self.twiddle6re, x13p18); let t_a10_14 = vmulq_f32(self.twiddle15re, x14p17); let t_a10_15 = vmulq_f32(self.twiddle5re, x15p16); let t_a11_1 = vmulq_f32(self.twiddle11re, x1p30); let t_a11_2 = vmulq_f32(self.twiddle9re, x2p29); let t_a11_3 = vmulq_f32(self.twiddle2re, x3p28); let t_a11_4 = vmulq_f32(self.twiddle13re, x4p27); let t_a11_5 = vmulq_f32(self.twiddle7re, x5p26); let t_a11_6 = vmulq_f32(self.twiddle4re, x6p25); let t_a11_7 = vmulq_f32(self.twiddle15re, x7p24); let t_a11_8 = vmulq_f32(self.twiddle5re, x8p23); let t_a11_9 = vmulq_f32(self.twiddle6re, x9p22); let t_a11_10 = vmulq_f32(self.twiddle14re, x10p21); let t_a11_11 = vmulq_f32(self.twiddle3re, x11p20); let t_a11_12 = vmulq_f32(self.twiddle8re, x12p19); let t_a11_13 = vmulq_f32(self.twiddle12re, x13p18); let t_a11_14 = vmulq_f32(self.twiddle1re, x14p17); let t_a11_15 = vmulq_f32(self.twiddle10re, x15p16); let t_a12_1 = vmulq_f32(self.twiddle12re, x1p30); let t_a12_2 = vmulq_f32(self.twiddle7re, x2p29); let t_a12_3 = vmulq_f32(self.twiddle5re, x3p28); let t_a12_4 = vmulq_f32(self.twiddle14re, x4p27); let t_a12_5 = vmulq_f32(self.twiddle2re, x5p26); let t_a12_6 = vmulq_f32(self.twiddle10re, x6p25); let t_a12_7 = vmulq_f32(self.twiddle9re, x7p24); let t_a12_8 = vmulq_f32(self.twiddle3re, x8p23); let t_a12_9 = vmulq_f32(self.twiddle15re, x9p22); let t_a12_10 = vmulq_f32(self.twiddle4re, x10p21); let t_a12_11 = vmulq_f32(self.twiddle8re, x11p20); let t_a12_12 = vmulq_f32(self.twiddle11re, x12p19); let t_a12_13 = vmulq_f32(self.twiddle1re, x13p18); let t_a12_14 = vmulq_f32(self.twiddle13re, x14p17); let t_a12_15 = vmulq_f32(self.twiddle6re, x15p16); let t_a13_1 = vmulq_f32(self.twiddle13re, x1p30); let t_a13_2 = vmulq_f32(self.twiddle5re, x2p29); let t_a13_3 = vmulq_f32(self.twiddle8re, x3p28); let t_a13_4 = vmulq_f32(self.twiddle10re, x4p27); let t_a13_5 = vmulq_f32(self.twiddle3re, x5p26); let t_a13_6 = vmulq_f32(self.twiddle15re, x6p25); let t_a13_7 = vmulq_f32(self.twiddle2re, x7p24); let t_a13_8 = vmulq_f32(self.twiddle11re, x8p23); let t_a13_9 = vmulq_f32(self.twiddle7re, x9p22); let t_a13_10 = vmulq_f32(self.twiddle6re, x10p21); let t_a13_11 = vmulq_f32(self.twiddle12re, x11p20); let t_a13_12 = vmulq_f32(self.twiddle1re, x12p19); let t_a13_13 = vmulq_f32(self.twiddle14re, x13p18); let t_a13_14 = vmulq_f32(self.twiddle4re, x14p17); let t_a13_15 = vmulq_f32(self.twiddle9re, x15p16); let t_a14_1 = vmulq_f32(self.twiddle14re, x1p30); let t_a14_2 = vmulq_f32(self.twiddle3re, x2p29); let t_a14_3 = vmulq_f32(self.twiddle11re, x3p28); let t_a14_4 = vmulq_f32(self.twiddle6re, x4p27); let t_a14_5 = vmulq_f32(self.twiddle8re, x5p26); let t_a14_6 = vmulq_f32(self.twiddle9re, x6p25); let t_a14_7 = vmulq_f32(self.twiddle5re, x7p24); let t_a14_8 = vmulq_f32(self.twiddle12re, x8p23); let t_a14_9 = vmulq_f32(self.twiddle2re, x9p22); let t_a14_10 = vmulq_f32(self.twiddle15re, x10p21); let t_a14_11 = vmulq_f32(self.twiddle1re, x11p20); let t_a14_12 = vmulq_f32(self.twiddle13re, x12p19); let t_a14_13 = vmulq_f32(self.twiddle4re, x13p18); let t_a14_14 = vmulq_f32(self.twiddle10re, x14p17); let 
t_a14_15 = vmulq_f32(self.twiddle7re, x15p16); let t_a15_1 = vmulq_f32(self.twiddle15re, x1p30); let t_a15_2 = vmulq_f32(self.twiddle1re, x2p29); let t_a15_3 = vmulq_f32(self.twiddle14re, x3p28); let t_a15_4 = vmulq_f32(self.twiddle2re, x4p27); let t_a15_5 = vmulq_f32(self.twiddle13re, x5p26); let t_a15_6 = vmulq_f32(self.twiddle3re, x6p25); let t_a15_7 = vmulq_f32(self.twiddle12re, x7p24); let t_a15_8 = vmulq_f32(self.twiddle4re, x8p23); let t_a15_9 = vmulq_f32(self.twiddle11re, x9p22); let t_a15_10 = vmulq_f32(self.twiddle5re, x10p21); let t_a15_11 = vmulq_f32(self.twiddle10re, x11p20); let t_a15_12 = vmulq_f32(self.twiddle6re, x12p19); let t_a15_13 = vmulq_f32(self.twiddle9re, x13p18); let t_a15_14 = vmulq_f32(self.twiddle7re, x14p17); let t_a15_15 = vmulq_f32(self.twiddle8re, x15p16); let t_b1_1 = vmulq_f32(self.twiddle1im, x1m30); let t_b1_2 = vmulq_f32(self.twiddle2im, x2m29); let t_b1_3 = vmulq_f32(self.twiddle3im, x3m28); let t_b1_4 = vmulq_f32(self.twiddle4im, x4m27); let t_b1_5 = vmulq_f32(self.twiddle5im, x5m26); let t_b1_6 = vmulq_f32(self.twiddle6im, x6m25); let t_b1_7 = vmulq_f32(self.twiddle7im, x7m24); let t_b1_8 = vmulq_f32(self.twiddle8im, x8m23); let t_b1_9 = vmulq_f32(self.twiddle9im, x9m22); let t_b1_10 = vmulq_f32(self.twiddle10im, x10m21); let t_b1_11 = vmulq_f32(self.twiddle11im, x11m20); let t_b1_12 = vmulq_f32(self.twiddle12im, x12m19); let t_b1_13 = vmulq_f32(self.twiddle13im, x13m18); let t_b1_14 = vmulq_f32(self.twiddle14im, x14m17); let t_b1_15 = vmulq_f32(self.twiddle15im, x15m16); let t_b2_1 = vmulq_f32(self.twiddle2im, x1m30); let t_b2_2 = vmulq_f32(self.twiddle4im, x2m29); let t_b2_3 = vmulq_f32(self.twiddle6im, x3m28); let t_b2_4 = vmulq_f32(self.twiddle8im, x4m27); let t_b2_5 = vmulq_f32(self.twiddle10im, x5m26); let t_b2_6 = vmulq_f32(self.twiddle12im, x6m25); let t_b2_7 = vmulq_f32(self.twiddle14im, x7m24); let t_b2_8 = vmulq_f32(self.twiddle15im, x8m23); let t_b2_9 = vmulq_f32(self.twiddle13im, x9m22); let t_b2_10 = vmulq_f32(self.twiddle11im, x10m21); let t_b2_11 = vmulq_f32(self.twiddle9im, x11m20); let t_b2_12 = vmulq_f32(self.twiddle7im, x12m19); let t_b2_13 = vmulq_f32(self.twiddle5im, x13m18); let t_b2_14 = vmulq_f32(self.twiddle3im, x14m17); let t_b2_15 = vmulq_f32(self.twiddle1im, x15m16); let t_b3_1 = vmulq_f32(self.twiddle3im, x1m30); let t_b3_2 = vmulq_f32(self.twiddle6im, x2m29); let t_b3_3 = vmulq_f32(self.twiddle9im, x3m28); let t_b3_4 = vmulq_f32(self.twiddle12im, x4m27); let t_b3_5 = vmulq_f32(self.twiddle15im, x5m26); let t_b3_6 = vmulq_f32(self.twiddle13im, x6m25); let t_b3_7 = vmulq_f32(self.twiddle10im, x7m24); let t_b3_8 = vmulq_f32(self.twiddle7im, x8m23); let t_b3_9 = vmulq_f32(self.twiddle4im, x9m22); let t_b3_10 = vmulq_f32(self.twiddle1im, x10m21); let t_b3_11 = vmulq_f32(self.twiddle2im, x11m20); let t_b3_12 = vmulq_f32(self.twiddle5im, x12m19); let t_b3_13 = vmulq_f32(self.twiddle8im, x13m18); let t_b3_14 = vmulq_f32(self.twiddle11im, x14m17); let t_b3_15 = vmulq_f32(self.twiddle14im, x15m16); let t_b4_1 = vmulq_f32(self.twiddle4im, x1m30); let t_b4_2 = vmulq_f32(self.twiddle8im, x2m29); let t_b4_3 = vmulq_f32(self.twiddle12im, x3m28); let t_b4_4 = vmulq_f32(self.twiddle15im, x4m27); let t_b4_5 = vmulq_f32(self.twiddle11im, x5m26); let t_b4_6 = vmulq_f32(self.twiddle7im, x6m25); let t_b4_7 = vmulq_f32(self.twiddle3im, x7m24); let t_b4_8 = vmulq_f32(self.twiddle1im, x8m23); let t_b4_9 = vmulq_f32(self.twiddle5im, x9m22); let t_b4_10 = vmulq_f32(self.twiddle9im, x10m21); let t_b4_11 = vmulq_f32(self.twiddle13im, x11m20); let 
t_b4_12 = vmulq_f32(self.twiddle14im, x12m19); let t_b4_13 = vmulq_f32(self.twiddle10im, x13m18); let t_b4_14 = vmulq_f32(self.twiddle6im, x14m17); let t_b4_15 = vmulq_f32(self.twiddle2im, x15m16); let t_b5_1 = vmulq_f32(self.twiddle5im, x1m30); let t_b5_2 = vmulq_f32(self.twiddle10im, x2m29); let t_b5_3 = vmulq_f32(self.twiddle15im, x3m28); let t_b5_4 = vmulq_f32(self.twiddle11im, x4m27); let t_b5_5 = vmulq_f32(self.twiddle6im, x5m26); let t_b5_6 = vmulq_f32(self.twiddle1im, x6m25); let t_b5_7 = vmulq_f32(self.twiddle4im, x7m24); let t_b5_8 = vmulq_f32(self.twiddle9im, x8m23); let t_b5_9 = vmulq_f32(self.twiddle14im, x9m22); let t_b5_10 = vmulq_f32(self.twiddle12im, x10m21); let t_b5_11 = vmulq_f32(self.twiddle7im, x11m20); let t_b5_12 = vmulq_f32(self.twiddle2im, x12m19); let t_b5_13 = vmulq_f32(self.twiddle3im, x13m18); let t_b5_14 = vmulq_f32(self.twiddle8im, x14m17); let t_b5_15 = vmulq_f32(self.twiddle13im, x15m16); let t_b6_1 = vmulq_f32(self.twiddle6im, x1m30); let t_b6_2 = vmulq_f32(self.twiddle12im, x2m29); let t_b6_3 = vmulq_f32(self.twiddle13im, x3m28); let t_b6_4 = vmulq_f32(self.twiddle7im, x4m27); let t_b6_5 = vmulq_f32(self.twiddle1im, x5m26); let t_b6_6 = vmulq_f32(self.twiddle5im, x6m25); let t_b6_7 = vmulq_f32(self.twiddle11im, x7m24); let t_b6_8 = vmulq_f32(self.twiddle14im, x8m23); let t_b6_9 = vmulq_f32(self.twiddle8im, x9m22); let t_b6_10 = vmulq_f32(self.twiddle2im, x10m21); let t_b6_11 = vmulq_f32(self.twiddle4im, x11m20); let t_b6_12 = vmulq_f32(self.twiddle10im, x12m19); let t_b6_13 = vmulq_f32(self.twiddle15im, x13m18); let t_b6_14 = vmulq_f32(self.twiddle9im, x14m17); let t_b6_15 = vmulq_f32(self.twiddle3im, x15m16); let t_b7_1 = vmulq_f32(self.twiddle7im, x1m30); let t_b7_2 = vmulq_f32(self.twiddle14im, x2m29); let t_b7_3 = vmulq_f32(self.twiddle10im, x3m28); let t_b7_4 = vmulq_f32(self.twiddle3im, x4m27); let t_b7_5 = vmulq_f32(self.twiddle4im, x5m26); let t_b7_6 = vmulq_f32(self.twiddle11im, x6m25); let t_b7_7 = vmulq_f32(self.twiddle13im, x7m24); let t_b7_8 = vmulq_f32(self.twiddle6im, x8m23); let t_b7_9 = vmulq_f32(self.twiddle1im, x9m22); let t_b7_10 = vmulq_f32(self.twiddle8im, x10m21); let t_b7_11 = vmulq_f32(self.twiddle15im, x11m20); let t_b7_12 = vmulq_f32(self.twiddle9im, x12m19); let t_b7_13 = vmulq_f32(self.twiddle2im, x13m18); let t_b7_14 = vmulq_f32(self.twiddle5im, x14m17); let t_b7_15 = vmulq_f32(self.twiddle12im, x15m16); let t_b8_1 = vmulq_f32(self.twiddle8im, x1m30); let t_b8_2 = vmulq_f32(self.twiddle15im, x2m29); let t_b8_3 = vmulq_f32(self.twiddle7im, x3m28); let t_b8_4 = vmulq_f32(self.twiddle1im, x4m27); let t_b8_5 = vmulq_f32(self.twiddle9im, x5m26); let t_b8_6 = vmulq_f32(self.twiddle14im, x6m25); let t_b8_7 = vmulq_f32(self.twiddle6im, x7m24); let t_b8_8 = vmulq_f32(self.twiddle2im, x8m23); let t_b8_9 = vmulq_f32(self.twiddle10im, x9m22); let t_b8_10 = vmulq_f32(self.twiddle13im, x10m21); let t_b8_11 = vmulq_f32(self.twiddle5im, x11m20); let t_b8_12 = vmulq_f32(self.twiddle3im, x12m19); let t_b8_13 = vmulq_f32(self.twiddle11im, x13m18); let t_b8_14 = vmulq_f32(self.twiddle12im, x14m17); let t_b8_15 = vmulq_f32(self.twiddle4im, x15m16); let t_b9_1 = vmulq_f32(self.twiddle9im, x1m30); let t_b9_2 = vmulq_f32(self.twiddle13im, x2m29); let t_b9_3 = vmulq_f32(self.twiddle4im, x3m28); let t_b9_4 = vmulq_f32(self.twiddle5im, x4m27); let t_b9_5 = vmulq_f32(self.twiddle14im, x5m26); let t_b9_6 = vmulq_f32(self.twiddle8im, x6m25); let t_b9_7 = vmulq_f32(self.twiddle1im, x7m24); let t_b9_8 = vmulq_f32(self.twiddle10im, x8m23); let t_b9_9 = 
vmulq_f32(self.twiddle12im, x9m22); let t_b9_10 = vmulq_f32(self.twiddle3im, x10m21); let t_b9_11 = vmulq_f32(self.twiddle6im, x11m20); let t_b9_12 = vmulq_f32(self.twiddle15im, x12m19); let t_b9_13 = vmulq_f32(self.twiddle7im, x13m18); let t_b9_14 = vmulq_f32(self.twiddle2im, x14m17); let t_b9_15 = vmulq_f32(self.twiddle11im, x15m16); let t_b10_1 = vmulq_f32(self.twiddle10im, x1m30); let t_b10_2 = vmulq_f32(self.twiddle11im, x2m29); let t_b10_3 = vmulq_f32(self.twiddle1im, x3m28); let t_b10_4 = vmulq_f32(self.twiddle9im, x4m27); let t_b10_5 = vmulq_f32(self.twiddle12im, x5m26); let t_b10_6 = vmulq_f32(self.twiddle2im, x6m25); let t_b10_7 = vmulq_f32(self.twiddle8im, x7m24); let t_b10_8 = vmulq_f32(self.twiddle13im, x8m23); let t_b10_9 = vmulq_f32(self.twiddle3im, x9m22); let t_b10_10 = vmulq_f32(self.twiddle7im, x10m21); let t_b10_11 = vmulq_f32(self.twiddle14im, x11m20); let t_b10_12 = vmulq_f32(self.twiddle4im, x12m19); let t_b10_13 = vmulq_f32(self.twiddle6im, x13m18); let t_b10_14 = vmulq_f32(self.twiddle15im, x14m17); let t_b10_15 = vmulq_f32(self.twiddle5im, x15m16); let t_b11_1 = vmulq_f32(self.twiddle11im, x1m30); let t_b11_2 = vmulq_f32(self.twiddle9im, x2m29); let t_b11_3 = vmulq_f32(self.twiddle2im, x3m28); let t_b11_4 = vmulq_f32(self.twiddle13im, x4m27); let t_b11_5 = vmulq_f32(self.twiddle7im, x5m26); let t_b11_6 = vmulq_f32(self.twiddle4im, x6m25); let t_b11_7 = vmulq_f32(self.twiddle15im, x7m24); let t_b11_8 = vmulq_f32(self.twiddle5im, x8m23); let t_b11_9 = vmulq_f32(self.twiddle6im, x9m22); let t_b11_10 = vmulq_f32(self.twiddle14im, x10m21); let t_b11_11 = vmulq_f32(self.twiddle3im, x11m20); let t_b11_12 = vmulq_f32(self.twiddle8im, x12m19); let t_b11_13 = vmulq_f32(self.twiddle12im, x13m18); let t_b11_14 = vmulq_f32(self.twiddle1im, x14m17); let t_b11_15 = vmulq_f32(self.twiddle10im, x15m16); let t_b12_1 = vmulq_f32(self.twiddle12im, x1m30); let t_b12_2 = vmulq_f32(self.twiddle7im, x2m29); let t_b12_3 = vmulq_f32(self.twiddle5im, x3m28); let t_b12_4 = vmulq_f32(self.twiddle14im, x4m27); let t_b12_5 = vmulq_f32(self.twiddle2im, x5m26); let t_b12_6 = vmulq_f32(self.twiddle10im, x6m25); let t_b12_7 = vmulq_f32(self.twiddle9im, x7m24); let t_b12_8 = vmulq_f32(self.twiddle3im, x8m23); let t_b12_9 = vmulq_f32(self.twiddle15im, x9m22); let t_b12_10 = vmulq_f32(self.twiddle4im, x10m21); let t_b12_11 = vmulq_f32(self.twiddle8im, x11m20); let t_b12_12 = vmulq_f32(self.twiddle11im, x12m19); let t_b12_13 = vmulq_f32(self.twiddle1im, x13m18); let t_b12_14 = vmulq_f32(self.twiddle13im, x14m17); let t_b12_15 = vmulq_f32(self.twiddle6im, x15m16); let t_b13_1 = vmulq_f32(self.twiddle13im, x1m30); let t_b13_2 = vmulq_f32(self.twiddle5im, x2m29); let t_b13_3 = vmulq_f32(self.twiddle8im, x3m28); let t_b13_4 = vmulq_f32(self.twiddle10im, x4m27); let t_b13_5 = vmulq_f32(self.twiddle3im, x5m26); let t_b13_6 = vmulq_f32(self.twiddle15im, x6m25); let t_b13_7 = vmulq_f32(self.twiddle2im, x7m24); let t_b13_8 = vmulq_f32(self.twiddle11im, x8m23); let t_b13_9 = vmulq_f32(self.twiddle7im, x9m22); let t_b13_10 = vmulq_f32(self.twiddle6im, x10m21); let t_b13_11 = vmulq_f32(self.twiddle12im, x11m20); let t_b13_12 = vmulq_f32(self.twiddle1im, x12m19); let t_b13_13 = vmulq_f32(self.twiddle14im, x13m18); let t_b13_14 = vmulq_f32(self.twiddle4im, x14m17); let t_b13_15 = vmulq_f32(self.twiddle9im, x15m16); let t_b14_1 = vmulq_f32(self.twiddle14im, x1m30); let t_b14_2 = vmulq_f32(self.twiddle3im, x2m29); let t_b14_3 = vmulq_f32(self.twiddle11im, x3m28); let t_b14_4 = vmulq_f32(self.twiddle6im, x4m27); let 
t_b14_5 = vmulq_f32(self.twiddle8im, x5m26); let t_b14_6 = vmulq_f32(self.twiddle9im, x6m25); let t_b14_7 = vmulq_f32(self.twiddle5im, x7m24); let t_b14_8 = vmulq_f32(self.twiddle12im, x8m23); let t_b14_9 = vmulq_f32(self.twiddle2im, x9m22); let t_b14_10 = vmulq_f32(self.twiddle15im, x10m21); let t_b14_11 = vmulq_f32(self.twiddle1im, x11m20); let t_b14_12 = vmulq_f32(self.twiddle13im, x12m19); let t_b14_13 = vmulq_f32(self.twiddle4im, x13m18); let t_b14_14 = vmulq_f32(self.twiddle10im, x14m17); let t_b14_15 = vmulq_f32(self.twiddle7im, x15m16); let t_b15_1 = vmulq_f32(self.twiddle15im, x1m30); let t_b15_2 = vmulq_f32(self.twiddle1im, x2m29); let t_b15_3 = vmulq_f32(self.twiddle14im, x3m28); let t_b15_4 = vmulq_f32(self.twiddle2im, x4m27); let t_b15_5 = vmulq_f32(self.twiddle13im, x5m26); let t_b15_6 = vmulq_f32(self.twiddle3im, x6m25); let t_b15_7 = vmulq_f32(self.twiddle12im, x7m24); let t_b15_8 = vmulq_f32(self.twiddle4im, x8m23); let t_b15_9 = vmulq_f32(self.twiddle11im, x9m22); let t_b15_10 = vmulq_f32(self.twiddle5im, x10m21); let t_b15_11 = vmulq_f32(self.twiddle10im, x11m20); let t_b15_12 = vmulq_f32(self.twiddle6im, x12m19); let t_b15_13 = vmulq_f32(self.twiddle9im, x13m18); let t_b15_14 = vmulq_f32(self.twiddle7im, x14m17); let t_b15_15 = vmulq_f32(self.twiddle8im, x15m16); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14 + t_a1_15); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14 + t_a2_15); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14 + t_a3_15); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14 + t_a4_15); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14 + t_a5_15); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14 + t_a6_15); let t_a7 = calc_f32!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14 + t_a7_15); let t_a8 = calc_f32!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14 + t_a8_15); let t_a9 = calc_f32!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14 + t_a9_15); let t_a10 = calc_f32!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14 + t_a10_15); let t_a11 = calc_f32!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14 + t_a11_15); let t_a12 = calc_f32!(x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14 + t_a12_15); let t_a13 = calc_f32!(x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14 + 
t_a13_15); let t_a14 = calc_f32!(x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14 + t_a14_15); let t_a15 = calc_f32!(x0 + t_a15_1 + t_a15_2 + t_a15_3 + t_a15_4 + t_a15_5 + t_a15_6 + t_a15_7 + t_a15_8 + t_a15_9 + t_a15_10 + t_a15_11 + t_a15_12 + t_a15_13 + t_a15_14 + t_a15_15); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14 + t_b1_15); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14 - t_b2_15); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 + t_b3_3 + t_b3_4 + t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 - t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14 + t_b3_15); let t_b4 = calc_f32!(t_b4_1 + t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + t_b4_9 + t_b4_10 + t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14 - t_b4_15); let t_b5 = calc_f32!(t_b5_1 + t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8 + t_b5_9 - t_b5_10 - t_b5_11 - t_b5_12 + t_b5_13 + t_b5_14 + t_b5_15); let t_b6 = calc_f32!(t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 - t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14 - t_b6_15); let t_b7 = calc_f32!(t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 + t_b7_11 - t_b7_12 - t_b7_13 + t_b7_14 + t_b7_15); let t_b8 = calc_f32!(t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 - t_b8_11 + t_b8_12 + t_b8_13 - t_b8_14 - t_b8_15); let t_b9 = calc_f32!(t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11 + t_b9_12 - t_b9_13 + t_b9_14 + t_b9_15); let t_b10 = calc_f32!(t_b10_1 - t_b10_2 - t_b10_3 + t_b10_4 - t_b10_5 - t_b10_6 + t_b10_7 - t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11 - t_b10_12 + t_b10_13 - t_b10_14 - t_b10_15); let t_b11 = calc_f32!(t_b11_1 - t_b11_2 + t_b11_3 + t_b11_4 - t_b11_5 + t_b11_6 + t_b11_7 - t_b11_8 + t_b11_9 - t_b11_10 - t_b11_11 + t_b11_12 - t_b11_13 - t_b11_14 + t_b11_15); let t_b12 = calc_f32!(t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 - t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 + t_b12_9 - t_b12_10 + t_b12_11 - t_b12_12 + t_b12_13 + t_b12_14 - t_b12_15); let t_b13 = calc_f32!(t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 - t_b13_7 + t_b13_8 - t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 + t_b13_13 - t_b13_14 + t_b13_15); let t_b14 = calc_f32!(t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 - t_b14_11 + t_b14_12 - t_b14_13 + t_b14_14 - t_b14_15); let t_b15 = calc_f32!(t_b15_1 - t_b15_2 + t_b15_3 - t_b15_4 + t_b15_5 - t_b15_6 + t_b15_7 - t_b15_8 + t_b15_9 - t_b15_10 + t_b15_11 - t_b15_12 + t_b15_13 - t_b15_14 + t_b15_15); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let t_b9_rot = self.rotate.rotate_both(t_b9); let t_b10_rot = self.rotate.rotate_both(t_b10); let t_b11_rot = self.rotate.rotate_both(t_b11); let t_b12_rot = self.rotate.rotate_both(t_b12); let t_b13_rot = self.rotate.rotate_both(t_b13); let t_b14_rot = 
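        // The t_b sums are real-coefficient combinations of the antisymmetric parts; rotating
        // them by 90 degrees in the complex plane (rotate_both multiplies each packed complex
        // value by +/-i) turns them into the imaginary contribution. The
        // parallel_fft2_interleaved_f32 calls that follow recombine each pair as
        // y[k] = t_a +/- i*t_b and y[31-k] = t_a -/+ i*t_b, the sign being fixed by the
        // Rotate90F32 setup, with the transform direction already baked into the twiddles.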
self.rotate.rotate_both(t_b14);
        let t_b15_rot = self.rotate.rotate_both(t_b15);
        let y0 = calc_f32!(x0 + x1p30 + x2p29 + x3p28 + x4p27 + x5p26 + x6p25 + x7p24 + x8p23 + x9p22 + x10p21 + x11p20 + x12p19 + x13p18 + x14p17 + x15p16);
        let [y1, y30] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot);
        let [y2, y29] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot);
        let [y3, y28] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot);
        let [y4, y27] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot);
        let [y5, y26] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot);
        let [y6, y25] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot);
        let [y7, y24] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot);
        let [y8, y23] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot);
        let [y9, y22] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot);
        let [y10, y21] = parallel_fft2_interleaved_f32(t_a10, t_b10_rot);
        let [y11, y20] = parallel_fft2_interleaved_f32(t_a11, t_b11_rot);
        let [y12, y19] = parallel_fft2_interleaved_f32(t_a12, t_b12_rot);
        let [y13, y18] = parallel_fft2_interleaved_f32(t_a13, t_b13_rot);
        let [y14, y17] = parallel_fft2_interleaved_f32(t_a14, t_b14_rot);
        let [y15, y16] = parallel_fft2_interleaved_f32(t_a15, t_b15_rot);
        [
            y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17,
            y18, y19, y20, y21, y22, y23, y24, y25, y26, y27, y28, y29, y30,
        ]
    }
}

// _____ _ __ _ _ _ _ _
// |___ // | / /_ | || | | |__ (_) |_
// |_ \| | _____ | '_ \| || |_| '_ \| | __|
// ___) | | |_____| | (_) |__ _| |_) | | |_
// |____/|_| \___/ |_| |_.__/|_|\__|
//

pub struct NeonF64Butterfly31<T> {
    direction: FftDirection,
    _phantom: std::marker::PhantomData<T>,
    rotate: Rotate90F64,
    twiddle1re: float64x2_t,
    twiddle1im: float64x2_t,
    twiddle2re: float64x2_t,
    twiddle2im: float64x2_t,
    twiddle3re: float64x2_t,
    twiddle3im: float64x2_t,
    twiddle4re: float64x2_t,
    twiddle4im: float64x2_t,
    twiddle5re: float64x2_t,
    twiddle5im: float64x2_t,
    twiddle6re: float64x2_t,
    twiddle6im: float64x2_t,
    twiddle7re: float64x2_t,
    twiddle7im: float64x2_t,
    twiddle8re: float64x2_t,
    twiddle8im: float64x2_t,
    twiddle9re: float64x2_t,
    twiddle9im: float64x2_t,
    twiddle10re: float64x2_t,
    twiddle10im: float64x2_t,
    twiddle11re: float64x2_t,
    twiddle11im: float64x2_t,
    twiddle12re: float64x2_t,
    twiddle12im: float64x2_t,
    twiddle13re: float64x2_t,
    twiddle13im: float64x2_t,
    twiddle14re: float64x2_t,
    twiddle14im: float64x2_t,
    twiddle15re: float64x2_t,
    twiddle15im: float64x2_t,
}

boilerplate_fft_neon_f64_butterfly!(NeonF64Butterfly31, 31, |this: &NeonF64Butterfly31<_>| this
    .direction);
boilerplate_fft_neon_common_butterfly!(NeonF64Butterfly31, 31, |this: &NeonF64Butterfly31<_>| this
    .direction);
impl<T: FftNum> NeonF64Butterfly31<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        assert_f64::<T>();
        let rotate = Rotate90F64::new(true);
        let tw1: Complex<f64> = twiddles::compute_twiddle(1, 31, direction);
        let tw2: Complex<f64> = twiddles::compute_twiddle(2, 31, direction);
        let tw3: Complex<f64> = twiddles::compute_twiddle(3, 31, direction);
        let tw4: Complex<f64> = twiddles::compute_twiddle(4, 31, direction);
        let tw5: Complex<f64> = twiddles::compute_twiddle(5, 31, direction);
        let tw6: Complex<f64> = twiddles::compute_twiddle(6, 31, direction);
        let tw7: Complex<f64> = twiddles::compute_twiddle(7, 31, direction);
        let tw8: Complex<f64> = twiddles::compute_twiddle(8, 31, direction);
        let tw9: Complex<f64> = twiddles::compute_twiddle(9, 31, direction);
        let tw10: Complex<f64> = twiddles::compute_twiddle(10, 31, direction);
        let tw11: Complex<f64> = twiddles::compute_twiddle(11, 31, direction);
        let tw12: Complex<f64> = twiddles::compute_twiddle(12, 31, direction);
        let
tw13: Complex = twiddles::compute_twiddle(13, 31, direction); let tw14: Complex = twiddles::compute_twiddle(14, 31, direction); let tw15: Complex = twiddles::compute_twiddle(15, 31, direction); let twiddle1re = unsafe { vmovq_n_f64(tw1.re) }; let twiddle1im = unsafe { vmovq_n_f64(tw1.im) }; let twiddle2re = unsafe { vmovq_n_f64(tw2.re) }; let twiddle2im = unsafe { vmovq_n_f64(tw2.im) }; let twiddle3re = unsafe { vmovq_n_f64(tw3.re) }; let twiddle3im = unsafe { vmovq_n_f64(tw3.im) }; let twiddle4re = unsafe { vmovq_n_f64(tw4.re) }; let twiddle4im = unsafe { vmovq_n_f64(tw4.im) }; let twiddle5re = unsafe { vmovq_n_f64(tw5.re) }; let twiddle5im = unsafe { vmovq_n_f64(tw5.im) }; let twiddle6re = unsafe { vmovq_n_f64(tw6.re) }; let twiddle6im = unsafe { vmovq_n_f64(tw6.im) }; let twiddle7re = unsafe { vmovq_n_f64(tw7.re) }; let twiddle7im = unsafe { vmovq_n_f64(tw7.im) }; let twiddle8re = unsafe { vmovq_n_f64(tw8.re) }; let twiddle8im = unsafe { vmovq_n_f64(tw8.im) }; let twiddle9re = unsafe { vmovq_n_f64(tw9.re) }; let twiddle9im = unsafe { vmovq_n_f64(tw9.im) }; let twiddle10re = unsafe { vmovq_n_f64(tw10.re) }; let twiddle10im = unsafe { vmovq_n_f64(tw10.im) }; let twiddle11re = unsafe { vmovq_n_f64(tw11.re) }; let twiddle11im = unsafe { vmovq_n_f64(tw11.im) }; let twiddle12re = unsafe { vmovq_n_f64(tw12.re) }; let twiddle12im = unsafe { vmovq_n_f64(tw12.im) }; let twiddle13re = unsafe { vmovq_n_f64(tw13.re) }; let twiddle13im = unsafe { vmovq_n_f64(tw13.im) }; let twiddle14re = unsafe { vmovq_n_f64(tw14.re) }; let twiddle14im = unsafe { vmovq_n_f64(tw14.im) }; let twiddle15re = unsafe { vmovq_n_f64(tw15.re) }; let twiddle15im = unsafe { vmovq_n_f64(tw15.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, twiddle12re, twiddle12im, twiddle13re, twiddle13im, twiddle14re, twiddle14im, twiddle15re, twiddle15im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl NeonArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [float64x2_t; 31]) -> [float64x2_t; 31] { let [x1p30, x1m30] = solo_fft2_f64(values[1], values[30]); let [x2p29, x2m29] = solo_fft2_f64(values[2], values[29]); let [x3p28, x3m28] = solo_fft2_f64(values[3], values[28]); let [x4p27, x4m27] = solo_fft2_f64(values[4], values[27]); let [x5p26, x5m26] = solo_fft2_f64(values[5], values[26]); let [x6p25, x6m25] = solo_fft2_f64(values[6], values[25]); let [x7p24, x7m24] = solo_fft2_f64(values[7], values[24]); let [x8p23, x8m23] = solo_fft2_f64(values[8], values[23]); let [x9p22, x9m22] = solo_fft2_f64(values[9], values[22]); let [x10p21, x10m21] = solo_fft2_f64(values[10], values[21]); let [x11p20, x11m20] = solo_fft2_f64(values[11], values[20]); let [x12p19, x12m19] = solo_fft2_f64(values[12], values[19]); let [x13p18, x13m18] = solo_fft2_f64(values[13], values[18]); let [x14p17, x14m17] = solo_fft2_f64(values[14], values[17]); let 
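        // Same size-31 decomposition as the f32 path above, but with a single complex value
        // per float64x2_t vector, so the pairing of x[j] with x[31-j] is done with
        // solo_fft2_f64 instead of the two-transforms-at-a-time parallel_fft2_interleaved_f32.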
[x15p16, x15m16] = solo_fft2_f64(values[15], values[16]); let t_a1_1 = vmulq_f64(self.twiddle1re, x1p30); let t_a1_2 = vmulq_f64(self.twiddle2re, x2p29); let t_a1_3 = vmulq_f64(self.twiddle3re, x3p28); let t_a1_4 = vmulq_f64(self.twiddle4re, x4p27); let t_a1_5 = vmulq_f64(self.twiddle5re, x5p26); let t_a1_6 = vmulq_f64(self.twiddle6re, x6p25); let t_a1_7 = vmulq_f64(self.twiddle7re, x7p24); let t_a1_8 = vmulq_f64(self.twiddle8re, x8p23); let t_a1_9 = vmulq_f64(self.twiddle9re, x9p22); let t_a1_10 = vmulq_f64(self.twiddle10re, x10p21); let t_a1_11 = vmulq_f64(self.twiddle11re, x11p20); let t_a1_12 = vmulq_f64(self.twiddle12re, x12p19); let t_a1_13 = vmulq_f64(self.twiddle13re, x13p18); let t_a1_14 = vmulq_f64(self.twiddle14re, x14p17); let t_a1_15 = vmulq_f64(self.twiddle15re, x15p16); let t_a2_1 = vmulq_f64(self.twiddle2re, x1p30); let t_a2_2 = vmulq_f64(self.twiddle4re, x2p29); let t_a2_3 = vmulq_f64(self.twiddle6re, x3p28); let t_a2_4 = vmulq_f64(self.twiddle8re, x4p27); let t_a2_5 = vmulq_f64(self.twiddle10re, x5p26); let t_a2_6 = vmulq_f64(self.twiddle12re, x6p25); let t_a2_7 = vmulq_f64(self.twiddle14re, x7p24); let t_a2_8 = vmulq_f64(self.twiddle15re, x8p23); let t_a2_9 = vmulq_f64(self.twiddle13re, x9p22); let t_a2_10 = vmulq_f64(self.twiddle11re, x10p21); let t_a2_11 = vmulq_f64(self.twiddle9re, x11p20); let t_a2_12 = vmulq_f64(self.twiddle7re, x12p19); let t_a2_13 = vmulq_f64(self.twiddle5re, x13p18); let t_a2_14 = vmulq_f64(self.twiddle3re, x14p17); let t_a2_15 = vmulq_f64(self.twiddle1re, x15p16); let t_a3_1 = vmulq_f64(self.twiddle3re, x1p30); let t_a3_2 = vmulq_f64(self.twiddle6re, x2p29); let t_a3_3 = vmulq_f64(self.twiddle9re, x3p28); let t_a3_4 = vmulq_f64(self.twiddle12re, x4p27); let t_a3_5 = vmulq_f64(self.twiddle15re, x5p26); let t_a3_6 = vmulq_f64(self.twiddle13re, x6p25); let t_a3_7 = vmulq_f64(self.twiddle10re, x7p24); let t_a3_8 = vmulq_f64(self.twiddle7re, x8p23); let t_a3_9 = vmulq_f64(self.twiddle4re, x9p22); let t_a3_10 = vmulq_f64(self.twiddle1re, x10p21); let t_a3_11 = vmulq_f64(self.twiddle2re, x11p20); let t_a3_12 = vmulq_f64(self.twiddle5re, x12p19); let t_a3_13 = vmulq_f64(self.twiddle8re, x13p18); let t_a3_14 = vmulq_f64(self.twiddle11re, x14p17); let t_a3_15 = vmulq_f64(self.twiddle14re, x15p16); let t_a4_1 = vmulq_f64(self.twiddle4re, x1p30); let t_a4_2 = vmulq_f64(self.twiddle8re, x2p29); let t_a4_3 = vmulq_f64(self.twiddle12re, x3p28); let t_a4_4 = vmulq_f64(self.twiddle15re, x4p27); let t_a4_5 = vmulq_f64(self.twiddle11re, x5p26); let t_a4_6 = vmulq_f64(self.twiddle7re, x6p25); let t_a4_7 = vmulq_f64(self.twiddle3re, x7p24); let t_a4_8 = vmulq_f64(self.twiddle1re, x8p23); let t_a4_9 = vmulq_f64(self.twiddle5re, x9p22); let t_a4_10 = vmulq_f64(self.twiddle9re, x10p21); let t_a4_11 = vmulq_f64(self.twiddle13re, x11p20); let t_a4_12 = vmulq_f64(self.twiddle14re, x12p19); let t_a4_13 = vmulq_f64(self.twiddle10re, x13p18); let t_a4_14 = vmulq_f64(self.twiddle6re, x14p17); let t_a4_15 = vmulq_f64(self.twiddle2re, x15p16); let t_a5_1 = vmulq_f64(self.twiddle5re, x1p30); let t_a5_2 = vmulq_f64(self.twiddle10re, x2p29); let t_a5_3 = vmulq_f64(self.twiddle15re, x3p28); let t_a5_4 = vmulq_f64(self.twiddle11re, x4p27); let t_a5_5 = vmulq_f64(self.twiddle6re, x5p26); let t_a5_6 = vmulq_f64(self.twiddle1re, x6p25); let t_a5_7 = vmulq_f64(self.twiddle4re, x7p24); let t_a5_8 = vmulq_f64(self.twiddle9re, x8p23); let t_a5_9 = vmulq_f64(self.twiddle14re, x9p22); let t_a5_10 = vmulq_f64(self.twiddle12re, x10p21); let t_a5_11 = vmulq_f64(self.twiddle7re, x11p20); let 
t_a5_12 = vmulq_f64(self.twiddle2re, x12p19); let t_a5_13 = vmulq_f64(self.twiddle3re, x13p18); let t_a5_14 = vmulq_f64(self.twiddle8re, x14p17); let t_a5_15 = vmulq_f64(self.twiddle13re, x15p16); let t_a6_1 = vmulq_f64(self.twiddle6re, x1p30); let t_a6_2 = vmulq_f64(self.twiddle12re, x2p29); let t_a6_3 = vmulq_f64(self.twiddle13re, x3p28); let t_a6_4 = vmulq_f64(self.twiddle7re, x4p27); let t_a6_5 = vmulq_f64(self.twiddle1re, x5p26); let t_a6_6 = vmulq_f64(self.twiddle5re, x6p25); let t_a6_7 = vmulq_f64(self.twiddle11re, x7p24); let t_a6_8 = vmulq_f64(self.twiddle14re, x8p23); let t_a6_9 = vmulq_f64(self.twiddle8re, x9p22); let t_a6_10 = vmulq_f64(self.twiddle2re, x10p21); let t_a6_11 = vmulq_f64(self.twiddle4re, x11p20); let t_a6_12 = vmulq_f64(self.twiddle10re, x12p19); let t_a6_13 = vmulq_f64(self.twiddle15re, x13p18); let t_a6_14 = vmulq_f64(self.twiddle9re, x14p17); let t_a6_15 = vmulq_f64(self.twiddle3re, x15p16); let t_a7_1 = vmulq_f64(self.twiddle7re, x1p30); let t_a7_2 = vmulq_f64(self.twiddle14re, x2p29); let t_a7_3 = vmulq_f64(self.twiddle10re, x3p28); let t_a7_4 = vmulq_f64(self.twiddle3re, x4p27); let t_a7_5 = vmulq_f64(self.twiddle4re, x5p26); let t_a7_6 = vmulq_f64(self.twiddle11re, x6p25); let t_a7_7 = vmulq_f64(self.twiddle13re, x7p24); let t_a7_8 = vmulq_f64(self.twiddle6re, x8p23); let t_a7_9 = vmulq_f64(self.twiddle1re, x9p22); let t_a7_10 = vmulq_f64(self.twiddle8re, x10p21); let t_a7_11 = vmulq_f64(self.twiddle15re, x11p20); let t_a7_12 = vmulq_f64(self.twiddle9re, x12p19); let t_a7_13 = vmulq_f64(self.twiddle2re, x13p18); let t_a7_14 = vmulq_f64(self.twiddle5re, x14p17); let t_a7_15 = vmulq_f64(self.twiddle12re, x15p16); let t_a8_1 = vmulq_f64(self.twiddle8re, x1p30); let t_a8_2 = vmulq_f64(self.twiddle15re, x2p29); let t_a8_3 = vmulq_f64(self.twiddle7re, x3p28); let t_a8_4 = vmulq_f64(self.twiddle1re, x4p27); let t_a8_5 = vmulq_f64(self.twiddle9re, x5p26); let t_a8_6 = vmulq_f64(self.twiddle14re, x6p25); let t_a8_7 = vmulq_f64(self.twiddle6re, x7p24); let t_a8_8 = vmulq_f64(self.twiddle2re, x8p23); let t_a8_9 = vmulq_f64(self.twiddle10re, x9p22); let t_a8_10 = vmulq_f64(self.twiddle13re, x10p21); let t_a8_11 = vmulq_f64(self.twiddle5re, x11p20); let t_a8_12 = vmulq_f64(self.twiddle3re, x12p19); let t_a8_13 = vmulq_f64(self.twiddle11re, x13p18); let t_a8_14 = vmulq_f64(self.twiddle12re, x14p17); let t_a8_15 = vmulq_f64(self.twiddle4re, x15p16); let t_a9_1 = vmulq_f64(self.twiddle9re, x1p30); let t_a9_2 = vmulq_f64(self.twiddle13re, x2p29); let t_a9_3 = vmulq_f64(self.twiddle4re, x3p28); let t_a9_4 = vmulq_f64(self.twiddle5re, x4p27); let t_a9_5 = vmulq_f64(self.twiddle14re, x5p26); let t_a9_6 = vmulq_f64(self.twiddle8re, x6p25); let t_a9_7 = vmulq_f64(self.twiddle1re, x7p24); let t_a9_8 = vmulq_f64(self.twiddle10re, x8p23); let t_a9_9 = vmulq_f64(self.twiddle12re, x9p22); let t_a9_10 = vmulq_f64(self.twiddle3re, x10p21); let t_a9_11 = vmulq_f64(self.twiddle6re, x11p20); let t_a9_12 = vmulq_f64(self.twiddle15re, x12p19); let t_a9_13 = vmulq_f64(self.twiddle7re, x13p18); let t_a9_14 = vmulq_f64(self.twiddle2re, x14p17); let t_a9_15 = vmulq_f64(self.twiddle11re, x15p16); let t_a10_1 = vmulq_f64(self.twiddle10re, x1p30); let t_a10_2 = vmulq_f64(self.twiddle11re, x2p29); let t_a10_3 = vmulq_f64(self.twiddle1re, x3p28); let t_a10_4 = vmulq_f64(self.twiddle9re, x4p27); let t_a10_5 = vmulq_f64(self.twiddle12re, x5p26); let t_a10_6 = vmulq_f64(self.twiddle2re, x6p25); let t_a10_7 = vmulq_f64(self.twiddle8re, x7p24); let t_a10_8 = vmulq_f64(self.twiddle13re, x8p23); let 
t_a10_9 = vmulq_f64(self.twiddle3re, x9p22); let t_a10_10 = vmulq_f64(self.twiddle7re, x10p21); let t_a10_11 = vmulq_f64(self.twiddle14re, x11p20); let t_a10_12 = vmulq_f64(self.twiddle4re, x12p19); let t_a10_13 = vmulq_f64(self.twiddle6re, x13p18); let t_a10_14 = vmulq_f64(self.twiddle15re, x14p17); let t_a10_15 = vmulq_f64(self.twiddle5re, x15p16); let t_a11_1 = vmulq_f64(self.twiddle11re, x1p30); let t_a11_2 = vmulq_f64(self.twiddle9re, x2p29); let t_a11_3 = vmulq_f64(self.twiddle2re, x3p28); let t_a11_4 = vmulq_f64(self.twiddle13re, x4p27); let t_a11_5 = vmulq_f64(self.twiddle7re, x5p26); let t_a11_6 = vmulq_f64(self.twiddle4re, x6p25); let t_a11_7 = vmulq_f64(self.twiddle15re, x7p24); let t_a11_8 = vmulq_f64(self.twiddle5re, x8p23); let t_a11_9 = vmulq_f64(self.twiddle6re, x9p22); let t_a11_10 = vmulq_f64(self.twiddle14re, x10p21); let t_a11_11 = vmulq_f64(self.twiddle3re, x11p20); let t_a11_12 = vmulq_f64(self.twiddle8re, x12p19); let t_a11_13 = vmulq_f64(self.twiddle12re, x13p18); let t_a11_14 = vmulq_f64(self.twiddle1re, x14p17); let t_a11_15 = vmulq_f64(self.twiddle10re, x15p16); let t_a12_1 = vmulq_f64(self.twiddle12re, x1p30); let t_a12_2 = vmulq_f64(self.twiddle7re, x2p29); let t_a12_3 = vmulq_f64(self.twiddle5re, x3p28); let t_a12_4 = vmulq_f64(self.twiddle14re, x4p27); let t_a12_5 = vmulq_f64(self.twiddle2re, x5p26); let t_a12_6 = vmulq_f64(self.twiddle10re, x6p25); let t_a12_7 = vmulq_f64(self.twiddle9re, x7p24); let t_a12_8 = vmulq_f64(self.twiddle3re, x8p23); let t_a12_9 = vmulq_f64(self.twiddle15re, x9p22); let t_a12_10 = vmulq_f64(self.twiddle4re, x10p21); let t_a12_11 = vmulq_f64(self.twiddle8re, x11p20); let t_a12_12 = vmulq_f64(self.twiddle11re, x12p19); let t_a12_13 = vmulq_f64(self.twiddle1re, x13p18); let t_a12_14 = vmulq_f64(self.twiddle13re, x14p17); let t_a12_15 = vmulq_f64(self.twiddle6re, x15p16); let t_a13_1 = vmulq_f64(self.twiddle13re, x1p30); let t_a13_2 = vmulq_f64(self.twiddle5re, x2p29); let t_a13_3 = vmulq_f64(self.twiddle8re, x3p28); let t_a13_4 = vmulq_f64(self.twiddle10re, x4p27); let t_a13_5 = vmulq_f64(self.twiddle3re, x5p26); let t_a13_6 = vmulq_f64(self.twiddle15re, x6p25); let t_a13_7 = vmulq_f64(self.twiddle2re, x7p24); let t_a13_8 = vmulq_f64(self.twiddle11re, x8p23); let t_a13_9 = vmulq_f64(self.twiddle7re, x9p22); let t_a13_10 = vmulq_f64(self.twiddle6re, x10p21); let t_a13_11 = vmulq_f64(self.twiddle12re, x11p20); let t_a13_12 = vmulq_f64(self.twiddle1re, x12p19); let t_a13_13 = vmulq_f64(self.twiddle14re, x13p18); let t_a13_14 = vmulq_f64(self.twiddle4re, x14p17); let t_a13_15 = vmulq_f64(self.twiddle9re, x15p16); let t_a14_1 = vmulq_f64(self.twiddle14re, x1p30); let t_a14_2 = vmulq_f64(self.twiddle3re, x2p29); let t_a14_3 = vmulq_f64(self.twiddle11re, x3p28); let t_a14_4 = vmulq_f64(self.twiddle6re, x4p27); let t_a14_5 = vmulq_f64(self.twiddle8re, x5p26); let t_a14_6 = vmulq_f64(self.twiddle9re, x6p25); let t_a14_7 = vmulq_f64(self.twiddle5re, x7p24); let t_a14_8 = vmulq_f64(self.twiddle12re, x8p23); let t_a14_9 = vmulq_f64(self.twiddle2re, x9p22); let t_a14_10 = vmulq_f64(self.twiddle15re, x10p21); let t_a14_11 = vmulq_f64(self.twiddle1re, x11p20); let t_a14_12 = vmulq_f64(self.twiddle13re, x12p19); let t_a14_13 = vmulq_f64(self.twiddle4re, x13p18); let t_a14_14 = vmulq_f64(self.twiddle10re, x14p17); let t_a14_15 = vmulq_f64(self.twiddle7re, x15p16); let t_a15_1 = vmulq_f64(self.twiddle15re, x1p30); let t_a15_2 = vmulq_f64(self.twiddle1re, x2p29); let t_a15_3 = vmulq_f64(self.twiddle14re, x3p28); let t_a15_4 = vmulq_f64(self.twiddle2re, 
x4p27); let t_a15_5 = vmulq_f64(self.twiddle13re, x5p26); let t_a15_6 = vmulq_f64(self.twiddle3re, x6p25); let t_a15_7 = vmulq_f64(self.twiddle12re, x7p24); let t_a15_8 = vmulq_f64(self.twiddle4re, x8p23); let t_a15_9 = vmulq_f64(self.twiddle11re, x9p22); let t_a15_10 = vmulq_f64(self.twiddle5re, x10p21); let t_a15_11 = vmulq_f64(self.twiddle10re, x11p20); let t_a15_12 = vmulq_f64(self.twiddle6re, x12p19); let t_a15_13 = vmulq_f64(self.twiddle9re, x13p18); let t_a15_14 = vmulq_f64(self.twiddle7re, x14p17); let t_a15_15 = vmulq_f64(self.twiddle8re, x15p16); let t_b1_1 = vmulq_f64(self.twiddle1im, x1m30); let t_b1_2 = vmulq_f64(self.twiddle2im, x2m29); let t_b1_3 = vmulq_f64(self.twiddle3im, x3m28); let t_b1_4 = vmulq_f64(self.twiddle4im, x4m27); let t_b1_5 = vmulq_f64(self.twiddle5im, x5m26); let t_b1_6 = vmulq_f64(self.twiddle6im, x6m25); let t_b1_7 = vmulq_f64(self.twiddle7im, x7m24); let t_b1_8 = vmulq_f64(self.twiddle8im, x8m23); let t_b1_9 = vmulq_f64(self.twiddle9im, x9m22); let t_b1_10 = vmulq_f64(self.twiddle10im, x10m21); let t_b1_11 = vmulq_f64(self.twiddle11im, x11m20); let t_b1_12 = vmulq_f64(self.twiddle12im, x12m19); let t_b1_13 = vmulq_f64(self.twiddle13im, x13m18); let t_b1_14 = vmulq_f64(self.twiddle14im, x14m17); let t_b1_15 = vmulq_f64(self.twiddle15im, x15m16); let t_b2_1 = vmulq_f64(self.twiddle2im, x1m30); let t_b2_2 = vmulq_f64(self.twiddle4im, x2m29); let t_b2_3 = vmulq_f64(self.twiddle6im, x3m28); let t_b2_4 = vmulq_f64(self.twiddle8im, x4m27); let t_b2_5 = vmulq_f64(self.twiddle10im, x5m26); let t_b2_6 = vmulq_f64(self.twiddle12im, x6m25); let t_b2_7 = vmulq_f64(self.twiddle14im, x7m24); let t_b2_8 = vmulq_f64(self.twiddle15im, x8m23); let t_b2_9 = vmulq_f64(self.twiddle13im, x9m22); let t_b2_10 = vmulq_f64(self.twiddle11im, x10m21); let t_b2_11 = vmulq_f64(self.twiddle9im, x11m20); let t_b2_12 = vmulq_f64(self.twiddle7im, x12m19); let t_b2_13 = vmulq_f64(self.twiddle5im, x13m18); let t_b2_14 = vmulq_f64(self.twiddle3im, x14m17); let t_b2_15 = vmulq_f64(self.twiddle1im, x15m16); let t_b3_1 = vmulq_f64(self.twiddle3im, x1m30); let t_b3_2 = vmulq_f64(self.twiddle6im, x2m29); let t_b3_3 = vmulq_f64(self.twiddle9im, x3m28); let t_b3_4 = vmulq_f64(self.twiddle12im, x4m27); let t_b3_5 = vmulq_f64(self.twiddle15im, x5m26); let t_b3_6 = vmulq_f64(self.twiddle13im, x6m25); let t_b3_7 = vmulq_f64(self.twiddle10im, x7m24); let t_b3_8 = vmulq_f64(self.twiddle7im, x8m23); let t_b3_9 = vmulq_f64(self.twiddle4im, x9m22); let t_b3_10 = vmulq_f64(self.twiddle1im, x10m21); let t_b3_11 = vmulq_f64(self.twiddle2im, x11m20); let t_b3_12 = vmulq_f64(self.twiddle5im, x12m19); let t_b3_13 = vmulq_f64(self.twiddle8im, x13m18); let t_b3_14 = vmulq_f64(self.twiddle11im, x14m17); let t_b3_15 = vmulq_f64(self.twiddle14im, x15m16); let t_b4_1 = vmulq_f64(self.twiddle4im, x1m30); let t_b4_2 = vmulq_f64(self.twiddle8im, x2m29); let t_b4_3 = vmulq_f64(self.twiddle12im, x3m28); let t_b4_4 = vmulq_f64(self.twiddle15im, x4m27); let t_b4_5 = vmulq_f64(self.twiddle11im, x5m26); let t_b4_6 = vmulq_f64(self.twiddle7im, x6m25); let t_b4_7 = vmulq_f64(self.twiddle3im, x7m24); let t_b4_8 = vmulq_f64(self.twiddle1im, x8m23); let t_b4_9 = vmulq_f64(self.twiddle5im, x9m22); let t_b4_10 = vmulq_f64(self.twiddle9im, x10m21); let t_b4_11 = vmulq_f64(self.twiddle13im, x11m20); let t_b4_12 = vmulq_f64(self.twiddle14im, x12m19); let t_b4_13 = vmulq_f64(self.twiddle10im, x13m18); let t_b4_14 = vmulq_f64(self.twiddle6im, x14m17); let t_b4_15 = vmulq_f64(self.twiddle2im, x15m16); let t_b5_1 = vmulq_f64(self.twiddle5im, 
x1m30); let t_b5_2 = vmulq_f64(self.twiddle10im, x2m29); let t_b5_3 = vmulq_f64(self.twiddle15im, x3m28); let t_b5_4 = vmulq_f64(self.twiddle11im, x4m27); let t_b5_5 = vmulq_f64(self.twiddle6im, x5m26); let t_b5_6 = vmulq_f64(self.twiddle1im, x6m25); let t_b5_7 = vmulq_f64(self.twiddle4im, x7m24); let t_b5_8 = vmulq_f64(self.twiddle9im, x8m23); let t_b5_9 = vmulq_f64(self.twiddle14im, x9m22); let t_b5_10 = vmulq_f64(self.twiddle12im, x10m21); let t_b5_11 = vmulq_f64(self.twiddle7im, x11m20); let t_b5_12 = vmulq_f64(self.twiddle2im, x12m19); let t_b5_13 = vmulq_f64(self.twiddle3im, x13m18); let t_b5_14 = vmulq_f64(self.twiddle8im, x14m17); let t_b5_15 = vmulq_f64(self.twiddle13im, x15m16); let t_b6_1 = vmulq_f64(self.twiddle6im, x1m30); let t_b6_2 = vmulq_f64(self.twiddle12im, x2m29); let t_b6_3 = vmulq_f64(self.twiddle13im, x3m28); let t_b6_4 = vmulq_f64(self.twiddle7im, x4m27); let t_b6_5 = vmulq_f64(self.twiddle1im, x5m26); let t_b6_6 = vmulq_f64(self.twiddle5im, x6m25); let t_b6_7 = vmulq_f64(self.twiddle11im, x7m24); let t_b6_8 = vmulq_f64(self.twiddle14im, x8m23); let t_b6_9 = vmulq_f64(self.twiddle8im, x9m22); let t_b6_10 = vmulq_f64(self.twiddle2im, x10m21); let t_b6_11 = vmulq_f64(self.twiddle4im, x11m20); let t_b6_12 = vmulq_f64(self.twiddle10im, x12m19); let t_b6_13 = vmulq_f64(self.twiddle15im, x13m18); let t_b6_14 = vmulq_f64(self.twiddle9im, x14m17); let t_b6_15 = vmulq_f64(self.twiddle3im, x15m16); let t_b7_1 = vmulq_f64(self.twiddle7im, x1m30); let t_b7_2 = vmulq_f64(self.twiddle14im, x2m29); let t_b7_3 = vmulq_f64(self.twiddle10im, x3m28); let t_b7_4 = vmulq_f64(self.twiddle3im, x4m27); let t_b7_5 = vmulq_f64(self.twiddle4im, x5m26); let t_b7_6 = vmulq_f64(self.twiddle11im, x6m25); let t_b7_7 = vmulq_f64(self.twiddle13im, x7m24); let t_b7_8 = vmulq_f64(self.twiddle6im, x8m23); let t_b7_9 = vmulq_f64(self.twiddle1im, x9m22); let t_b7_10 = vmulq_f64(self.twiddle8im, x10m21); let t_b7_11 = vmulq_f64(self.twiddle15im, x11m20); let t_b7_12 = vmulq_f64(self.twiddle9im, x12m19); let t_b7_13 = vmulq_f64(self.twiddle2im, x13m18); let t_b7_14 = vmulq_f64(self.twiddle5im, x14m17); let t_b7_15 = vmulq_f64(self.twiddle12im, x15m16); let t_b8_1 = vmulq_f64(self.twiddle8im, x1m30); let t_b8_2 = vmulq_f64(self.twiddle15im, x2m29); let t_b8_3 = vmulq_f64(self.twiddle7im, x3m28); let t_b8_4 = vmulq_f64(self.twiddle1im, x4m27); let t_b8_5 = vmulq_f64(self.twiddle9im, x5m26); let t_b8_6 = vmulq_f64(self.twiddle14im, x6m25); let t_b8_7 = vmulq_f64(self.twiddle6im, x7m24); let t_b8_8 = vmulq_f64(self.twiddle2im, x8m23); let t_b8_9 = vmulq_f64(self.twiddle10im, x9m22); let t_b8_10 = vmulq_f64(self.twiddle13im, x10m21); let t_b8_11 = vmulq_f64(self.twiddle5im, x11m20); let t_b8_12 = vmulq_f64(self.twiddle3im, x12m19); let t_b8_13 = vmulq_f64(self.twiddle11im, x13m18); let t_b8_14 = vmulq_f64(self.twiddle12im, x14m17); let t_b8_15 = vmulq_f64(self.twiddle4im, x15m16); let t_b9_1 = vmulq_f64(self.twiddle9im, x1m30); let t_b9_2 = vmulq_f64(self.twiddle13im, x2m29); let t_b9_3 = vmulq_f64(self.twiddle4im, x3m28); let t_b9_4 = vmulq_f64(self.twiddle5im, x4m27); let t_b9_5 = vmulq_f64(self.twiddle14im, x5m26); let t_b9_6 = vmulq_f64(self.twiddle8im, x6m25); let t_b9_7 = vmulq_f64(self.twiddle1im, x7m24); let t_b9_8 = vmulq_f64(self.twiddle10im, x8m23); let t_b9_9 = vmulq_f64(self.twiddle12im, x9m22); let t_b9_10 = vmulq_f64(self.twiddle3im, x10m21); let t_b9_11 = vmulq_f64(self.twiddle6im, x11m20); let t_b9_12 = vmulq_f64(self.twiddle15im, x12m19); let t_b9_13 = vmulq_f64(self.twiddle7im, x13m18); let 
t_b9_14 = vmulq_f64(self.twiddle2im, x14m17); let t_b9_15 = vmulq_f64(self.twiddle11im, x15m16); let t_b10_1 = vmulq_f64(self.twiddle10im, x1m30); let t_b10_2 = vmulq_f64(self.twiddle11im, x2m29); let t_b10_3 = vmulq_f64(self.twiddle1im, x3m28); let t_b10_4 = vmulq_f64(self.twiddle9im, x4m27); let t_b10_5 = vmulq_f64(self.twiddle12im, x5m26); let t_b10_6 = vmulq_f64(self.twiddle2im, x6m25); let t_b10_7 = vmulq_f64(self.twiddle8im, x7m24); let t_b10_8 = vmulq_f64(self.twiddle13im, x8m23); let t_b10_9 = vmulq_f64(self.twiddle3im, x9m22); let t_b10_10 = vmulq_f64(self.twiddle7im, x10m21); let t_b10_11 = vmulq_f64(self.twiddle14im, x11m20); let t_b10_12 = vmulq_f64(self.twiddle4im, x12m19); let t_b10_13 = vmulq_f64(self.twiddle6im, x13m18); let t_b10_14 = vmulq_f64(self.twiddle15im, x14m17); let t_b10_15 = vmulq_f64(self.twiddle5im, x15m16); let t_b11_1 = vmulq_f64(self.twiddle11im, x1m30); let t_b11_2 = vmulq_f64(self.twiddle9im, x2m29); let t_b11_3 = vmulq_f64(self.twiddle2im, x3m28); let t_b11_4 = vmulq_f64(self.twiddle13im, x4m27); let t_b11_5 = vmulq_f64(self.twiddle7im, x5m26); let t_b11_6 = vmulq_f64(self.twiddle4im, x6m25); let t_b11_7 = vmulq_f64(self.twiddle15im, x7m24); let t_b11_8 = vmulq_f64(self.twiddle5im, x8m23); let t_b11_9 = vmulq_f64(self.twiddle6im, x9m22); let t_b11_10 = vmulq_f64(self.twiddle14im, x10m21); let t_b11_11 = vmulq_f64(self.twiddle3im, x11m20); let t_b11_12 = vmulq_f64(self.twiddle8im, x12m19); let t_b11_13 = vmulq_f64(self.twiddle12im, x13m18); let t_b11_14 = vmulq_f64(self.twiddle1im, x14m17); let t_b11_15 = vmulq_f64(self.twiddle10im, x15m16); let t_b12_1 = vmulq_f64(self.twiddle12im, x1m30); let t_b12_2 = vmulq_f64(self.twiddle7im, x2m29); let t_b12_3 = vmulq_f64(self.twiddle5im, x3m28); let t_b12_4 = vmulq_f64(self.twiddle14im, x4m27); let t_b12_5 = vmulq_f64(self.twiddle2im, x5m26); let t_b12_6 = vmulq_f64(self.twiddle10im, x6m25); let t_b12_7 = vmulq_f64(self.twiddle9im, x7m24); let t_b12_8 = vmulq_f64(self.twiddle3im, x8m23); let t_b12_9 = vmulq_f64(self.twiddle15im, x9m22); let t_b12_10 = vmulq_f64(self.twiddle4im, x10m21); let t_b12_11 = vmulq_f64(self.twiddle8im, x11m20); let t_b12_12 = vmulq_f64(self.twiddle11im, x12m19); let t_b12_13 = vmulq_f64(self.twiddle1im, x13m18); let t_b12_14 = vmulq_f64(self.twiddle13im, x14m17); let t_b12_15 = vmulq_f64(self.twiddle6im, x15m16); let t_b13_1 = vmulq_f64(self.twiddle13im, x1m30); let t_b13_2 = vmulq_f64(self.twiddle5im, x2m29); let t_b13_3 = vmulq_f64(self.twiddle8im, x3m28); let t_b13_4 = vmulq_f64(self.twiddle10im, x4m27); let t_b13_5 = vmulq_f64(self.twiddle3im, x5m26); let t_b13_6 = vmulq_f64(self.twiddle15im, x6m25); let t_b13_7 = vmulq_f64(self.twiddle2im, x7m24); let t_b13_8 = vmulq_f64(self.twiddle11im, x8m23); let t_b13_9 = vmulq_f64(self.twiddle7im, x9m22); let t_b13_10 = vmulq_f64(self.twiddle6im, x10m21); let t_b13_11 = vmulq_f64(self.twiddle12im, x11m20); let t_b13_12 = vmulq_f64(self.twiddle1im, x12m19); let t_b13_13 = vmulq_f64(self.twiddle14im, x13m18); let t_b13_14 = vmulq_f64(self.twiddle4im, x14m17); let t_b13_15 = vmulq_f64(self.twiddle9im, x15m16); let t_b14_1 = vmulq_f64(self.twiddle14im, x1m30); let t_b14_2 = vmulq_f64(self.twiddle3im, x2m29); let t_b14_3 = vmulq_f64(self.twiddle11im, x3m28); let t_b14_4 = vmulq_f64(self.twiddle6im, x4m27); let t_b14_5 = vmulq_f64(self.twiddle8im, x5m26); let t_b14_6 = vmulq_f64(self.twiddle9im, x6m25); let t_b14_7 = vmulq_f64(self.twiddle5im, x7m24); let t_b14_8 = vmulq_f64(self.twiddle12im, x8m23); let t_b14_9 = vmulq_f64(self.twiddle2im, x9m22); 
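        // Note on the scrambled twiddle ordering in these t_b products: the required factor is
        // im(W^(k*j)) with the exponent reduced mod 31, and exponents above 15 are folded back
        // to 31 - exponent using conjugate symmetry (im(W^(31-m)) = -im(W^m)). Only the folded
        // index is used here; the sign the fold introduces is applied later, as the +/-
        // pattern in the calc_f64! chains that accumulate t_b1..t_b15.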
let t_b14_10 = vmulq_f64(self.twiddle15im, x10m21); let t_b14_11 = vmulq_f64(self.twiddle1im, x11m20); let t_b14_12 = vmulq_f64(self.twiddle13im, x12m19); let t_b14_13 = vmulq_f64(self.twiddle4im, x13m18); let t_b14_14 = vmulq_f64(self.twiddle10im, x14m17); let t_b14_15 = vmulq_f64(self.twiddle7im, x15m16); let t_b15_1 = vmulq_f64(self.twiddle15im, x1m30); let t_b15_2 = vmulq_f64(self.twiddle1im, x2m29); let t_b15_3 = vmulq_f64(self.twiddle14im, x3m28); let t_b15_4 = vmulq_f64(self.twiddle2im, x4m27); let t_b15_5 = vmulq_f64(self.twiddle13im, x5m26); let t_b15_6 = vmulq_f64(self.twiddle3im, x6m25); let t_b15_7 = vmulq_f64(self.twiddle12im, x7m24); let t_b15_8 = vmulq_f64(self.twiddle4im, x8m23); let t_b15_9 = vmulq_f64(self.twiddle11im, x9m22); let t_b15_10 = vmulq_f64(self.twiddle5im, x10m21); let t_b15_11 = vmulq_f64(self.twiddle10im, x11m20); let t_b15_12 = vmulq_f64(self.twiddle6im, x12m19); let t_b15_13 = vmulq_f64(self.twiddle9im, x13m18); let t_b15_14 = vmulq_f64(self.twiddle7im, x14m17); let t_b15_15 = vmulq_f64(self.twiddle8im, x15m16); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14 + t_a1_15); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14 + t_a2_15); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14 + t_a3_15); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14 + t_a4_15); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14 + t_a5_15); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14 + t_a6_15); let t_a7 = calc_f64!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14 + t_a7_15); let t_a8 = calc_f64!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14 + t_a8_15); let t_a9 = calc_f64!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14 + t_a9_15); let t_a10 = calc_f64!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14 + t_a10_15); let t_a11 = calc_f64!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14 + t_a11_15); let t_a12 = calc_f64!(x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14 + t_a12_15); let t_a13 = calc_f64!(x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14 + t_a13_15); let t_a14 = calc_f64!(x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14 + t_a14_15); let t_a15 = calc_f64!(x0 + t_a15_1 + t_a15_2 + t_a15_3 
+ t_a15_4 + t_a15_5 + t_a15_6 + t_a15_7 + t_a15_8 + t_a15_9 + t_a15_10 + t_a15_11 + t_a15_12 + t_a15_13 + t_a15_14 + t_a15_15); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14 + t_b1_15); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14 - t_b2_15); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 + t_b3_3 + t_b3_4 + t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 - t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14 + t_b3_15); let t_b4 = calc_f64!(t_b4_1 + t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + t_b4_9 + t_b4_10 + t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14 - t_b4_15); let t_b5 = calc_f64!(t_b5_1 + t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8 + t_b5_9 - t_b5_10 - t_b5_11 - t_b5_12 + t_b5_13 + t_b5_14 + t_b5_15); let t_b6 = calc_f64!(t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 - t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14 - t_b6_15); let t_b7 = calc_f64!(t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 + t_b7_11 - t_b7_12 - t_b7_13 + t_b7_14 + t_b7_15); let t_b8 = calc_f64!(t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 - t_b8_11 + t_b8_12 + t_b8_13 - t_b8_14 - t_b8_15); let t_b9 = calc_f64!(t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11 + t_b9_12 - t_b9_13 + t_b9_14 + t_b9_15); let t_b10 = calc_f64!(t_b10_1 - t_b10_2 - t_b10_3 + t_b10_4 - t_b10_5 - t_b10_6 + t_b10_7 - t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11 - t_b10_12 + t_b10_13 - t_b10_14 - t_b10_15); let t_b11 = calc_f64!(t_b11_1 - t_b11_2 + t_b11_3 + t_b11_4 - t_b11_5 + t_b11_6 + t_b11_7 - t_b11_8 + t_b11_9 - t_b11_10 - t_b11_11 + t_b11_12 - t_b11_13 - t_b11_14 + t_b11_15); let t_b12 = calc_f64!(t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 - t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 + t_b12_9 - t_b12_10 + t_b12_11 - t_b12_12 + t_b12_13 + t_b12_14 - t_b12_15); let t_b13 = calc_f64!(t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 - t_b13_7 + t_b13_8 - t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 + t_b13_13 - t_b13_14 + t_b13_15); let t_b14 = calc_f64!(t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 - t_b14_11 + t_b14_12 - t_b14_13 + t_b14_14 - t_b14_15); let t_b15 = calc_f64!(t_b15_1 - t_b15_2 + t_b15_3 - t_b15_4 + t_b15_5 - t_b15_6 + t_b15_7 - t_b15_8 + t_b15_9 - t_b15_10 + t_b15_11 - t_b15_12 + t_b15_13 - t_b15_14 + t_b15_15); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let t_b10_rot = self.rotate.rotate(t_b10); let t_b11_rot = self.rotate.rotate(t_b11); let t_b12_rot = self.rotate.rotate(t_b12); let t_b13_rot = self.rotate.rotate(t_b13); let t_b14_rot = self.rotate.rotate(t_b14); let t_b15_rot = self.rotate.rotate(t_b15); let y0 = calc_f64!(x0 + x1p30 + x2p29 + x3p28 + x4p27 + x5p26 + x6p25 + x7p24 + x8p23 + x9p22 + x10p21 + x11p20 + x12p19 + x13p18 + x14p17 + x15p16); let [y1, y30] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y29] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y28] = 
solo_fft2_f64(t_a3, t_b3_rot); let [y4, y27] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y26] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y25] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y24] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y23] = solo_fft2_f64(t_a8, t_b8_rot); let [y9, y22] = solo_fft2_f64(t_a9, t_b9_rot); let [y10, y21] = solo_fft2_f64(t_a10, t_b10_rot); let [y11, y20] = solo_fft2_f64(t_a11, t_b11_rot); let [y12, y19] = solo_fft2_f64(t_a12, t_b12_rot); let [y13, y18] = solo_fft2_f64(t_a13, t_b13_rot); let [y14, y17] = solo_fft2_f64(t_a14, t_b14_rot); let [y15, y16] = solo_fft2_f64(t_a15, t_b15_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, y23, y24, y25, y26, y27, y28, y29, y30] } } // _____ _____ ____ _____ ____ // |_ _| ____/ ___|_ _/ ___| // | | | _| \___ \ | | \___ \ // | | | |___ ___) || | ___) | // |_| |_____|____/ |_| |____/ // #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; //the tests for all butterflies will be identical except for the identifiers used and size //so it's ideal for a macro macro_rules! test_butterfly_32_func { ($test_name:ident, $struct_name:ident, $size:expr) => { #[test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward); check_fft_algorithm::(&butterfly, $size, FftDirection::Forward); let butterfly_direction = $struct_name::new(FftDirection::Inverse); check_fft_algorithm::(&butterfly_direction, $size, FftDirection::Inverse); } }; } test_butterfly_32_func!(test_neonf32_butterfly7, NeonF32Butterfly7, 7); test_butterfly_32_func!(test_neonf32_butterfly11, NeonF32Butterfly11, 11); test_butterfly_32_func!(test_neonf32_butterfly13, NeonF32Butterfly13, 13); test_butterfly_32_func!(test_neonf32_butterfly17, NeonF32Butterfly17, 17); test_butterfly_32_func!(test_neonf32_butterfly19, NeonF32Butterfly19, 19); test_butterfly_32_func!(test_neonf32_butterfly23, NeonF32Butterfly23, 23); test_butterfly_32_func!(test_neonf32_butterfly29, NeonF32Butterfly29, 29); test_butterfly_32_func!(test_neonf32_butterfly31, NeonF32Butterfly31, 31); //the tests for all butterflies will be identical except for the identifiers used and size //so it's ideal for a macro macro_rules! 
test_butterfly_64_func { ($test_name:ident, $struct_name:ident, $size:expr) => { #[test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward); check_fft_algorithm::(&butterfly, $size, FftDirection::Forward); let butterfly_direction = $struct_name::new(FftDirection::Inverse); check_fft_algorithm::(&butterfly_direction, $size, FftDirection::Inverse); } }; } test_butterfly_64_func!(test_neonf64_butterfly7, NeonF64Butterfly7, 7); test_butterfly_64_func!(test_neonf64_butterfly11, NeonF64Butterfly11, 11); test_butterfly_64_func!(test_neonf64_butterfly13, NeonF64Butterfly13, 13); test_butterfly_64_func!(test_neonf64_butterfly17, NeonF64Butterfly17, 17); test_butterfly_64_func!(test_neonf64_butterfly19, NeonF64Butterfly19, 19); test_butterfly_64_func!(test_neonf64_butterfly23, NeonF64Butterfly23, 23); test_butterfly_64_func!(test_neonf64_butterfly29, NeonF64Butterfly29, 29); test_butterfly_64_func!(test_neonf64_butterfly31, NeonF64Butterfly31, 31); } rustfft-6.2.0/src/neon/neon_radix4.rs000064400000000000000000000421520072674642500156700ustar 00000000000000use num_complex::Complex; use core::arch::aarch64::*; use crate::algorithm::bitreversed_transpose; use crate::array_utils; use crate::array_utils::workaround_transmute_mut; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::neon::neon_butterflies::{ NeonF32Butterfly1, NeonF32Butterfly16, NeonF32Butterfly2, NeonF32Butterfly32, NeonF32Butterfly4, NeonF32Butterfly8, }; use crate::neon::neon_butterflies::{ NeonF64Butterfly1, NeonF64Butterfly16, NeonF64Butterfly2, NeonF64Butterfly32, NeonF64Butterfly4, NeonF64Butterfly8, }; use crate::{common::FftNum, twiddles, FftDirection}; use crate::{Direction, Fft, Length}; use super::neon_common::{assert_f32, assert_f64}; use super::neon_utils::*; use super::neon_vector::{NeonArray, NeonArrayMut}; /// FFT algorithm optimized for power-of-two sizes, Neon accelerated version. /// This is designed to be used via a Planner, and not created directly. const USE_BUTTERFLY32_FROM: usize = 262144; // Use length 32 butterfly starting from this length enum Neon32Butterfly { Len1(NeonF32Butterfly1), Len2(NeonF32Butterfly2), Len4(NeonF32Butterfly4), Len8(NeonF32Butterfly8), Len16(NeonF32Butterfly16), Len32(NeonF32Butterfly32), } enum Neon64Butterfly { Len1(NeonF64Butterfly1), Len2(NeonF64Butterfly2), Len4(NeonF64Butterfly4), Len8(NeonF64Butterfly8), Len16(NeonF64Butterfly16), Len32(NeonF64Butterfly32), } pub struct Neon32Radix4 { _phantom: std::marker::PhantomData, twiddles: Box<[float32x4_t]>, base_fft: Neon32Butterfly, base_len: usize, len: usize, direction: FftDirection, bf4: NeonF32Butterfly4, } impl Neon32Radix4 { /// Preallocates necessary arrays and precomputes necessary data to efficiently compute the power-of-two FFT pub fn new(len: usize, direction: FftDirection) -> Self { assert!( len.is_power_of_two(), "Radix4 algorithm requires a power-of-two input size. 
Got {}", len ); assert_f32::(); // figure out which base length we're going to use let num_bits = len.trailing_zeros(); let (base_len, base_fft) = match num_bits { 0 => ( len, Neon32Butterfly::Len1(NeonF32Butterfly1::new(direction)), ), 1 => ( len, Neon32Butterfly::Len2(NeonF32Butterfly2::new(direction)), ), 2 => ( len, Neon32Butterfly::Len4(NeonF32Butterfly4::new(direction)), ), 3 => ( len, Neon32Butterfly::Len8(NeonF32Butterfly8::new(direction)), ), _ => { if num_bits % 2 == 1 { if len < USE_BUTTERFLY32_FROM { (8, Neon32Butterfly::Len8(NeonF32Butterfly8::new(direction))) } else { ( 32, Neon32Butterfly::Len32(NeonF32Butterfly32::new(direction)), ) } } else { ( 16, Neon32Butterfly::Len16(NeonF32Butterfly16::new(direction)), ) } } }; // precompute the twiddle factors this algorithm will use. // we're doing the same precomputation of twiddle factors as the mixed radix algorithm where width=4 and height=len/4 // but mixed radix only does one step and then calls itself recusrively, and this algorithm does every layer all the way down // so we're going to pack all the "layers" of twiddle factors into a single array, starting with the bottom layer and going up let mut twiddle_stride = len / (base_len * 4); let mut twiddle_factors = Vec::with_capacity(len * 2); while twiddle_stride > 0 { let num_rows = len / (twiddle_stride * 4); for i in 0..num_rows / 2 { for k in 1..4 { let twiddle_a = twiddles::compute_twiddle::( 2 * i * k * twiddle_stride, len, direction, ); let twiddle_b = twiddles::compute_twiddle::( (2 * i + 1) * k * twiddle_stride, len, direction, ); let twiddles_packed = unsafe { [twiddle_a, twiddle_b].as_slice().load_complex(0) }; twiddle_factors.push(twiddles_packed); } } twiddle_stride >>= 2; } Self { twiddles: twiddle_factors.into_boxed_slice(), base_fft, base_len, len, direction, _phantom: std::marker::PhantomData, bf4: NeonF32Butterfly4::::new(direction), } } //#[target_feature(enable = "neon")] unsafe fn perform_fft_out_of_place( &self, signal: &[Complex], spectrum: &mut [Complex], _scratch: &mut [Complex], ) { // copy the data into the spectrum vector if self.len() == self.base_len { spectrum.copy_from_slice(signal); } else { bitreversed_transpose(self.base_len, signal, spectrum); } // Base-level FFTs match &self.base_fft { Neon32Butterfly::Len1(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Neon32Butterfly::Len2(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Neon32Butterfly::Len4(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Neon32Butterfly::Len8(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Neon32Butterfly::Len16(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Neon32Butterfly::Len32(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), }; // cross-FFTs let mut current_size = self.base_len * 4; let mut layer_twiddles: &[float32x4_t] = &self.twiddles; while current_size <= signal.len() { let num_rows = signal.len() / current_size; for i in 0..num_rows { butterfly_4_32( &mut spectrum[i * current_size..], layer_twiddles, current_size / 4, &self.bf4, ) } //skip past all the twiddle factors used in this layer let twiddle_offset = (current_size * 3) / 8; layer_twiddles = &layer_twiddles[twiddle_offset..]; current_size *= 4; } } } boilerplate_fft_neon_oop!(Neon32Radix4, |this: &Neon32Radix4<_>| this.len); //#[target_feature(enable = "neon")] unsafe fn butterfly_4_32( data: &mut [Complex], twiddles: &[float32x4_t], num_ffts: usize, bf4: &NeonF32Butterfly4, ) { let mut idx = 0usize; let mut buffer: &mut [Complex] = 
workaround_transmute_mut(data); for tw in twiddles.chunks_exact(6).take(num_ffts / 4) { let scratch0 = buffer.load_complex(idx); let scratch0b = buffer.load_complex(idx + 2); let mut scratch1 = buffer.load_complex(idx + 1 * num_ffts); let mut scratch1b = buffer.load_complex(idx + 2 + 1 * num_ffts); let mut scratch2 = buffer.load_complex(idx + 2 * num_ffts); let mut scratch2b = buffer.load_complex(idx + 2 + 2 * num_ffts); let mut scratch3 = buffer.load_complex(idx + 3 * num_ffts); let mut scratch3b = buffer.load_complex(idx + 2 + 3 * num_ffts); scratch1 = mul_complex_f32(scratch1, tw[0]); scratch2 = mul_complex_f32(scratch2, tw[1]); scratch3 = mul_complex_f32(scratch3, tw[2]); scratch1b = mul_complex_f32(scratch1b, tw[3]); scratch2b = mul_complex_f32(scratch2b, tw[4]); scratch3b = mul_complex_f32(scratch3b, tw[5]); let scratch = bf4.perform_parallel_fft_direct(scratch0, scratch1, scratch2, scratch3); let scratchb = bf4.perform_parallel_fft_direct(scratch0b, scratch1b, scratch2b, scratch3b); buffer.store_complex(scratch[0], idx); buffer.store_complex(scratchb[0], idx + 2); buffer.store_complex(scratch[1], idx + 1 * num_ffts); buffer.store_complex(scratchb[1], idx + 2 + 1 * num_ffts); buffer.store_complex(scratch[2], idx + 2 * num_ffts); buffer.store_complex(scratchb[2], idx + 2 + 2 * num_ffts); buffer.store_complex(scratch[3], idx + 3 * num_ffts); buffer.store_complex(scratchb[3], idx + 2 + 3 * num_ffts); idx += 4; } } pub struct Neon64Radix4 { _phantom: std::marker::PhantomData, twiddles: Box<[float64x2_t]>, base_fft: Neon64Butterfly, base_len: usize, len: usize, direction: FftDirection, bf4: NeonF64Butterfly4, } impl Neon64Radix4 { /// Preallocates necessary arrays and precomputes necessary data to efficiently compute the power-of-two FFT pub fn new(len: usize, direction: FftDirection) -> Self { assert!( len.is_power_of_two(), "Radix4 algorithm requires a power-of-two input size. Got {}", len ); assert_f64::(); // figure out which base length we're going to use let num_bits = len.trailing_zeros(); let (base_len, base_fft) = match num_bits { 0 => ( len, Neon64Butterfly::Len1(NeonF64Butterfly1::new(direction)), ), 1 => ( len, Neon64Butterfly::Len2(NeonF64Butterfly2::new(direction)), ), 2 => ( len, Neon64Butterfly::Len4(NeonF64Butterfly4::new(direction)), ), 3 => ( len, Neon64Butterfly::Len8(NeonF64Butterfly8::new(direction)), ), _ => { if num_bits % 2 == 1 { if len < USE_BUTTERFLY32_FROM { (8, Neon64Butterfly::Len8(NeonF64Butterfly8::new(direction))) } else { ( 32, Neon64Butterfly::Len32(NeonF64Butterfly32::new(direction)), ) } } else { ( 16, Neon64Butterfly::Len16(NeonF64Butterfly16::new(direction)), ) } } }; // precompute the twiddle factors this algorithm will use. 
// we're doing the same precomputation of twiddle factors as the mixed radix algorithm where width=4 and height=len/4 // but mixed radix only does one step and then calls itself recusrively, and this algorithm does every layer all the way down // so we're going to pack all the "layers" of twiddle factors into a single array, starting with the bottom layer and going up let mut twiddle_stride = len / (base_len * 4); let mut twiddle_factors = Vec::with_capacity(len * 2); while twiddle_stride > 0 { let num_rows = len / (twiddle_stride * 4); for i in 0..num_rows { for k in 1..4 { let twiddle = twiddles::compute_twiddle::(i * k * twiddle_stride, len, direction); let twiddle_packed = unsafe { [twiddle].as_slice().load_complex(0) }; twiddle_factors.push(twiddle_packed); } } twiddle_stride >>= 2; } Self { twiddles: twiddle_factors.into_boxed_slice(), base_fft, base_len, len, direction, _phantom: std::marker::PhantomData, bf4: NeonF64Butterfly4::::new(direction), } } //#[target_feature(enable = "neon")] unsafe fn perform_fft_out_of_place( &self, signal: &[Complex], spectrum: &mut [Complex], _scratch: &mut [Complex], ) { // copy the data into the spectrum vector if self.len() == self.base_len { spectrum.copy_from_slice(signal); } else { bitreversed_transpose(self.base_len, signal, spectrum); } // Base-level FFTs match &self.base_fft { Neon64Butterfly::Len1(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Neon64Butterfly::Len2(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Neon64Butterfly::Len4(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Neon64Butterfly::Len8(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Neon64Butterfly::Len16(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Neon64Butterfly::Len32(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), } // cross-FFTs let mut current_size = self.base_len * 4; let mut layer_twiddles: &[float64x2_t] = &self.twiddles; while current_size <= signal.len() { let num_rows = signal.len() / current_size; for i in 0..num_rows { butterfly_4_64( &mut spectrum[i * current_size..], layer_twiddles, current_size / 4, &self.bf4, ) } //skip past all the twiddle factors used in this layer let twiddle_offset = (current_size * 3) / 4; layer_twiddles = &layer_twiddles[twiddle_offset..]; current_size *= 4; } } } boilerplate_fft_neon_oop!(Neon64Radix4, |this: &Neon64Radix4<_>| this.len); //#[target_feature(enable = "neon")] unsafe fn butterfly_4_64( data: &mut [Complex], twiddles: &[float64x2_t], num_ffts: usize, bf4: &NeonF64Butterfly4, ) { let mut idx = 0usize; let mut buffer: &mut [Complex] = workaround_transmute_mut(data); for tw in twiddles.chunks_exact(6).take(num_ffts / 2) { let scratch0 = buffer.load_complex(idx); let scratch0b = buffer.load_complex(idx + 1); let mut scratch1 = buffer.load_complex(idx + 1 * num_ffts); let mut scratch1b = buffer.load_complex(idx + 1 + 1 * num_ffts); let mut scratch2 = buffer.load_complex(idx + 2 * num_ffts); let mut scratch2b = buffer.load_complex(idx + 1 + 2 * num_ffts); let mut scratch3 = buffer.load_complex(idx + 3 * num_ffts); let mut scratch3b = buffer.load_complex(idx + 1 + 3 * num_ffts); scratch1 = mul_complex_f64(scratch1, tw[0]); scratch2 = mul_complex_f64(scratch2, tw[1]); scratch3 = mul_complex_f64(scratch3, tw[2]); scratch1b = mul_complex_f64(scratch1b, tw[3]); scratch2b = mul_complex_f64(scratch2b, tw[4]); scratch3b = mul_complex_f64(scratch3b, tw[5]); let scratch = bf4.perform_fft_direct(scratch0, scratch1, scratch2, scratch3); let scratchb = 
bf4.perform_fft_direct(scratch0b, scratch1b, scratch2b, scratch3b); buffer.store_complex(scratch[0], idx); buffer.store_complex(scratchb[0], idx + 1); buffer.store_complex(scratch[1], idx + 1 * num_ffts); buffer.store_complex(scratchb[1], idx + 1 + 1 * num_ffts); buffer.store_complex(scratch[2], idx + 2 * num_ffts); buffer.store_complex(scratchb[2], idx + 1 + 2 * num_ffts); buffer.store_complex(scratch[3], idx + 3 * num_ffts); buffer.store_complex(scratchb[3], idx + 1 + 3 * num_ffts); idx += 2; } } #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; #[test] fn test_neon_radix4_64() { for pow in 4..12 { let len = 1 << pow; test_neon_radix4_64_with_length(len, FftDirection::Forward); test_neon_radix4_64_with_length(len, FftDirection::Inverse); } } fn test_neon_radix4_64_with_length(len: usize, direction: FftDirection) { let fft = Neon64Radix4::new(len, direction); check_fft_algorithm::(&fft, len, direction); } #[test] fn test_neon_radix4_32() { for pow in 0..12 { let len = 1 << pow; test_neon_radix4_32_with_length(len, FftDirection::Forward); test_neon_radix4_32_with_length(len, FftDirection::Inverse); } } fn test_neon_radix4_32_with_length(len: usize, direction: FftDirection) { let fft = Neon32Radix4::new(len, direction); check_fft_algorithm::(&fft, len, direction); } } rustfft-6.2.0/src/neon/neon_utils.rs000064400000000000000000000227270072674642500156430ustar 00000000000000use core::arch::aarch64::*; // __ __ _ _ _________ _ _ _ // | \/ | __ _| |_| |__ |___ /___ \| |__ (_) |_ // | |\/| |/ _` | __| '_ \ _____ |_ \ __) | '_ \| | __| // | | | | (_| | |_| | | | |_____| ___) / __/| |_) | | |_ // |_| |_|\__,_|\__|_| |_| |____/_____|_.__/|_|\__| // pub struct Rotate90F32 { //sign_lo: float32x4_t, sign_hi: float32x2_t, sign_both: float32x4_t, } impl Rotate90F32 { pub fn new(positive: bool) -> Self { // There doesn't seem to be any need for rotating just the first element, but let's keep the code just in case //let sign_lo = unsafe { // if positive { // _mm_set_ps(0.0, 0.0, 0.0, -0.0) // } // else { // _mm_set_ps(0.0, 0.0, -0.0, 0.0) // } //}; let sign_hi = unsafe { if positive { vld1_f32([-0.0, 0.0].as_ptr()) } else { vld1_f32([0.0, -0.0].as_ptr()) } }; let sign_both = unsafe { if positive { vld1q_f32([-0.0, 0.0, -0.0, 0.0].as_ptr()) } else { vld1q_f32([0.0, -0.0, 0.0, -0.0].as_ptr()) } }; Self { //sign_lo, sign_hi, sign_both, } } #[inline(always)] pub unsafe fn rotate_hi(&self, values: float32x4_t) -> float32x4_t { vcombine_f32( vget_low_f32(values), vreinterpret_f32_u32(veor_u32( vrev64_u32(vreinterpret_u32_f32(vget_high_f32(values))), vreinterpret_u32_f32(self.sign_hi), )), ) } // There doesn't seem to be any need for rotating just the first element, but let's keep the code just in case //#[inline(always)] //pub unsafe fn rotate_lo(&self, values: __m128) -> __m128 { // let temp = _mm_shuffle_ps(values, values, 0xE1); // _mm_xor_ps(temp, self.sign_lo) //} #[inline(always)] pub unsafe fn rotate_both(&self, values: float32x4_t) -> float32x4_t { let temp = vrev64q_f32(values); vreinterpretq_f32_u32(veorq_u32( vreinterpretq_u32_f32(temp), vreinterpretq_u32_f32(self.sign_both), )) } } // Pack low (1st) complex // left: r1.re, r1.im, r2.re, r2.im // right: l1.re, l1.im, l2.re, l2.im // --> r1.re, r1.im, l1.re, l1.im #[inline(always)] pub unsafe fn extract_lo_lo_f32(left: float32x4_t, right: float32x4_t) -> float32x4_t { //_mm_shuffle_ps(left, right, 0x44) vreinterpretq_f32_f64(vtrn1q_f64( vreinterpretq_f64_f32(left), vreinterpretq_f64_f32(right), )) } // Pack high 
(2nd) complex // left: r1.re, r1.im, r2.re, r2.im // right: l1.re, l1.im, l2.re, l2.im // --> r2.re, r2.im, l2.re, l2.im #[inline(always)] pub unsafe fn extract_hi_hi_f32(left: float32x4_t, right: float32x4_t) -> float32x4_t { vreinterpretq_f32_f64(vtrn2q_f64( vreinterpretq_f64_f32(left), vreinterpretq_f64_f32(right), )) } // Pack low (1st) and high (2nd) complex // left: r1.re, r1.im, r2.re, r2.im // right: l1.re, l1.im, l2.re, l2.im // --> r1.re, r1.im, l2.re, l2.im #[inline(always)] pub unsafe fn extract_lo_hi_f32(left: float32x4_t, right: float32x4_t) -> float32x4_t { vcombine_f32(vget_low_f32(left), vget_high_f32(right)) } // Pack high (2nd) and low (1st) complex // left: r1.re, r1.im, r2.re, r2.im // right: l1.re, l1.im, l2.re, l2.im // --> r2.re, r2.im, l1.re, l1.im #[inline(always)] pub unsafe fn extract_hi_lo_f32(left: float32x4_t, right: float32x4_t) -> float32x4_t { vcombine_f32(vget_high_f32(left), vget_low_f32(right)) } // Reverse complex // values: a.re, a.im, b.re, b.im // --> b.re, b.im, a.re, a.im #[inline(always)] pub unsafe fn reverse_complex_elements_f32(values: float32x4_t) -> float32x4_t { vcombine_f32(vget_high_f32(values), vget_low_f32(values)) } // Reverse complex and then negate hi complex // values: a.re, a.im, b.re, b.im // --> b.re, b.im, -a.re, -a.im #[inline(always)] pub unsafe fn reverse_complex_and_negate_hi_f32(values: float32x4_t) -> float32x4_t { vcombine_f32(vget_high_f32(values), vneg_f32(vget_low_f32(values))) } // Invert sign of high (2nd) complex // values: a.re, a.im, b.re, b.im // --> a.re, a.im, -b.re, -b.im //#[inline(always)] //pub unsafe fn negate_hi_f32(values: float32x4_t) -> float32x4_t { // vcombine_f32(vget_low_f32(values), vneg_f32(vget_high_f32(values))) //} // Duplicate low (1st) complex // values: a.re, a.im, b.re, b.im // --> a.re, a.im, a.re, a.im #[inline(always)] pub unsafe fn duplicate_lo_f32(values: float32x4_t) -> float32x4_t { vreinterpretq_f32_f64(vtrn1q_f64( vreinterpretq_f64_f32(values), vreinterpretq_f64_f32(values), )) } // Duplicate high (2nd) complex // values: a.re, a.im, b.re, b.im // --> b.re, b.im, b.re, b.im #[inline(always)] pub unsafe fn duplicate_hi_f32(values: float32x4_t) -> float32x4_t { vreinterpretq_f32_f64(vtrn2q_f64( vreinterpretq_f64_f32(values), vreinterpretq_f64_f32(values), )) } // transpose a 2x2 complex matrix given as [x0, x1], [x2, x3] // result is [x0, x2], [x1, x3] #[inline(always)] pub unsafe fn transpose_complex_2x2_f32(left: float32x4_t, right: float32x4_t) -> [float32x4_t; 2] { let temp02 = extract_lo_lo_f32(left, right); let temp13 = extract_hi_hi_f32(left, right); [temp02, temp13] } // Complex multiplication. // Each input contains two complex values, which are multiplied in parallel. #[inline(always)] pub unsafe fn mul_complex_f32(left: float32x4_t, right: float32x4_t) -> float32x4_t { // ARMv8.2-A introduced vcmulq_f32 and vcmlaq_f32 for complex multiplication, these intrinsics are not yet available. 
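// As a reference for the shuffle sequence below, write left = [l0, l1] and right = [r0, r1],
// where each l/r is one complex value occupying two f32 lanes. Then:
//   temp1 = [r0.re, r0.re, r1.re, r1.re]                       (vtrn1q of right with itself)
//   temp2 = [r0.im, -r0.im, r1.im, -r1.im]                     (vtrn2q of right and -right)
//   temp3 = temp2 * left = [l0.re*r0.im, -l0.im*r0.im, l1.re*r1.im, -l1.im*r1.im]
//   temp4 = rev64(temp3) = [-l0.im*r0.im, l0.re*r0.im, -l1.im*r1.im, l1.re*r1.im]
//   result = temp4 + temp1 * left
//          = [l0.re*r0.re - l0.im*r0.im,  l0.im*r0.re + l0.re*r0.im,  ...]
// which is exactly [l0*r0, l1*r1], i.e. the two complex products computed in parallel.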
let temp1 = vtrn1q_f32(right, right); let temp2 = vtrn2q_f32(right, vnegq_f32(right)); let temp3 = vmulq_f32(temp2, left); let temp4 = vrev64q_f32(temp3); vfmaq_f32(temp4, temp1, left) } // __ __ _ _ __ _ _ _ _ _ // | \/ | __ _| |_| |__ / /_ | || | | |__ (_) |_ // | |\/| |/ _` | __| '_ \ _____ | '_ \| || |_| '_ \| | __| // | | | | (_| | |_| | | | |_____| | (_) |__ _| |_) | | |_ // |_| |_|\__,_|\__|_| |_| \___/ |_| |_.__/|_|\__| // pub(crate) struct Rotate90F64 { sign: float64x2_t, } impl Rotate90F64 { pub fn new(positive: bool) -> Self { let sign = unsafe { if positive { vld1q_f64([-0.0, 0.0].as_ptr()) } else { vld1q_f64([0.0, -0.0].as_ptr()) } }; Self { sign } } #[inline(always)] pub unsafe fn rotate(&self, values: float64x2_t) -> float64x2_t { let temp = vcombine_f64(vget_high_f64(values), vget_low_f64(values)); vreinterpretq_f64_u64(veorq_u64( vreinterpretq_u64_f64(temp), vreinterpretq_u64_f64(self.sign), )) } } #[inline(always)] pub unsafe fn mul_complex_f64(left: float64x2_t, right: float64x2_t) -> float64x2_t { // ARMv8.2-A introduced vcmulq_f64 and vcmlaq_f64 for complex multiplication, these intrinsics are not yet available. let temp = vcombine_f64(vneg_f64(vget_high_f64(left)), vget_low_f64(left)); let sum = vmulq_laneq_f64::<0>(left, right); vfmaq_laneq_f64::<1>(sum, temp, right) } #[cfg(test)] mod unit_tests { use super::*; use num_complex::Complex; #[test] fn test_mul_complex_f64() { unsafe { let right = vld1q_f64([1.0, 2.0].as_ptr()); let left = vld1q_f64([5.0, 7.0].as_ptr()); let res = mul_complex_f64(left, right); let expected = vld1q_f64([1.0 * 5.0 - 2.0 * 7.0, 1.0 * 7.0 + 2.0 * 5.0].as_ptr()); assert_eq!( std::mem::transmute::>(res), std::mem::transmute::>(expected) ); } } #[test] fn test_mul_complex_f32() { unsafe { let val1 = Complex::::new(1.0, 2.5); let val2 = Complex::::new(3.2, 4.75); let val3 = Complex::::new(5.75, 6.25); let val4 = Complex::::new(7.4, 8.5); let nbr2 = vld1q_f32([val3, val4].as_ptr() as *const f32); let nbr1 = vld1q_f32([val1, val2].as_ptr() as *const f32); let res = mul_complex_f32(nbr1, nbr2); let res = std::mem::transmute::; 2]>(res); let expected = [val1 * val3, val2 * val4]; assert_eq!(res, expected); } } #[test] fn test_pack() { unsafe { let nbr2 = vld1q_f32([5.0, 6.0, 7.0, 8.0].as_ptr()); let nbr1 = vld1q_f32([1.0, 2.0, 3.0, 4.0].as_ptr()); let first = extract_lo_lo_f32(nbr1, nbr2); let second = extract_hi_hi_f32(nbr1, nbr2); let first = std::mem::transmute::; 2]>(first); let second = std::mem::transmute::; 2]>(second); let first_expected = [Complex::new(1.0, 2.0), Complex::new(5.0, 6.0)]; let second_expected = [Complex::new(3.0, 4.0), Complex::new(7.0, 8.0)]; assert_eq!(first, first_expected); assert_eq!(second, second_expected); } } } rustfft-6.2.0/src/neon/neon_vector.rs000064400000000000000000000333770072674642500160100ustar 00000000000000use core::arch::aarch64::*; use num_complex::Complex; use std::ops::{Deref, DerefMut}; use crate::array_utils::DoubleBuf; // Read these indexes from an NeonArray and build an array of simd vectors. // Takes a name of a vector to read from, and a list of indexes to read. // This statement: // ``` // let values = read_complex_to_array!(input, {0, 1, 2, 3}); // ``` // is equivalent to: // ``` // let values = [ // input.load_complex(0), // input.load_complex(1), // input.load_complex(2), // input.load_complex(3), // ]; // ``` macro_rules! 
read_complex_to_array { ($input:ident, { $($idx:literal),* }) => { [ $( $input.load_complex($idx), )* ] } } // Read these indexes from an NeonArray and build an array or partially filled simd vectors. // Takes a name of a vector to read from, and a list of indexes to read. // This statement: // ``` // let values = read_partial1_complex_to_array!(input, {0, 1, 2, 3}); // ``` // is equivalent to: // ``` // let values = [ // input.load1_complex(0), // input.load1_complex(1), // input.load1_complex(2), // input.load1_complex(3), // ]; // ``` macro_rules! read_partial1_complex_to_array { ($input:ident, { $($idx:literal),* }) => { [ $( $input.load1_complex($idx), )* ] } } // Write these indexes of an array of simd vectors to the same indexes of an NeonArray. // Takes a name of a vector to read from, one to write to, and a list of indexes. // This statement: // ``` // let values = write_complex_to_array!(input, output, {0, 1, 2, 3}); // ``` // is equivalent to: // ``` // let values = [ // output.store_complex(input[0], 0), // output.store_complex(input[1], 1), // output.store_complex(input[2], 2), // output.store_complex(input[3], 3), // ]; // ``` macro_rules! write_complex_to_array { ($input:ident, $output:ident, { $($idx:literal),* }) => { $( $output.store_complex($input[$idx], $idx); )* } } // Write the low half of these indexes of an array of simd vectors to the same indexes of an NeonArray. // Takes a name of a vector to read from, one to write to, and a list of indexes. // This statement: // ``` // let values = write_partial_lo_complex_to_array!(input, output, {0, 1, 2, 3}); // ``` // is equivalent to: // ``` // let values = [ // output.store_partial_lo_complex(input[0], 0), // output.store_partial_lo_complex(input[1], 1), // output.store_partial_lo_complex(input[2], 2), // output.store_partial_lo_complex(input[3], 3), // ]; // ``` macro_rules! write_partial_lo_complex_to_array { ($input:ident, $output:ident, { $($idx:literal),* }) => { $( $output.store_partial_lo_complex($input[$idx], $idx); )* } } // Write these indexes of an array of simd vectors to the same indexes, multiplied by a stride, of an NeonArray. // Takes a name of a vector to read from, one to write to, an integer stride, and a list of indexes. // This statement: // ``` // let values = write_complex_to_array_separate!(input, output, {0, 1, 2, 3}); // ``` // is equivalent to: // ``` // let values = [ // output.store_complex(input[0], 0), // output.store_complex(input[1], 2), // output.store_complex(input[2], 4), // output.store_complex(input[3], 6), // ]; // ``` macro_rules! write_complex_to_array_strided { ($input:ident, $output:ident, $stride:literal, { $($idx:literal),* }) => { $( $output.store_complex($input[$idx], $idx*$stride); )* } } pub trait NeonNum { type VectorType; const COMPLEX_PER_VECTOR: usize; } impl NeonNum for f32 { type VectorType = float32x4_t; const COMPLEX_PER_VECTOR: usize = 2; } impl NeonNum for f64 { type VectorType = float64x2_t; const COMPLEX_PER_VECTOR: usize = 1; } // A trait to handle reading from an array of complex floats into Neon vectors. // Neon works with 128-bit vectors, meaning a vector can hold two complex f32, // or a single complex f64. pub trait NeonArray: Deref { // Load complex numbers from the array to fill a Neon vector. unsafe fn load_complex(&self, index: usize) -> T::VectorType; // Load a single complex number from the array into a Neon vector, setting the unused elements to zero. 
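// (This partial load is only meaningful for f32, where a 128-bit vector holds two complex
// values and the high two lanes are zeroed; the f64 implementations further down call
// unimplemented!(), since a float64x2_t already holds exactly one complex value.)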
unsafe fn load_partial1_complex(&self, index: usize) -> T::VectorType; // Load a single complex number from the array, and copy it to all elements of a Neon vector. unsafe fn load1_complex(&self, index: usize) -> T::VectorType; } impl NeonArray for &[Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); vld1q_f32(self.as_ptr().add(index) as *const f32) } #[inline(always)] unsafe fn load_partial1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); let temp = vmovq_n_f32(0.0); vreinterpretq_f32_u64(vld1q_lane_u64::<0>( self.as_ptr().add(index) as *const u64, vreinterpretq_u64_f32(temp), )) } #[inline(always)] unsafe fn load1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); vreinterpretq_f32_u64(vld1q_dup_u64(self.as_ptr().add(index) as *const u64)) } } impl NeonArray for &mut [Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); vld1q_f32(self.as_ptr().add(index) as *const f32) } #[inline(always)] unsafe fn load_partial1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); let temp = vmovq_n_f32(0.0); vreinterpretq_f32_u64(vld1q_lane_u64::<0>( self.as_ptr().add(index) as *const u64, vreinterpretq_u64_f32(temp), )) } #[inline(always)] unsafe fn load1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); vreinterpretq_f32_u64(vld1q_dup_u64(self.as_ptr().add(index) as *const u64)) } } impl NeonArray for &[Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); vld1q_f64(self.as_ptr().add(index) as *const f64) } #[inline(always)] unsafe fn load_partial1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } #[inline(always)] unsafe fn load1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } } impl NeonArray for &mut [Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); vld1q_f64(self.as_ptr().add(index) as *const f64) } #[inline(always)] unsafe fn load_partial1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } #[inline(always)] unsafe fn load1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } } impl<'a, T: NeonNum> NeonArray for DoubleBuf<'a, T> where &'a [Complex]: NeonArray, { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> T::VectorType { self.input.load_complex(index) } #[inline(always)] unsafe fn load_partial1_complex(&self, index: usize) -> T::VectorType { self.input.load_partial1_complex(index) } #[inline(always)] unsafe fn load1_complex(&self, index: usize) -> T::VectorType { self.input.load1_complex(index) } } // A trait to handle writing to an array of complex floats from Neon vectors. // Neon works with 128-bit vectors, meaning a vector can hold two complex f32, // or a single complex f64. pub trait NeonArrayMut: NeonArray + DerefMut { // Store all complex numbers from a Neon vector to the array. 
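// (For f32 this writes two complex values starting at `index`, for f64 a single one;
// as with the partial loads, the partial stores are only implemented for f32.)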
unsafe fn store_complex(&mut self, vector: T::VectorType, index: usize); // Store the low complex number from a Neon vector to the array. unsafe fn store_partial_lo_complex(&mut self, vector: T::VectorType, index: usize); // Store the high complex number from a Neon vector to the array. unsafe fn store_partial_hi_complex(&mut self, vector: T::VectorType, index: usize); } impl NeonArrayMut for &mut [Complex] { #[inline(always)] unsafe fn store_complex(&mut self, vector: ::VectorType, index: usize) { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); vst1q_f32(self.as_mut_ptr().add(index) as *mut f32, vector); } #[inline(always)] unsafe fn store_partial_hi_complex( &mut self, vector: ::VectorType, index: usize, ) { debug_assert!(self.len() >= index + 1); let high = vget_high_f32(vector); vst1_f32(self.as_mut_ptr().add(index) as *mut f32, high); } #[inline(always)] unsafe fn store_partial_lo_complex( &mut self, vector: ::VectorType, index: usize, ) { debug_assert!(self.len() >= index + 1); let low = vget_low_f32(vector); vst1_f32(self.as_mut_ptr().add(index) as *mut f32, low); } } impl NeonArrayMut for &mut [Complex] { #[inline(always)] unsafe fn store_complex(&mut self, vector: ::VectorType, index: usize) { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); vst1q_f64(self.as_mut_ptr().add(index) as *mut f64, vector); } #[inline(always)] unsafe fn store_partial_hi_complex( &mut self, _vector: ::VectorType, _index: usize, ) { unimplemented!("Impossible to do a partial store of complex f64's"); } #[inline(always)] unsafe fn store_partial_lo_complex( &mut self, _vector: ::VectorType, _index: usize, ) { unimplemented!("Impossible to do a partial store of complex f64's"); } } impl<'a, T: NeonNum> NeonArrayMut for DoubleBuf<'a, T> where Self: NeonArray, &'a mut [Complex]: NeonArrayMut, { #[inline(always)] unsafe fn store_complex(&mut self, vector: T::VectorType, index: usize) { self.output.store_complex(vector, index); } #[inline(always)] unsafe fn store_partial_hi_complex(&mut self, vector: T::VectorType, index: usize) { self.output.store_partial_hi_complex(vector, index); } #[inline(always)] unsafe fn store_partial_lo_complex(&mut self, vector: T::VectorType, index: usize) { self.output.store_partial_lo_complex(vector, index); } } #[cfg(test)] mod unit_tests { use super::*; use num_complex::Complex; #[test] fn test_load_f64() { unsafe { let val1: Complex = Complex::new(1.0, 2.0); let val2: Complex = Complex::new(3.0, 4.0); let val3: Complex = Complex::new(5.0, 6.0); let val4: Complex = Complex::new(7.0, 8.0); let values = vec![val1, val2, val3, val4]; let slice = values.as_slice(); let load1 = slice.load_complex(0); let load2 = slice.load_complex(1); let load3 = slice.load_complex(2); let load4 = slice.load_complex(3); assert_eq!( val1, std::mem::transmute::>(load1) ); assert_eq!( val2, std::mem::transmute::>(load2) ); assert_eq!( val3, std::mem::transmute::>(load3) ); assert_eq!( val4, std::mem::transmute::>(load4) ); } } #[test] fn test_store_f64() { unsafe { let val1: Complex = Complex::new(1.0, 2.0); let val2: Complex = Complex::new(3.0, 4.0); let val3: Complex = Complex::new(5.0, 6.0); let val4: Complex = Complex::new(7.0, 8.0); let nbr1 = vld1q_f64(&val1 as *const _ as *const f64); let nbr2 = vld1q_f64(&val2 as *const _ as *const f64); let nbr3 = vld1q_f64(&val3 as *const _ as *const f64); let nbr4 = vld1q_f64(&val4 as *const _ as *const f64); let mut values: Vec> = vec![Complex::new(0.0, 0.0); 4]; let mut slice = values.as_mut_slice(); slice.store_complex(nbr1, 0); 
slice.store_complex(nbr2, 1); slice.store_complex(nbr3, 2); slice.store_complex(nbr4, 3); assert_eq!(val1, values[0]); assert_eq!(val2, values[1]); assert_eq!(val3, values[2]); assert_eq!(val4, values[3]); } } } rustfft-6.2.0/src/plan.rs000064400000000000000000000763610072674642500134620ustar 00000000000000use num_integer::gcd; use std::collections::HashMap; use std::sync::Arc; use crate::wasm_simd::wasm_simd_planner::FftPlannerWasmSimd; use crate::{common::FftNum, fft_cache::FftCache, FftDirection}; use crate::algorithm::butterflies::*; use crate::algorithm::*; use crate::Fft; use crate::FftPlannerAvx; use crate::FftPlannerNeon; use crate::FftPlannerSse; use crate::math_utils::{PrimeFactor, PrimeFactors}; enum ChosenFftPlanner { Scalar(FftPlannerScalar), Avx(FftPlannerAvx), Sse(FftPlannerSse), Neon(FftPlannerNeon), WasmSimd(FftPlannerWasmSimd), // todo: If we add NEON, avx-512 etc support, add more enum variants for them here } /// The FFT planner creates new FFT algorithm instances. /// /// RustFFT has several FFT algorithms available. For a given FFT size, the `FftPlanner` decides which of the /// available FFT algorithms to use and then initializes them. /// /// ~~~ /// // Perform a forward Fft of size 1234 /// use std::sync::Arc; /// use rustfft::{FftPlanner, num_complex::Complex}; /// /// let mut planner = FftPlanner::new(); /// let fft = planner.plan_fft_forward(1234); /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234]; /// fft.process(&mut buffer); /// /// // The FFT instance returned by the planner has the type `Arc>`, /// // where T is the numeric type, ie f32 or f64, so it's cheap to clone /// let fft_clone = Arc::clone(&fft); /// ~~~ /// /// If you plan on creating multiple FFT instances, it is recommended to reuse the same planner for all of them. This /// is because the planner re-uses internal data across FFT instances wherever possible, saving memory and reducing /// setup time. (FFT instances created with one planner will never re-use data and buffers with FFT instances created /// by a different planner) /// /// Each FFT instance owns [`Arc`s](std::sync::Arc) to its internal data, rather than borrowing it from the planner, so it's perfectly /// safe to drop the planner after creating Fft instances. /// /// In the constructor, the FftPlanner will detect available CPU features. If AVX, SSE, Neon, or WASM SIMD are available, it will set itself up to plan FFTs with the fastest available instruction set. /// If no SIMD instruction sets are available, the planner will seamlessly fall back to planning non-SIMD FFTs. /// /// If you'd prefer not to compute a FFT at all if a certain SIMD instruction set isn't available, or otherwise specify your own custom fallback, RustFFT exposes dedicated planners for each instruction set: /// - [`FftPlannerAvx`](crate::FftPlannerAvx) /// - [`FftPlannerSse`](crate::FftPlannerSse) /// - [`FftPlannerNeon`](crate::FftPlannerNeon) /// - [`FftPlannerWasmSimd`](crate::FftPlannerWasmSimd) /// /// If you'd prefer to opt out of SIMD algorithms, consider creating a [`FftPlannerScalar`](crate::FftPlannerScalar) instead. pub struct FftPlanner { chosen_planner: ChosenFftPlanner, } impl FftPlanner { /// Creates a new `FftPlanner` instance. 
pub fn new() -> Self { if let Ok(avx_planner) = FftPlannerAvx::new() { Self { chosen_planner: ChosenFftPlanner::Avx(avx_planner), } } else if let Ok(sse_planner) = FftPlannerSse::new() { Self { chosen_planner: ChosenFftPlanner::Sse(sse_planner), } } else if let Ok(neon_planner) = FftPlannerNeon::new() { Self { chosen_planner: ChosenFftPlanner::Neon(neon_planner), } } else if let Ok(wasm_simd_planner) = FftPlannerWasmSimd::new() { Self { chosen_planner: ChosenFftPlanner::WasmSimd(wasm_simd_planner), } } else { Self { chosen_planner: ChosenFftPlanner::Scalar(FftPlannerScalar::new()), } } } /// Returns a `Fft` instance which computes FFTs of size `len`. /// /// If the provided `direction` is `FftDirection::Forward`, the returned instance will compute forward FFTs. If it's `FftDirection::Inverse`, it will compute inverse FFTs. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft(&mut self, len: usize, direction: FftDirection) -> Arc> { match &mut self.chosen_planner { ChosenFftPlanner::Scalar(scalar_planner) => scalar_planner.plan_fft(len, direction), ChosenFftPlanner::Avx(avx_planner) => avx_planner.plan_fft(len, direction), ChosenFftPlanner::Sse(sse_planner) => sse_planner.plan_fft(len, direction), ChosenFftPlanner::Neon(neon_planner) => neon_planner.plan_fft(len, direction), ChosenFftPlanner::WasmSimd(wasm_simd_planner) => { wasm_simd_planner.plan_fft(len, direction) } } } /// Returns a `Fft` instance which computes forward FFTs of size `len` /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_forward(&mut self, len: usize) -> Arc> { self.plan_fft(len, FftDirection::Forward) } /// Returns a `Fft` instance which computes inverse FFTs of size `len` /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_inverse(&mut self, len: usize) -> Arc> { self.plan_fft(len, FftDirection::Inverse) } } const MIN_RADIX4_BITS: u32 = 5; // smallest size to consider radix 4 an option is 2^5 = 32 const MIN_RADIX3_FACTORS: u32 = 4; // smallest number of factors of 3 to consider radix 4 an option is 3^4=81. any smaller and we want to use butterflies directly. const MAX_RADER_PRIME_FACTOR: usize = 23; // don't use Raders if the inner fft length has prime factor larger than this const MIN_BLUESTEIN_MIXED_RADIX_LEN: usize = 90; // only use mixed radix for the inner fft of Bluestein if length is larger than this /// A Recipe is a structure that describes the design of a FFT, without actually creating it. /// It is used as a middle step in the planning process. 
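/// For example, the scalar planner typically turns a length-12 request into a
/// `GoodThomasAlgorithmSmall` recipe built from the `Butterfly3` and `Butterfly4` recipes,
/// and a length-1024 request into a single `Radix4(1024)` recipe; both cases are exercised
/// by the planner unit tests at the bottom of this file.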
#[derive(Debug, PartialEq, Clone)] pub enum Recipe { Dft(usize), MixedRadix { left_fft: Arc, right_fft: Arc, }, #[allow(dead_code)] GoodThomasAlgorithm { left_fft: Arc, right_fft: Arc, }, MixedRadixSmall { left_fft: Arc, right_fft: Arc, }, GoodThomasAlgorithmSmall { left_fft: Arc, right_fft: Arc, }, RadersAlgorithm { inner_fft: Arc, }, BluesteinsAlgorithm { len: usize, inner_fft: Arc, }, Radix3(usize), Radix4(usize), Butterfly2, Butterfly3, Butterfly4, Butterfly5, Butterfly6, Butterfly7, Butterfly8, Butterfly9, Butterfly11, Butterfly13, Butterfly16, Butterfly17, Butterfly19, Butterfly23, Butterfly27, Butterfly29, Butterfly31, Butterfly32, } impl Recipe { pub fn len(&self) -> usize { match self { Recipe::Dft(length) => *length, Recipe::Radix3(length) => *length, Recipe::Radix4(length) => *length, Recipe::Butterfly2 => 2, Recipe::Butterfly3 => 3, Recipe::Butterfly4 => 4, Recipe::Butterfly5 => 5, Recipe::Butterfly6 => 6, Recipe::Butterfly7 => 7, Recipe::Butterfly8 => 8, Recipe::Butterfly9 => 9, Recipe::Butterfly11 => 11, Recipe::Butterfly13 => 13, Recipe::Butterfly16 => 16, Recipe::Butterfly17 => 17, Recipe::Butterfly19 => 19, Recipe::Butterfly23 => 23, Recipe::Butterfly27 => 27, Recipe::Butterfly29 => 29, Recipe::Butterfly31 => 31, Recipe::Butterfly32 => 32, Recipe::MixedRadix { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::GoodThomasAlgorithm { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::MixedRadixSmall { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::RadersAlgorithm { inner_fft } => inner_fft.len() + 1, Recipe::BluesteinsAlgorithm { len, .. } => *len, } } } /// The Scalar FFT planner creates new FFT algorithm instances using non-SIMD algorithms. /// /// RustFFT has several FFT algorithms available. For a given FFT size, the `FftPlannerScalar` decides which of the /// available FFT algorithms to use and then initializes them. /// /// Use `FftPlannerScalar` instead of [`FftPlanner`](crate::FftPlanner) or [`FftPlannerAvx`](crate::FftPlannerAvx) when you want to explicitly opt out of using any SIMD-accelerated algorithms. /// /// ~~~ /// // Perform a forward Fft of size 1234 /// use std::sync::Arc; /// use rustfft::{FftPlannerScalar, num_complex::Complex}; /// /// let mut planner = FftPlannerScalar::new(); /// let fft = planner.plan_fft_forward(1234); /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234]; /// fft.process(&mut buffer); /// /// // The FFT instance returned by the planner has the type `Arc>`, /// // where T is the numeric type, ie f32 or f64, so it's cheap to clone /// let fft_clone = Arc::clone(&fft); /// ~~~ /// /// If you plan on creating multiple FFT instances, it is recommended to reuse the same planner for all of them. This /// is because the planner re-uses internal data across FFT instances wherever possible, saving memory and reducing /// setup time. (FFT instances created with one planner will never re-use data and buffers with FFT instances created /// by a different planner) /// /// Each FFT instance owns [`Arc`s](std::sync::Arc) to its internal data, rather than borrowing it from the planner, so it's perfectly /// safe to drop the planner after creating Fft instances. pub struct FftPlannerScalar { algorithm_cache: FftCache, recipe_cache: HashMap>, } impl FftPlannerScalar { /// Creates a new `FftPlannerScalar` instance. 
pub fn new() -> Self { Self { algorithm_cache: FftCache::new(), recipe_cache: HashMap::new(), } } /// Returns a `Fft` instance which computes FFTs of size `len`. /// /// If the provided `direction` is `FftDirection::Forward`, the returned instance will compute forward FFTs. If it's `FftDirection::Inverse`, it will compute inverse FFTs. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft(&mut self, len: usize, direction: FftDirection) -> Arc> { // Step 1: Create a "recipe" for this FFT, which will tell us exactly which combination of algorithms to use let recipe = self.design_fft_for_len(len); // Step 2: Use our recipe to construct a Fft trait object self.build_fft(&recipe, direction) } /// Returns a `Fft` instance which computes forward FFTs of size `len` /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_forward(&mut self, len: usize) -> Arc> { self.plan_fft(len, FftDirection::Forward) } /// Returns a `Fft` instance which computes inverse FFTs of size `len` /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_inverse(&mut self, len: usize) -> Arc> { self.plan_fft(len, FftDirection::Inverse) } // Make a recipe for a length fn design_fft_for_len(&mut self, len: usize) -> Arc { if len < 2 { Arc::new(Recipe::Dft(len)) } else if let Some(recipe) = self.recipe_cache.get(&len) { Arc::clone(&recipe) } else { let factors = PrimeFactors::compute(len); let recipe = self.design_fft_with_factors(len, factors); self.recipe_cache.insert(len, Arc::clone(&recipe)); recipe } } // Create the fft from a recipe, take from cache if possible fn build_fft(&mut self, recipe: &Recipe, direction: FftDirection) -> Arc> { let len = recipe.len(); if let Some(instance) = self.algorithm_cache.get(len, direction) { instance } else { let fft = self.build_new_fft(recipe, direction); self.algorithm_cache.insert(&fft); fft } } // Create a new fft from a recipe fn build_new_fft(&mut self, recipe: &Recipe, direction: FftDirection) -> Arc> { match recipe { Recipe::Dft(len) => Arc::new(Dft::new(*len, direction)) as Arc>, Recipe::Radix3(len) => Arc::new(Radix3::new(*len, direction)) as Arc>, Recipe::Radix4(len) => Arc::new(Radix4::new(*len, direction)) as Arc>, Recipe::Butterfly2 => Arc::new(Butterfly2::new(direction)) as Arc>, Recipe::Butterfly3 => Arc::new(Butterfly3::new(direction)) as Arc>, Recipe::Butterfly4 => Arc::new(Butterfly4::new(direction)) as Arc>, Recipe::Butterfly5 => Arc::new(Butterfly5::new(direction)) as Arc>, Recipe::Butterfly6 => Arc::new(Butterfly6::new(direction)) as Arc>, Recipe::Butterfly7 => Arc::new(Butterfly7::new(direction)) as Arc>, Recipe::Butterfly8 => Arc::new(Butterfly8::new(direction)) as Arc>, Recipe::Butterfly9 => Arc::new(Butterfly9::new(direction)) as Arc>, Recipe::Butterfly11 => Arc::new(Butterfly11::new(direction)) as Arc>, Recipe::Butterfly13 => Arc::new(Butterfly13::new(direction)) as Arc>, Recipe::Butterfly16 => Arc::new(Butterfly16::new(direction)) as Arc>, Recipe::Butterfly17 => Arc::new(Butterfly17::new(direction)) as Arc>, Recipe::Butterfly19 => Arc::new(Butterfly19::new(direction)) as Arc>, Recipe::Butterfly23 => Arc::new(Butterfly23::new(direction)) as Arc>, Recipe::Butterfly27 => Arc::new(Butterfly27::new(direction)) as 
Arc>, Recipe::Butterfly29 => Arc::new(Butterfly29::new(direction)) as Arc>, Recipe::Butterfly31 => Arc::new(Butterfly31::new(direction)) as Arc>, Recipe::Butterfly32 => Arc::new(Butterfly32::new(direction)) as Arc>, Recipe::MixedRadix { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(MixedRadix::new(left_fft, right_fft)) as Arc> } Recipe::GoodThomasAlgorithm { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(GoodThomasAlgorithm::new(left_fft, right_fft)) as Arc> } Recipe::MixedRadixSmall { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(MixedRadixSmall::new(left_fft, right_fft)) as Arc> } Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(GoodThomasAlgorithmSmall::new(left_fft, right_fft)) as Arc> } Recipe::RadersAlgorithm { inner_fft } => { let inner_fft = self.build_fft(&inner_fft, direction); Arc::new(RadersAlgorithm::new(inner_fft)) as Arc> } Recipe::BluesteinsAlgorithm { len, inner_fft } => { let inner_fft = self.build_fft(&inner_fft, direction); Arc::new(BluesteinsAlgorithm::new(*len, inner_fft)) as Arc> } } } fn design_fft_with_factors(&mut self, len: usize, factors: PrimeFactors) -> Arc { if let Some(fft_instance) = self.design_butterfly_algorithm(len) { fft_instance } else if factors.is_prime() { self.design_prime(len) } else if len.trailing_zeros() >= MIN_RADIX4_BITS { if len.is_power_of_two() { Arc::new(Recipe::Radix4(len)) } else { let non_power_of_two = factors .remove_factors(PrimeFactor { value: 2, count: len.trailing_zeros(), }) .unwrap(); let power_of_two = PrimeFactors::compute(1 << len.trailing_zeros()); self.design_mixed_radix(power_of_two, non_power_of_two) } } else if factors.get_power_of_three() >= MIN_RADIX3_FACTORS { if factors.is_power_of_three() { Arc::new(Recipe::Radix3(len)) } else { let power3 = factors.get_power_of_three(); let non_power_of_three = factors .remove_factors(PrimeFactor { value: 3, count: power3, }) .unwrap(); let power_of_three = PrimeFactors::compute(3usize.pow(power3)); self.design_mixed_radix(power_of_three, non_power_of_three) } } else { let (left_factors, right_factors) = factors.partition_factors(); self.design_mixed_radix(left_factors, right_factors) } } fn design_mixed_radix( &mut self, left_factors: PrimeFactors, right_factors: PrimeFactors, ) -> Arc { let left_len = left_factors.get_product(); let right_len = right_factors.get_product(); //neither size is a butterfly, so go with the normal algorithm let left_fft = self.design_fft_with_factors(left_len, left_factors); let right_fft = self.design_fft_with_factors(right_len, right_factors); //if both left_len and right_len are small, use algorithms optimized for small FFTs if left_len < 31 && right_len < 31 { // for small FFTs, if gcd is 1, good-thomas is faster if gcd(left_len, right_len) == 1 { Arc::new(Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, }) } else { Arc::new(Recipe::MixedRadixSmall { left_fft, right_fft, }) } } else { Arc::new(Recipe::MixedRadix { left_fft, right_fft, }) } } // Returns Some(instance) if we have a butterfly available for this size. 
Returns None if there is no butterfly available for this size fn design_butterfly_algorithm(&mut self, len: usize) -> Option> { match len { 2 => Some(Arc::new(Recipe::Butterfly2)), 3 => Some(Arc::new(Recipe::Butterfly3)), 4 => Some(Arc::new(Recipe::Butterfly4)), 5 => Some(Arc::new(Recipe::Butterfly5)), 6 => Some(Arc::new(Recipe::Butterfly6)), 7 => Some(Arc::new(Recipe::Butterfly7)), 8 => Some(Arc::new(Recipe::Butterfly8)), 9 => Some(Arc::new(Recipe::Butterfly9)), 11 => Some(Arc::new(Recipe::Butterfly11)), 13 => Some(Arc::new(Recipe::Butterfly13)), 16 => Some(Arc::new(Recipe::Butterfly16)), 17 => Some(Arc::new(Recipe::Butterfly17)), 19 => Some(Arc::new(Recipe::Butterfly19)), 23 => Some(Arc::new(Recipe::Butterfly23)), 27 => Some(Arc::new(Recipe::Butterfly27)), 29 => Some(Arc::new(Recipe::Butterfly29)), 31 => Some(Arc::new(Recipe::Butterfly31)), 32 => Some(Arc::new(Recipe::Butterfly32)), _ => None, } } fn design_prime(&mut self, len: usize) -> Arc { let inner_fft_len_rader = len - 1; let raders_factors = PrimeFactors::compute(inner_fft_len_rader); // If any of the prime factors is too large, Rader's gets slow and Bluestein's is the better choice if raders_factors .get_other_factors() .iter() .any(|val| val.value > MAX_RADER_PRIME_FACTOR) { let inner_fft_len_pow2 = (2 * len - 1).checked_next_power_of_two().unwrap(); // for long ffts a mixed radix inner fft is faster than a longer radix4 let min_inner_len = 2 * len - 1; let mixed_radix_len = 3 * inner_fft_len_pow2 / 4; let inner_fft = if mixed_radix_len >= min_inner_len && len >= MIN_BLUESTEIN_MIXED_RADIX_LEN { let mixed_radix_factors = PrimeFactors::compute(mixed_radix_len); self.design_fft_with_factors(mixed_radix_len, mixed_radix_factors) } else { Arc::new(Recipe::Radix4(inner_fft_len_pow2)) }; Arc::new(Recipe::BluesteinsAlgorithm { len, inner_fft }) } else { let inner_fft = self.design_fft_with_factors(inner_fft_len_rader, raders_factors); Arc::new(Recipe::RadersAlgorithm { inner_fft }) } } } #[cfg(test)] mod unit_tests { use super::*; fn is_mixedradix(plan: &Recipe) -> bool { match plan { &Recipe::MixedRadix { .. } => true, _ => false, } } fn is_mixedradixsmall(plan: &Recipe) -> bool { match plan { &Recipe::MixedRadixSmall { .. } => true, _ => false, } } fn is_goodthomassmall(plan: &Recipe) -> bool { match plan { &Recipe::GoodThomasAlgorithmSmall { .. } => true, _ => false, } } fn is_raders(plan: &Recipe) -> bool { match plan { &Recipe::RadersAlgorithm { .. } => true, _ => false, } } fn is_bluesteins(plan: &Recipe) -> bool { match plan { &Recipe::BluesteinsAlgorithm { .. 
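// Worked examples of how the recipe design above plays out (a sketch: the
// cutoff constants MIN_RADIX4_BITS, MIN_RADIX3_FACTORS, MAX_RADER_PRIME_FACTOR
// and MIN_BLUESTEIN_MIXED_RADIX_LEN are defined earlier in this file and are
// not shown in this excerpt, so the branch outcomes quoted here are the ones
// pinned down by the unit tests at the bottom of this module):
//
// * len = 15 = 3 * 5: both halves are small (< 31) and gcd(3, 5) == 1, so
//   design_mixed_radix emits Recipe::GoodThomasAlgorithmSmall.
// * len = 100 = 5 * 20: the two halves the planner picks share a common
//   factor, so it emits Recipe::MixedRadixSmall instead.
// * len = 53 (prime): 53 - 1 = 52 = 2^2 * 13 has only small factors, so
//   design_prime plans Rader's algorithm around a size-52 inner FFT.
// * len = 59 (prime): 59 - 1 = 58 = 2 * 29, and 29 is over the Rader cutoff,
//   so Bluestein's algorithm is used. Its inner FFT needs at least
//   2 * 59 - 1 = 117 points; the next power of two is 128 and the mixed-radix
//   candidate 3 * 128 / 4 = 96 is too short, so the inner FFT is Radix4(128).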
} => true, _ => false, } } #[test] fn test_plan_scalar_trivial() { // Length 0 and 1 should use Dft let mut planner = FftPlannerScalar::::new(); for len in 0..2 { let plan = planner.design_fft_for_len(len); assert_eq!(*plan, Recipe::Dft(len)); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } #[test] fn test_plan_scalar_largepoweroftwo() { // Powers of 2 above 64 should use Radix4 let mut planner = FftPlannerScalar::::new(); for pow in 6..32 { let len = 1 << pow; let plan = planner.design_fft_for_len(len); assert_eq!(*plan, Recipe::Radix4(len)); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } #[test] fn test_plan_scalar_butterflies() { // Check that all butterflies are used let mut planner = FftPlannerScalar::::new(); assert_eq!(*planner.design_fft_for_len(2), Recipe::Butterfly2); assert_eq!(*planner.design_fft_for_len(3), Recipe::Butterfly3); assert_eq!(*planner.design_fft_for_len(4), Recipe::Butterfly4); assert_eq!(*planner.design_fft_for_len(5), Recipe::Butterfly5); assert_eq!(*planner.design_fft_for_len(6), Recipe::Butterfly6); assert_eq!(*planner.design_fft_for_len(7), Recipe::Butterfly7); assert_eq!(*planner.design_fft_for_len(8), Recipe::Butterfly8); assert_eq!(*planner.design_fft_for_len(11), Recipe::Butterfly11); assert_eq!(*planner.design_fft_for_len(13), Recipe::Butterfly13); assert_eq!(*planner.design_fft_for_len(16), Recipe::Butterfly16); assert_eq!(*planner.design_fft_for_len(17), Recipe::Butterfly17); assert_eq!(*planner.design_fft_for_len(19), Recipe::Butterfly19); assert_eq!(*planner.design_fft_for_len(23), Recipe::Butterfly23); assert_eq!(*planner.design_fft_for_len(29), Recipe::Butterfly29); assert_eq!(*planner.design_fft_for_len(31), Recipe::Butterfly31); assert_eq!(*planner.design_fft_for_len(32), Recipe::Butterfly32); } #[test] fn test_plan_scalar_mixedradix() { // Products of several different primes should become MixedRadix let mut planner = FftPlannerScalar::::new(); for pow2 in 2..5 { for pow3 in 2..5 { for pow5 in 2..5 { for pow7 in 2..5 { let len = 2usize.pow(pow2) * 3usize.pow(pow3) * 5usize.pow(pow5) * 7usize.pow(pow7); let plan = planner.design_fft_for_len(len); assert!(is_mixedradix(&plan), "Expected MixedRadix, got {:?}", plan); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } } } } #[test] fn test_plan_scalar_mixedradixsmall() { // Products of two "small" lengths < 31 that have a common divisor >1, and isn't a power of 2 should be MixedRadixSmall let mut planner = FftPlannerScalar::::new(); for len in [5 * 20, 5 * 25].iter() { let plan = planner.design_fft_for_len(*len); assert!( is_mixedradixsmall(&plan), "Expected MixedRadixSmall, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[test] fn test_plan_scalar_goodthomasbutterfly() { let mut planner = FftPlannerScalar::::new(); for len in [3 * 4, 3 * 5, 3 * 7, 5 * 7, 11 * 13].iter() { let plan = planner.design_fft_for_len(*len); assert!( is_goodthomassmall(&plan), "Expected GoodThomasAlgorithmSmall, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[test] fn test_plan_scalar_bluestein_vs_rader() { let difficultprimes: [usize; 11] = [59, 83, 107, 149, 167, 173, 179, 359, 719, 1439, 2879]; let easyprimes: [usize; 24] = [ 53, 61, 67, 71, 73, 79, 89, 97, 101, 103, 109, 113, 127, 131, 137, 139, 151, 157, 163, 181, 191, 193, 197, 199, ]; let mut planner = FftPlannerScalar::::new(); for len in difficultprimes.iter() { let plan = planner.design_fft_for_len(*len); assert!( is_bluesteins(&plan), "Expected 
BluesteinsAlgorithm, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } for len in easyprimes.iter() { let plan = planner.design_fft_for_len(*len); assert!(is_raders(&plan), "Expected RadersAlgorithm, got {:?}", plan); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[test] fn test_scalar_fft_cache() { { // Check that FFTs are reused if they're both forward let mut planner = FftPlannerScalar::::new(); let fft_a = planner.plan_fft(1234, FftDirection::Forward); let fft_b = planner.plan_fft(1234, FftDirection::Forward); assert!(Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was not reused"); } { // Check that FFTs are reused if they're both inverse let mut planner = FftPlannerScalar::::new(); let fft_a = planner.plan_fft(1234, FftDirection::Inverse); let fft_b = planner.plan_fft(1234, FftDirection::Inverse); assert!(Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was not reused"); } { // Check that FFTs are NOT resued if they don't both have the same direction let mut planner = FftPlannerScalar::::new(); let fft_a = planner.plan_fft(1234, FftDirection::Forward); let fft_b = planner.plan_fft(1234, FftDirection::Inverse); assert!( !Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was reused, even though directions don't match" ); } } #[test] fn test_scalar_recipe_cache() { // Check that all butterflies are used let mut planner = FftPlannerScalar::::new(); let fft_a = planner.design_fft_for_len(1234); let fft_b = planner.design_fft_for_len(1234); assert!( Arc::ptr_eq(&fft_a, &fft_b), "Existing recipe was not reused" ); } // We don't need to actually compute anything for a FFT size of zero, but we do need to verify that it doesn't explode #[test] fn test_plan_zero_scalar() { let mut planner32 = FftPlannerScalar::::new(); let fft_zero32 = planner32.plan_fft_forward(0); fft_zero32.process(&mut []); let mut planner64 = FftPlannerScalar::::new(); let fft_zero64 = planner64.plan_fft_forward(0); fft_zero64.process(&mut []); } // This test is not designed to be run, only to compile. // We cannot make it #[test] since there is a generic parameter. #[allow(dead_code)] fn test_impl_fft_planner_send() { fn is_send() {} is_send::>(); is_send::>(); is_send::>(); is_send::>(); } } rustfft-6.2.0/src/sse/mod.rs000064400000000000000000000004630072674642500140670ustar 00000000000000#[macro_use] mod sse_common; #[macro_use] mod sse_vector; #[macro_use] pub mod sse_butterflies; pub mod sse_prime_butterflies; pub mod sse_radix4; mod sse_utils; pub mod sse_planner; pub use self::sse_butterflies::*; pub use self::sse_prime_butterflies::*; pub use self::sse_radix4::*; rustfft-6.2.0/src/sse/sse_butterflies.rs000064400000000000000000003743260072674642500165260ustar 00000000000000use core::arch::x86_64::*; use num_complex::Complex; use crate::{common::FftNum, FftDirection}; use crate::array_utils; use crate::array_utils::workaround_transmute_mut; use crate::array_utils::DoubleBuf; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::twiddles; use crate::{Direction, Fft, Length}; use super::sse_common::{assert_f32, assert_f64}; use super::sse_utils::*; use super::sse_vector::SseArrayMut; #[allow(unused)] macro_rules! 
boilerplate_fft_sse_f32_butterfly { ($struct_name:ident, $len:expr, $direction_fn:expr) => { impl $struct_name { #[target_feature(enable = "sse4.1")] //#[inline(always)] pub(crate) unsafe fn perform_fft_butterfly(&self, buffer: &mut [Complex]) { self.perform_fft_contiguous(workaround_transmute_mut::<_, Complex>(buffer)); } #[target_feature(enable = "sse4.1")] //#[inline(always)] pub(crate) unsafe fn perform_parallel_fft_butterfly(&self, buffer: &mut [Complex]) { self.perform_parallel_fft_contiguous(workaround_transmute_mut::<_, Complex>( buffer, )); } // Do multiple ffts over a longer vector inplace, called from "process_with_scratch" of Fft trait #[target_feature(enable = "sse4.1")] pub(crate) unsafe fn perform_fft_butterfly_multi( &self, buffer: &mut [Complex], ) -> Result<(), ()> { let len = buffer.len(); let alldone = array_utils::iter_chunks(buffer, 2 * self.len(), |chunk| { self.perform_parallel_fft_butterfly(chunk) }); if alldone.is_err() && buffer.len() >= self.len() { self.perform_fft_butterfly(&mut buffer[len - self.len()..]); } Ok(()) } // Do multiple ffts over a longer vector outofplace, called from "process_outofplace_with_scratch" of Fft trait #[target_feature(enable = "sse4.1")] pub(crate) unsafe fn perform_oop_fft_butterfly_multi( &self, input: &mut [Complex], output: &mut [Complex], ) -> Result<(), ()> { let len = input.len(); let alldone = array_utils::iter_chunks_zipped( input, output, 2 * self.len(), |in_chunk, out_chunk| { let input_slice = workaround_transmute_mut(in_chunk); let output_slice = workaround_transmute_mut(out_chunk); self.perform_parallel_fft_contiguous(DoubleBuf { input: input_slice, output: output_slice, }) }, ); if alldone.is_err() && input.len() >= self.len() { let input_slice = workaround_transmute_mut(input); let output_slice = workaround_transmute_mut(output); self.perform_fft_contiguous(DoubleBuf { input: &mut input_slice[len - self.len()..], output: &mut output_slice[len - self.len()..], }) } Ok(()) } } }; } macro_rules! boilerplate_fft_sse_f64_butterfly { ($struct_name:ident, $len:expr, $direction_fn:expr) => { impl $struct_name { // Do a single fft #[target_feature(enable = "sse4.1")] pub(crate) unsafe fn perform_fft_butterfly(&self, buffer: &mut [Complex]) { self.perform_fft_contiguous(workaround_transmute_mut::<_, Complex>(buffer)); } // Do multiple ffts over a longer vector inplace, called from "process_with_scratch" of Fft trait #[target_feature(enable = "sse4.1")] pub(crate) unsafe fn perform_fft_butterfly_multi( &self, buffer: &mut [Complex], ) -> Result<(), ()> { array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_butterfly(chunk) }) } // Do multiple ffts over a longer vector outofplace, called from "process_outofplace_with_scratch" of Fft trait #[target_feature(enable = "sse4.1")] pub(crate) unsafe fn perform_oop_fft_butterfly_multi( &self, input: &mut [Complex], output: &mut [Complex], ) -> Result<(), ()> { array_utils::iter_chunks_zipped(input, output, self.len(), |in_chunk, out_chunk| { let input_slice = workaround_transmute_mut(in_chunk); let output_slice = workaround_transmute_mut(out_chunk); self.perform_fft_contiguous(DoubleBuf { input: input_slice, output: output_slice, }) }) } } }; } #[allow(unused)] macro_rules! 
boilerplate_fft_sse_common_butterfly { ($struct_name:ident, $len:expr, $direction_fn:expr) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let result = unsafe { self.perform_oop_fft_butterfly_multi(input, output) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], _scratch: &mut [Complex]) { if buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let result = unsafe { self.perform_fft_butterfly_multi(buffer) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { 0 } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { $direction_fn(self) } } }; } // _ _________ _ _ _ // / | |___ /___ \| |__ (_) |_ // | | _____ |_ \ __) | '_ \| | __| // | | |_____| ___) / __/| |_) | | |_ // |_| |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly1 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly1, 1, |this: &SseF32Butterfly1<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly1, 1, |this: &SseF32Butterfly1<_>| this .direction); impl SseF32Butterfly1 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value = buffer.load_partial1_complex(0); buffer.store_partial_lo_complex(value, 0); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value = buffer.load_complex(0); buffer.store_complex(value, 0); } } // _ __ _ _ _ _ _ // / | / /_ | || | | |__ (_) |_ // | | _____ | '_ \| || |_| '_ \| | __| // | | |_____| | (_) |__ _| |_) | | |_ // |_| \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly1 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly1, 1, |this: &SseF64Butterfly1<_>| 
this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly1, 1, |this: &SseF64Butterfly1<_>| this .direction); impl SseF64Butterfly1 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value = buffer.load_complex(0); buffer.store_complex(value, 0); } } // ____ _________ _ _ _ // |___ \ |___ /___ \| |__ (_) |_ // __) | _____ |_ \ __) | '_ \| | __| // / __/ |_____| ___) / __/| |_) | | |_ // |_____| |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly2 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly2, 2, |this: &SseF32Butterfly2<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly2, 2, |this: &SseF32Butterfly2<_>| this .direction); impl SseF32Butterfly2 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = buffer.load_complex(0); let temp = self.perform_fft_direct(values); buffer.store_complex(temp, 0); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values_a = buffer.load_complex(0); let values_b = buffer.load_complex(2); let out = self.perform_parallel_fft_direct(values_a, values_b); let [out02, out13] = transpose_complex_2x2_f32(out[0], out[1]); buffer.store_complex(out02, 0); buffer.store_complex(out13, 2); } // length 2 fft of x, given as [x0, x1] // result is [X0, X1] #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: __m128) -> __m128 { solo_fft2_f32(values) } // dual length 2 fft of x and y, given as [x0, x1], [y0, y1] // result is [X0, Y0], [X1, Y1] #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, values_x: __m128, values_y: __m128, ) -> [__m128; 2] { parallel_fft2_contiguous_f32(values_x, values_y) } } // double lenth 2 fft of a and b, given as [x0, y0], [x1, y1] // result is [X0, Y0], [X1, Y1] #[inline(always)] pub(crate) unsafe fn parallel_fft2_interleaved_f32(val02: __m128, val13: __m128) -> [__m128; 2] { let temp0 = _mm_add_ps(val02, val13); let temp1 = _mm_sub_ps(val02, val13); [temp0, temp1] } // double lenth 2 fft of a and b, given as [x0, x1], [y0, y1] // result is [X0, Y0], [X1, Y1] #[inline(always)] unsafe fn parallel_fft2_contiguous_f32(left: __m128, right: __m128) -> [__m128; 2] { let [temp02, temp13] = transpose_complex_2x2_f32(left, right); parallel_fft2_interleaved_f32(temp02, temp13) } // length 2 fft of x, given as [x0, x1] // result is [X0, X1] #[inline(always)] unsafe fn solo_fft2_f32(values: __m128) -> __m128 { let temp = reverse_complex_elements_f32(values); let temp2 = negate_hi_f32(values); _mm_add_ps(temp2, temp) } // ____ __ _ _ _ _ _ // |___ \ / /_ | || | | |__ (_) |_ // __) | _____ | '_ \| || |_| '_ \| | __| // / __/ |_____| | (_) |__ _| |_) | | |_ // |_____| \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly2 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly2, 2, |this: &SseF64Butterfly2<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly2, 2, |this: &SseF64Butterfly2<_>| this .direction); impl SseF64Butterfly2 { #[inline(always)] pub fn new(direction: 
FftDirection) -> Self { assert_f64::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let out = self.perform_fft_direct(value0, value1); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: __m128d, value1: __m128d, ) -> [__m128d; 2] { solo_fft2_f64(value0, value1) } } #[inline(always)] pub(crate) unsafe fn solo_fft2_f64(left: __m128d, right: __m128d) -> [__m128d; 2] { let temp0 = _mm_add_pd(left, right); let temp1 = _mm_sub_pd(left, right); [temp0, temp1] } // _____ _________ _ _ _ // |___ / |___ /___ \| |__ (_) |_ // |_ \ _____ |_ \ __) | '_ \| | __| // ___) | |_____| ___) / __/| |_) | | |_ // |____/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly3 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle: __m128, twiddle1re: __m128, twiddle1im: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly3, 3, |this: &SseF32Butterfly3<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly3, 3, |this: &SseF32Butterfly3<_>| this .direction); impl SseF32Butterfly3 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 3, direction); let twiddle = unsafe { _mm_set_ps(-tw1.im, -tw1.im, tw1.re, tw1.re) }; let twiddle1re = unsafe { _mm_set_ps(tw1.re, tw1.re, tw1.re, tw1.re) }; let twiddle1im = unsafe { _mm_set_ps(tw1.im, tw1.im, tw1.im, tw1.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle, twiddle1re, twiddle1im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value0x = buffer.load_partial1_complex(0); let value12 = buffer.load_complex(1); let out = self.perform_fft_direct(value0x, value12); buffer.store_partial_lo_complex(out[0], 0); buffer.store_complex(out[1], 1); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let valuea0a1 = buffer.load_complex(0); let valuea2b0 = buffer.load_complex(2); let valueb1b2 = buffer.load_complex(4); let value0 = extract_lo_hi_f32(valuea0a1, valuea2b0); let value1 = extract_hi_lo_f32(valuea0a1, valueb1b2); let value2 = extract_lo_hi_f32(valuea2b0, valueb1b2); let out = self.perform_parallel_fft_direct(value0, value1, value2); let out0 = extract_lo_lo_f32(out[0], out[1]); let out1 = extract_lo_hi_f32(out[2], out[0]); let out2 = extract_hi_hi_f32(out[1], out[2]); buffer.store_complex(out0, 0); buffer.store_complex(out1, 2); buffer.store_complex(out2, 4); } // length 3 fft of a, given as [x0, 0.0], [x1, x2] // result is [X0, Z], [X1, X2] // The value Z should be discarded. #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0x: __m128, value12: __m128, ) -> [__m128; 2] { // This is a SSE translation of the scalar 3-point butterfly let rev12 = negate_hi_f32(reverse_complex_elements_f32(value12)); let temp12pn = self.rotate.rotate_hi(_mm_add_ps(value12, rev12)); let twiddled = _mm_mul_ps(temp12pn, self.twiddle); let temp = _mm_add_ps(value0x, twiddled); let out12 = solo_fft2_f32(temp); let out0x = _mm_add_ps(value0x, temp12pn); [out0x, out12] } // length 3 dual fft of a, given as (x0, y0), (x1, y1), (x2, y2). 
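//
// For reference, both the solo and the parallel 3-point routines implement the
// same scalar butterfly (a sketch, with w = twiddles::compute_twiddle(1, 3,
// direction), so the direction is already folded into w):
//
//     X0 = x0 + (x1 + x2)
//     X1 = x0 + w.re * (x1 + x2) + i * w.im * (x1 - x2)
//     X2 = x0 + w.re * (x1 + x2) - i * w.im * (x1 - x2)
//
// which is why only the sum x12p, the difference x12n, one twiddle (broadcast
// into twiddle1re / twiddle1im) and a 90-degree rotation (Rotate90F32) are
// needed.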
// result is [(X0, Y0), (X1, Y1), (X2, Y2)] #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, value0: __m128, value1: __m128, value2: __m128, ) -> [__m128; 3] { // This is a SSE translation of the scalar 3-point butterfly let x12p = _mm_add_ps(value1, value2); let x12n = _mm_sub_ps(value1, value2); let sum = _mm_add_ps(value0, x12p); let temp_a = _mm_mul_ps(self.twiddle1re, x12p); let temp_a = _mm_add_ps(temp_a, value0); let n_rot = self.rotate.rotate_both(x12n); let temp_b = _mm_mul_ps(self.twiddle1im, n_rot); let x1 = _mm_add_ps(temp_a, temp_b); let x2 = _mm_sub_ps(temp_a, temp_b); [sum, x1, x2] } } // _____ __ _ _ _ _ _ // |___ / / /_ | || | | |__ (_) |_ // |_ \ _____ | '_ \| || |_| '_ \| | __| // ___) | |_____| | (_) |__ _| |_) | | |_ // |____/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly3 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: __m128d, twiddle1im: __m128d, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly3, 3, |this: &SseF64Butterfly3<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly3, 3, |this: &SseF64Butterfly3<_>| this .direction); impl SseF64Butterfly3 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 3, direction); let twiddle1re = unsafe { _mm_set_pd(tw1.re, tw1.re) }; let twiddle1im = unsafe { _mm_set_pd(tw1.im, tw1.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let out = self.perform_fft_direct(value0, value1, value2); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); } // length 3 fft of x, given as x0, x1, x2. 
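// (Unlike the f32 butterflies above, which pack two interleaved Complex<f32>
// into one __m128 -- [x0.re, x0.im, x1.re, x1.im] -- and therefore provide
// both solo and "parallel"/dual code paths, an __m128d holds exactly one
// Complex<f64> -- [x0.re, x0.im] -- so the f64 butterflies below need only a
// single code path.)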
// result is [X0, X1, X2] #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: __m128d, value1: __m128d, value2: __m128d, ) -> [__m128d; 3] { // This is a SSE translation of the scalar 3-point butterfly let x12p = _mm_add_pd(value1, value2); let x12n = _mm_sub_pd(value1, value2); let sum = _mm_add_pd(value0, x12p); let temp_a = _mm_mul_pd(self.twiddle1re, x12p); let temp_a = _mm_add_pd(temp_a, value0); let n_rot = self.rotate.rotate(x12n); let temp_b = _mm_mul_pd(self.twiddle1im, n_rot); let x1 = _mm_add_pd(temp_a, temp_b); let x2 = _mm_sub_pd(temp_a, temp_b); [sum, x1, x2] } } // _ _ _________ _ _ _ // | || | |___ /___ \| |__ (_) |_ // | || |_ _____ |_ \ __) | '_ \| | __| // |__ _| |_____| ___) / __/| |_) | | |_ // |_| |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly4 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly4, 4, |this: &SseF32Butterfly4<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly4, 4, |this: &SseF32Butterfly4<_>| this .direction); impl SseF32Butterfly4 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; Self { direction, _phantom: std::marker::PhantomData, rotate, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value01 = buffer.load_complex(0); let value23 = buffer.load_complex(2); let out = self.perform_fft_direct(value01, value23); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 2); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value01a = buffer.load_complex(0); let value23a = buffer.load_complex(2); let value01b = buffer.load_complex(4); let value23b = buffer.load_complex(6); let [value0ab, value1ab] = transpose_complex_2x2_f32(value01a, value01b); let [value2ab, value3ab] = transpose_complex_2x2_f32(value23a, value23b); let out = self.perform_parallel_fft_direct(value0ab, value1ab, value2ab, value3ab); let [out0, out1] = transpose_complex_2x2_f32(out[0], out[1]); let [out2, out3] = transpose_complex_2x2_f32(out[2], out[3]); buffer.store_complex(out0, 0); buffer.store_complex(out1, 4); buffer.store_complex(out2, 2); buffer.store_complex(out3, 6); } // length 4 fft of a, given as [x0, x1], [x2, x3] // result is [[X0, X1], [X2, X3]] #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value01: __m128, value23: __m128, ) -> [__m128; 2] { //we're going to hardcode a step of mixed radix //aka we're going to do the six step algorithm // step 1: transpose // and // step 2: column FFTs let mut temp = parallel_fft2_interleaved_f32(value01, value23); // step 3: apply twiddle factors (only one in this case, and it's either 0 + i or 0 - i) temp[1] = self.rotate.rotate_hi(temp[1]); // step 4: transpose, which we're skipping because we're the previous FFTs were non-contiguous // step 5: row FFTs // and // step 6: transpose by swapping index 1 and 2 parallel_fft2_contiguous_f32(temp[0], temp[1]) } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, values0: __m128, values1: __m128, values2: __m128, values3: __m128, ) -> [__m128; 4] { //we're going to hardcode a step of mixed radix //aka we're going to do the six step algorithm // step 1: transpose // and // step 2: column FFTs let temp0 = parallel_fft2_interleaved_f32(values0, 
values2); let mut temp1 = parallel_fft2_interleaved_f32(values1, values3); // step 3: apply twiddle factors (only one in this case, and it's either 0 + i or 0 - i) temp1[1] = self.rotate.rotate_both(temp1[1]); // step 4: transpose, which we're skipping because we're the previous FFTs were non-contiguous // step 5: row FFTs let out0 = parallel_fft2_interleaved_f32(temp0[0], temp1[0]); let out2 = parallel_fft2_interleaved_f32(temp0[1], temp1[1]); // step 6: transpose by swapping index 1 and 2 [out0[0], out2[0], out0[1], out2[1]] } } // _ _ __ _ _ _ _ _ // | || | / /_ | || | | |__ (_) |_ // | || |_ _____ | '_ \| || |_| '_ \| | __| // |__ _| |_____| | (_) |__ _| |_) | | |_ // |_| \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly4 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly4, 4, |this: &SseF64Butterfly4<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly4, 4, |this: &SseF64Butterfly4<_>| this .direction); impl SseF64Butterfly4 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; Self { direction, _phantom: std::marker::PhantomData, rotate, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let value3 = buffer.load_complex(3); let out = self.perform_fft_direct(value0, value1, value2, value3); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); buffer.store_complex(out[3], 3); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: __m128d, value1: __m128d, value2: __m128d, value3: __m128d, ) -> [__m128d; 4] { //we're going to hardcode a step of mixed radix //aka we're going to do the six step algorithm // step 1: transpose // and // step 2: column FFTs let temp0 = solo_fft2_f64(value0, value2); let mut temp1 = solo_fft2_f64(value1, value3); // step 3: apply twiddle factors (only one in this case, and it's either 0 + i or 0 - i) temp1[1] = self.rotate.rotate(temp1[1]); // step 4: transpose, which we're skipping because we're the previous FFTs were non-contiguous // step 5: row FFTs let out0 = solo_fft2_f64(temp0[0], temp1[0]); let out2 = solo_fft2_f64(temp0[1], temp1[1]); // step 6: transpose by swapping index 1 and 2 [out0[0], out2[0], out0[1], out2[1]] } } // ____ _________ _ _ _ // | ___| |___ /___ \| |__ (_) |_ // |___ \ _____ |_ \ __) | '_ \| | __| // ___) | |_____| ___) / __/| |_) | | |_ // |____/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly5 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle12re: __m128, twiddle21re: __m128, twiddle12im: __m128, twiddle21im: __m128, twiddle1re: __m128, twiddle1im: __m128, twiddle2re: __m128, twiddle2im: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly5, 5, |this: &SseF32Butterfly5<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly5, 5, |this: &SseF32Butterfly5<_>| this .direction); impl SseF32Butterfly5 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 5, direction); let tw2: Complex = twiddles::compute_twiddle(2, 5, direction); let twiddle12re = unsafe { 
_mm_set_ps(tw2.re, tw2.re, tw1.re, tw1.re) }; let twiddle21re = unsafe { _mm_set_ps(tw1.re, tw1.re, tw2.re, tw2.re) }; let twiddle12im = unsafe { _mm_set_ps(tw2.im, tw2.im, tw1.im, tw1.im) }; let twiddle21im = unsafe { _mm_set_ps(-tw1.im, -tw1.im, tw2.im, tw2.im) }; let twiddle1re = unsafe { _mm_set_ps(tw1.re, tw1.re, tw1.re, tw1.re) }; let twiddle1im = unsafe { _mm_set_ps(tw1.im, tw1.im, tw1.im, tw1.im) }; let twiddle2re = unsafe { _mm_set_ps(tw2.re, tw2.re, tw2.re, tw2.re) }; let twiddle2im = unsafe { _mm_set_ps(tw2.im, tw2.im, tw2.im, tw2.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle12re, twiddle21re, twiddle12im, twiddle21im, twiddle1re, twiddle1im, twiddle2re, twiddle2im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value00 = buffer.load1_complex(0); let value12 = buffer.load_complex(1); let value34 = buffer.load_complex(3); let out = self.perform_fft_direct(value00, value12, value34); buffer.store_partial_lo_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 3); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4 ,6, 8}); let value0 = extract_lo_hi_f32(input_packed[0], input_packed[2]); let value1 = extract_hi_lo_f32(input_packed[0], input_packed[3]); let value2 = extract_lo_hi_f32(input_packed[1], input_packed[3]); let value3 = extract_hi_lo_f32(input_packed[1], input_packed[4]); let value4 = extract_lo_hi_f32(input_packed[2], input_packed[4]); let out = self.perform_parallel_fft_direct(value0, value1, value2, value3, value4); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_hi_f32(out[4], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4}); } // length 5 fft of a, given as [x0, x0], [x1, x2], [x3, x4]. // result is [[X0, Z], [X1, X2], [X3, X4]] // Note that Z should not be used. #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value00: __m128, value12: __m128, value34: __m128, ) -> [__m128; 3] { // This is a SSE translation of the scalar 5-point butterfly let temp43 = reverse_complex_elements_f32(value34); let x1423p = _mm_add_ps(value12, temp43); let x1423n = _mm_sub_ps(value12, temp43); let x1414p = duplicate_lo_f32(x1423p); let x2323p = duplicate_hi_f32(x1423p); let x1414n = duplicate_lo_f32(x1423n); let x2323n = duplicate_hi_f32(x1423n); let temp_a1 = _mm_mul_ps(self.twiddle12re, x1414p); let temp_a2 = _mm_mul_ps(self.twiddle21re, x2323p); let temp_b1 = _mm_mul_ps(self.twiddle12im, x1414n); let temp_b2 = _mm_mul_ps(self.twiddle21im, x2323n); let temp_a = _mm_add_ps(temp_a1, temp_a2); let temp_a = _mm_add_ps(value00, temp_a); let temp_b = _mm_add_ps(temp_b1, temp_b2); let b_rot = self.rotate.rotate_both(temp_b); let x00 = _mm_add_ps(value00, _mm_add_ps(x1414p, x2323p)); let x12 = _mm_add_ps(temp_a, b_rot); let x34 = reverse_complex_elements_f32(_mm_sub_ps(temp_a, b_rot)); [x00, x12, x34] } // length 5 dual fft of x and y, given as (x0, y0), (x1, y1) ... (x4, y4). // result is [(X0, Y0), (X1, Y1) ... 
(X4, Y4)] #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, value0: __m128, value1: __m128, value2: __m128, value3: __m128, value4: __m128, ) -> [__m128; 5] { // This is a SSE translation of the scalar 5-point butterfly let x14p = _mm_add_ps(value1, value4); let x14n = _mm_sub_ps(value1, value4); let x23p = _mm_add_ps(value2, value3); let x23n = _mm_sub_ps(value2, value3); let temp_a1_1 = _mm_mul_ps(self.twiddle1re, x14p); let temp_a1_2 = _mm_mul_ps(self.twiddle2re, x23p); let temp_b1_1 = _mm_mul_ps(self.twiddle1im, x14n); let temp_b1_2 = _mm_mul_ps(self.twiddle2im, x23n); let temp_a2_1 = _mm_mul_ps(self.twiddle1re, x23p); let temp_a2_2 = _mm_mul_ps(self.twiddle2re, x14p); let temp_b2_1 = _mm_mul_ps(self.twiddle2im, x14n); let temp_b2_2 = _mm_mul_ps(self.twiddle1im, x23n); let temp_a1 = _mm_add_ps(value0, _mm_add_ps(temp_a1_1, temp_a1_2)); let temp_b1 = _mm_add_ps(temp_b1_1, temp_b1_2); let temp_a2 = _mm_add_ps(value0, _mm_add_ps(temp_a2_1, temp_a2_2)); let temp_b2 = _mm_sub_ps(temp_b2_1, temp_b2_2); [ _mm_add_ps(value0, _mm_add_ps(x14p, x23p)), _mm_add_ps(temp_a1, self.rotate.rotate_both(temp_b1)), _mm_add_ps(temp_a2, self.rotate.rotate_both(temp_b2)), _mm_sub_ps(temp_a2, self.rotate.rotate_both(temp_b2)), _mm_sub_ps(temp_a1, self.rotate.rotate_both(temp_b1)), ] } } // ____ __ _ _ _ _ _ // | ___| / /_ | || | | |__ (_) |_ // |___ \ _____ | '_ \| || |_| '_ \| | __| // ___) | |_____| | (_) |__ _| |_) | | |_ // |____/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly5 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: __m128d, twiddle1im: __m128d, twiddle2re: __m128d, twiddle2im: __m128d, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly5, 5, |this: &SseF64Butterfly5<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly5, 5, |this: &SseF64Butterfly5<_>| this .direction); impl SseF64Butterfly5 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 5, direction); let tw2: Complex = twiddles::compute_twiddle(2, 5, direction); let twiddle1re = unsafe { _mm_set_pd(tw1.re, tw1.re) }; let twiddle1im = unsafe { _mm_set_pd(tw1.im, tw1.im) }; let twiddle2re = unsafe { _mm_set_pd(tw2.re, tw2.re) }; let twiddle2im = unsafe { _mm_set_pd(tw2.im, tw2.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let value3 = buffer.load_complex(3); let value4 = buffer.load_complex(4); let out = self.perform_fft_direct(value0, value1, value2, value3, value4); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); buffer.store_complex(out[3], 3); buffer.store_complex(out[4], 4); } // length 5 fft of x, given as x0, x1, x2, x3, x4.
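//
// For reference, the scalar 5-point butterfly that both 5-point routines
// translate (a sketch, with w1 = compute_twiddle(1, 5, direction) and
// w2 = compute_twiddle(2, 5, direction)):
//
//     X0 = x0 + (x1 + x4) + (x2 + x3)
//     X1 = x0 + w1.re*(x1 + x4) + w2.re*(x2 + x3) + i*[w1.im*(x1 - x4) + w2.im*(x2 - x3)]
//     X4 = same as X1 with the bracketed term subtracted
//     X2 = x0 + w2.re*(x1 + x4) + w1.re*(x2 + x3) + i*[w2.im*(x1 - x4) - w1.im*(x2 - x3)]
//     X3 = same as X2 with the bracketed term subtracted
//
// which maps onto the x14p / x14n / x23p / x23n sums and differences and the
// temp_a* / temp_b* terms used above and below.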
// result is [X0, X1, X2, X3, X4] #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: __m128d, value1: __m128d, value2: __m128d, value3: __m128d, value4: __m128d, ) -> [__m128d; 5] { // This is a SSE translation of the scalar 5-point butterfly let x14p = _mm_add_pd(value1, value4); let x14n = _mm_sub_pd(value1, value4); let x23p = _mm_add_pd(value2, value3); let x23n = _mm_sub_pd(value2, value3); let temp_a1_1 = _mm_mul_pd(self.twiddle1re, x14p); let temp_a1_2 = _mm_mul_pd(self.twiddle2re, x23p); let temp_a2_1 = _mm_mul_pd(self.twiddle2re, x14p); let temp_a2_2 = _mm_mul_pd(self.twiddle1re, x23p); let temp_b1_1 = _mm_mul_pd(self.twiddle1im, x14n); let temp_b1_2 = _mm_mul_pd(self.twiddle2im, x23n); let temp_b2_1 = _mm_mul_pd(self.twiddle2im, x14n); let temp_b2_2 = _mm_mul_pd(self.twiddle1im, x23n); let temp_a1 = _mm_add_pd(value0, _mm_add_pd(temp_a1_1, temp_a1_2)); let temp_a2 = _mm_add_pd(value0, _mm_add_pd(temp_a2_1, temp_a2_2)); let temp_b1 = _mm_add_pd(temp_b1_1, temp_b1_2); let temp_b2 = _mm_sub_pd(temp_b2_1, temp_b2_2); let temp_b1_rot = self.rotate.rotate(temp_b1); let temp_b2_rot = self.rotate.rotate(temp_b2); [ _mm_add_pd(value0, _mm_add_pd(x14p, x23p)), _mm_add_pd(temp_a1, temp_b1_rot), _mm_add_pd(temp_a2, temp_b2_rot), _mm_sub_pd(temp_a2, temp_b2_rot), _mm_sub_pd(temp_a1, temp_b1_rot), ] } } // __ _________ _ _ _ // / /_ |___ /___ \| |__ (_) |_ // | '_ \ _____ |_ \ __) | '_ \| | __| // | (_) | |_____| ___) / __/| |_) | | |_ // \___/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly6 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: SseF32Butterfly3, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly6, 6, |this: &SseF32Butterfly6<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly6, 6, |this: &SseF32Butterfly6<_>| this .direction); impl SseF32Butterfly6 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = SseF32Butterfly3::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value01 = buffer.load_complex(0); let value23 = buffer.load_complex(2); let value45 = buffer.load_complex(4); let out = self.perform_fft_direct(value01, value23, value45); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 2); buffer.store_complex(out[2], 4); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10}); let values = interleave_complex_f32!(input_packed, 3, {0, 1, 2}); let out = self.perform_parallel_fft_direct( values[0], values[1], values[2], values[3], values[4], values[5], ); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0, 1, 2, 3, 4, 5}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value01: __m128, value23: __m128, value45: __m128, ) -> [__m128; 3] { // Algorithm: 3x2 good-thomas // Size-3 FFTs down the columns of our reordered array let reord0 = extract_lo_hi_f32(value01, value23); let reord1 = extract_lo_hi_f32(value23, value45); let reord2 = extract_lo_hi_f32(value45, value01); let mid = self.bf3.perform_parallel_fft_direct(reord0, reord1, reord2); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the 
columns let [output0, output1] = parallel_fft2_contiguous_f32(mid[0], mid[1]); let output2 = solo_fft2_f32(mid[2]); // Reorder into output [ extract_lo_hi_f32(output0, output1), extract_lo_lo_f32(output2, output1), extract_hi_hi_f32(output0, output2), ] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, value0: __m128, value1: __m128, value2: __m128, value3: __m128, value4: __m128, value5: __m128, ) -> [__m128; 6] { // Algorithm: 3x2 good-thomas // Size-3 FFTs down the columns of our reordered array let mid0 = self.bf3.perform_parallel_fft_direct(value0, value2, value4); let mid1 = self.bf3.perform_parallel_fft_direct(value3, value5, value1); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = parallel_fft2_interleaved_f32(mid0[0], mid1[0]); let [output2, output3] = parallel_fft2_interleaved_f32(mid0[1], mid1[1]); let [output4, output5] = parallel_fft2_interleaved_f32(mid0[2], mid1[2]); // Reorder into output [output0, output3, output4, output1, output2, output5] } } // __ __ _ _ _ _ _ // / /_ / /_ | || | | |__ (_) |_ // | '_ \ _____ | '_ \| || |_| '_ \| | __| // | (_) | |_____| | (_) |__ _| |_) | | |_ // \___/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly6 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: SseF64Butterfly3, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly6, 6, |this: &SseF64Butterfly6<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly6, 6, |this: &SseF64Butterfly6<_>| this .direction); impl SseF64Butterfly6 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = SseF64Butterfly3::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let value3 = buffer.load_complex(3); let value4 = buffer.load_complex(4); let value5 = buffer.load_complex(5); let out = self.perform_fft_direct(value0, value1, value2, value3, value4, value5); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); buffer.store_complex(out[3], 3); buffer.store_complex(out[4], 4); buffer.store_complex(out[5], 5); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: __m128d, value1: __m128d, value2: __m128d, value3: __m128d, value4: __m128d, value5: __m128d, ) -> [__m128d; 6] { // Algorithm: 3x2 good-thomas // Size-3 FFTs down the columns of our reordered array let mid0 = self.bf3.perform_fft_direct(value0, value2, value4); let mid1 = self.bf3.perform_fft_direct(value3, value5, value1); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = solo_fft2_f64(mid0[0], mid1[0]); let [output2, output3] = solo_fft2_f64(mid0[1], mid1[1]); let [output4, output5] = solo_fft2_f64(mid0[2], mid1[2]); // Reorder into output [output0, output3, output4, output1, output2, output5] } } // ___ _________ _ _ _ // ( _ ) |___ /___ \| |__ (_) |_ // / _ \ _____ |_ \ __) | '_ \| | __| // | (_) | |_____| ___) / __/| |_) | | |_ // \___/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly8 { root2: __m128, root2_dual: 
__m128, direction: FftDirection, bf4: SseF32Butterfly4, rotate90: Rotate90F32, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly8, 8, |this: &SseF32Butterfly8<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly8, 8, |this: &SseF32Butterfly8<_>| this .direction); impl SseF32Butterfly8 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf4 = SseF32Butterfly4::new(direction); let root2 = unsafe { _mm_set_ps(0.5f32.sqrt(), 0.5f32.sqrt(), 1.0, 1.0) }; let root2_dual = unsafe { _mm_load1_ps(&0.5f32.sqrt()) }; let rotate90 = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; Self { root2, root2_dual, direction, bf4, rotate90, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6}); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14}); let values = interleave_complex_f32!(input_packed, 4, {0, 1, 2, 3}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7}); } #[inline(always)] unsafe fn perform_fft_direct(&self, values: [__m128; 4]) -> [__m128; 4] { // we're going to hardcode a step of mixed radix // step 1: copy and reorder the input into the scratch let [in02, in13] = transpose_complex_2x2_f32(values[0], values[1]); let [in46, in57] = transpose_complex_2x2_f32(values[2], values[3]); // step 2: column FFTs let val0 = self.bf4.perform_fft_direct(in02, in46); let mut val2 = self.bf4.perform_fft_direct(in13, in57); // step 3: apply twiddle factors let val2b = self.rotate90.rotate_hi(val2[0]); let val2c = _mm_add_ps(val2b, val2[0]); let val2d = _mm_mul_ps(val2c, self.root2); val2[0] = extract_lo_hi_f32(val2[0], val2d); let val3b = self.rotate90.rotate_both(val2[1]); let val3c = _mm_sub_ps(val3b, val2[1]); let val3d = _mm_mul_ps(val3c, self.root2); val2[1] = extract_lo_hi_f32(val3b, val3d); // step 4: transpose -- skipped because we're going to do the next FFTs non-contiguously // step 5: row FFTs let out0 = parallel_fft2_interleaved_f32(val0[0], val2[0]); let out1 = parallel_fft2_interleaved_f32(val0[1], val2[1]); // step 6: rearrange and copy to buffer [out0[0], out1[0], out0[1], out1[1]] } #[inline(always)] unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 8]) -> [__m128; 8] { // we're going to hardcode a step of mixed radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let val03 = self .bf4 .perform_parallel_fft_direct(values[0], values[2], values[4], values[6]); let mut val47 = self .bf4 .perform_parallel_fft_direct(values[1], values[3], values[5], values[7]); // step 3: apply twiddle factors let val5b = self.rotate90.rotate_both(val47[1]); let val5c = _mm_add_ps(val5b, val47[1]); val47[1] = _mm_mul_ps(val5c, self.root2_dual); val47[2] = self.rotate90.rotate_both(val47[2]); let val7b = self.rotate90.rotate_both(val47[3]); let val7c = _mm_sub_ps(val7b, val47[3]); val47[3] = _mm_mul_ps(val7c, self.root2_dual); // step 4: transpose -- skipped because we're going to do the next FFTs non-contiguously // step 5: row FFTs let out0 = 
parallel_fft2_interleaved_f32(val03[0], val47[0]); let out1 = parallel_fft2_interleaved_f32(val03[1], val47[1]); let out2 = parallel_fft2_interleaved_f32(val03[2], val47[2]); let out3 = parallel_fft2_interleaved_f32(val03[3], val47[3]); // step 6: rearrange and copy to buffer [ out0[0], out1[0], out2[0], out3[0], out0[1], out1[1], out2[1], out3[1], ] } } // ___ __ _ _ _ _ _ // ( _ ) / /_ | || | | |__ (_) |_ // / _ \ _____ | '_ \| || |_| '_ \| | __| // | (_) | |_____| | (_) |__ _| |_) | | |_ // \___/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly8 { root2: __m128d, direction: FftDirection, bf4: SseF64Butterfly4, rotate90: Rotate90F64, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly8, 8, |this: &SseF64Butterfly8<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly8, 8, |this: &SseF64Butterfly8<_>| this .direction); impl SseF64Butterfly8 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf4 = SseF64Butterfly4::new(direction); let root2 = unsafe { _mm_load1_pd(&0.5f64.sqrt()) }; let rotate90 = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; Self { root2, direction, bf4, rotate90, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7}); } #[inline(always)] unsafe fn perform_fft_direct(&self, values: [__m128d; 8]) -> [__m128d; 8] { // we're going to hardcode a step of mixed radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let val03 = self .bf4 .perform_fft_direct(values[0], values[2], values[4], values[6]); let mut val47 = self .bf4 .perform_fft_direct(values[1], values[3], values[5], values[7]); // step 3: apply twiddle factors let val5b = self.rotate90.rotate(val47[1]); let val5c = _mm_add_pd(val5b, val47[1]); val47[1] = _mm_mul_pd(val5c, self.root2); val47[2] = self.rotate90.rotate(val47[2]); let val7b = self.rotate90.rotate(val47[3]); let val7c = _mm_sub_pd(val7b, val47[3]); val47[3] = _mm_mul_pd(val7c, self.root2); // step 4: transpose -- skipped because we're going to do the next FFTs non-contiguously // step 5: row FFTs let out0 = solo_fft2_f64(val03[0], val47[0]); let out1 = solo_fft2_f64(val03[1], val47[1]); let out2 = solo_fft2_f64(val03[2], val47[2]); let out3 = solo_fft2_f64(val03[3], val47[3]); // step 6: rearrange and copy to buffer [ out0[0], out1[0], out2[0], out3[0], out0[1], out1[1], out2[1], out3[1], ] } } // ___ _________ _ _ _ // / _ \ |___ /___ \| |__ (_) |_ // | (_) | _____ |_ \ __) | '_ \| | __| // \__, | |_____| ___) / __/| |_) | | |_ // /_/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly9 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: SseF32Butterfly3, twiddle1: __m128, twiddle2: __m128, twiddle4: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly9, 9, |this: &SseF32Butterfly9<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly9, 9, |this: &SseF32Butterfly9<_>| this .direction); impl SseF32Butterfly9 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = SseF32Butterfly3::new(direction); let tw1: Complex = twiddles::compute_twiddle(1, 9, direction); let tw2: Complex = twiddles::compute_twiddle(2, 9, direction); let tw4: Complex = twiddles::compute_twiddle(4, 9, direction); let twiddle1 = unsafe { 
_mm_set_ps(tw1.im, tw1.re, tw1.im, tw1.re) }; let twiddle2 = unsafe { _mm_set_ps(tw2.im, tw2.re, tw2.im, tw2.re) }; let twiddle4 = unsafe { _mm_set_ps(tw4.im, tw4.re, tw4.im, tw4.re) }; Self { direction, _phantom: std::marker::PhantomData, bf3, twiddle1, twiddle2, twiddle4, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { // A single Sse 9-point will need a lot of shuffling, let's just reuse the dual one let values = read_partial1_complex_to_array!(buffer, {0,1,2,3,4,5,6,7,8}); let out = self.perform_parallel_fft_direct(values); for n in 0..9 { buffer.store_partial_lo_complex(out[n], n); } } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[4]), extract_hi_lo_f32(input_packed[0], input_packed[5]), extract_lo_hi_f32(input_packed[1], input_packed[5]), extract_hi_lo_f32(input_packed[1], input_packed[6]), extract_lo_hi_f32(input_packed[2], input_packed[6]), extract_hi_lo_f32(input_packed[2], input_packed[7]), extract_lo_hi_f32(input_packed[3], input_packed[7]), extract_hi_lo_f32(input_packed[3], input_packed[8]), extract_lo_hi_f32(input_packed[4], input_packed[8]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_hi_f32(out[8], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0,1,2,3,4,5,6,7,8}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 9]) -> [__m128; 9] { // Algorithm: 3x3 mixed radix // Size-3 FFTs down the columns let mid0 = self .bf3 .perform_parallel_fft_direct(values[0], values[3], values[6]); let mut mid1 = self .bf3 .perform_parallel_fft_direct(values[1], values[4], values[7]); let mut mid2 = self .bf3 .perform_parallel_fft_direct(values[2], values[5], values[8]); // Apply twiddle factors. 
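//
// In this 3x3 mixed-radix step, the element at (row i, column j) of the
// intermediate grid is scaled by w9^(i*j), where w9 is the primitive length-9
// twiddle (the standard mixed-radix twiddle rule). Row 0 and column 0 need no
// scaling, leaving w9^1, w9^2, w9^2 and w9^4 -- i.e. twiddle1, twiddle2 (used
// twice) and twiddle4.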
Note that we're re-using twiddle2 mid1[1] = mul_complex_f32(self.twiddle1, mid1[1]); mid1[2] = mul_complex_f32(self.twiddle2, mid1[2]); mid2[1] = mul_complex_f32(self.twiddle2, mid2[1]); mid2[2] = mul_complex_f32(self.twiddle4, mid2[2]); let [output0, output1, output2] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self .bf3 .perform_parallel_fft_direct(mid0[2], mid1[2], mid2[2]); [ output0, output3, output6, output1, output4, output7, output2, output5, output8, ] } } // ___ __ _ _ _ _ _ // / _ \ / /_ | || | | |__ (_) |_ // | (_) | _____ | '_ \| || |_| '_ \| | __| // \__, | |_____| | (_) |__ _| |_) | | |_ // /_/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly9 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: SseF64Butterfly3, twiddle1: __m128d, twiddle2: __m128d, twiddle4: __m128d, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly9, 9, |this: &SseF64Butterfly9<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly9, 9, |this: &SseF64Butterfly9<_>| this .direction); impl SseF64Butterfly9 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = SseF64Butterfly3::new(direction); let tw1: Complex = twiddles::compute_twiddle(1, 9, direction); let tw2: Complex = twiddles::compute_twiddle(2, 9, direction); let tw4: Complex = twiddles::compute_twiddle(4, 9, direction); let twiddle1 = unsafe { _mm_set_pd(tw1.im, tw1.re) }; let twiddle2 = unsafe { _mm_set_pd(tw2.im, tw2.re) }; let twiddle4 = unsafe { _mm_set_pd(tw4.im, tw4.re) }; Self { direction, _phantom: std::marker::PhantomData, bf3, twiddle1, twiddle2, twiddle4, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 9]) -> [__m128d; 9] { // Algorithm: 3x3 mixed radix // Size-3 FFTs down the columns let mid0 = self.bf3.perform_fft_direct(values[0], values[3], values[6]); let mut mid1 = self.bf3.perform_fft_direct(values[1], values[4], values[7]); let mut mid2 = self.bf3.perform_fft_direct(values[2], values[5], values[8]); // Apply twiddle factors. 
Note that we're re-using twiddle2 mid1[1] = mul_complex_f64(self.twiddle1, mid1[1]); mid1[2] = mul_complex_f64(self.twiddle2, mid1[2]); mid2[1] = mul_complex_f64(self.twiddle2, mid2[1]); mid2[2] = mul_complex_f64(self.twiddle4, mid2[2]); let [output0, output1, output2] = self.bf3.perform_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self.bf3.perform_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self.bf3.perform_fft_direct(mid0[2], mid1[2], mid2[2]); [ output0, output3, output6, output1, output4, output7, output2, output5, output8, ] } } // _ ___ _________ _ _ _ // / |/ _ \ |___ /___ \| |__ (_) |_ // | | | | | _____ |_ \ __) | '_ \| | __| // | | |_| | |_____| ___) / __/| |_) | | |_ // |_|\___/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly10 { direction: FftDirection, _phantom: std::marker::PhantomData, bf5: SseF32Butterfly5, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly10, 10, |this: &SseF32Butterfly10<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly10, 10, |this: &SseF32Butterfly10<_>| this .direction); impl SseF32Butterfly10 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf5 = SseF32Butterfly5::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf5, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8}); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18}); let values = interleave_complex_f32!(input_packed, 5, {0, 1, 2, 3, 4}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128; 5]) -> [__m128; 5] { // Algorithm: 5x2 good-thomas // Reorder and pack let reord0 = extract_lo_hi_f32(values[0], values[2]); let reord1 = extract_lo_hi_f32(values[1], values[3]); let reord2 = extract_lo_hi_f32(values[2], values[4]); let reord3 = extract_lo_hi_f32(values[3], values[0]); let reord4 = extract_lo_hi_f32(values[4], values[1]); // Size-5 FFTs down the columns of our reordered array let mids = self .bf5 .perform_parallel_fft_direct(reord0, reord1, reord2, reord3, reord4); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [temp01, temp23] = parallel_fft2_contiguous_f32(mids[0], mids[1]); let [temp45, temp67] = parallel_fft2_contiguous_f32(mids[2], mids[3]); let temp89 = solo_fft2_f32(mids[4]); // Reorder let out01 = extract_lo_hi_f32(temp01, temp23); let out23 = extract_lo_hi_f32(temp45, temp67); let out45 = extract_lo_lo_f32(temp89, temp23); let out67 = extract_hi_lo_f32(temp01, temp67); let out89 = extract_hi_hi_f32(temp45, temp89); [out01, out23, out45, out67, out89] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 10]) -> [__m128; 10] { // Algorithm: 5x2 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_parallel_fft_direct(values[0], values[2], values[4], values[6], values[8]); let mid1 = 
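// Good-Thomas re-indexing for 10 = 5 * 2: element j of column b is read from input index
// (5*b + 2*j) mod 10, so the second size-5 FFT takes 5, 7, 9, 1, 3. After the size-2 row FFTs,
// result (a, j) lands at output index (5*a + 6*j) mod 10 (a = size-2 output, j = size-5 output),
// which is the reordering at the end of this function. These CRT index maps are what let
// Good-Thomas skip the twiddle multiplications entirely.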
self .bf5 .perform_parallel_fft_direct(values[5], values[7], values[9], values[1], values[3]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = parallel_fft2_interleaved_f32(mid0[0], mid1[0]); let [output2, output3] = parallel_fft2_interleaved_f32(mid0[1], mid1[1]); let [output4, output5] = parallel_fft2_interleaved_f32(mid0[2], mid1[2]); let [output6, output7] = parallel_fft2_interleaved_f32(mid0[3], mid1[3]); let [output8, output9] = parallel_fft2_interleaved_f32(mid0[4], mid1[4]); // Reorder and return [ output0, output3, output4, output7, output8, output1, output2, output5, output6, output9, ] } } // _ ___ __ _ _ _ _ _ // / |/ _ \ / /_ | || | | |__ (_) |_ // | | | | | _____ | '_ \| || |_| '_ \| | __| // | | |_| | |_____| | (_) |__ _| |_) | | |_ // |_|\___/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly10 { direction: FftDirection, _phantom: std::marker::PhantomData, bf2: SseF64Butterfly2, bf5: SseF64Butterfly5, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly10, 10, |this: &SseF64Butterfly10<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly10, 10, |this: &SseF64Butterfly10<_>| this .direction); impl SseF64Butterfly10 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf2 = SseF64Butterfly2::new(direction); let bf5 = SseF64Butterfly5::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf2, bf5, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 10]) -> [__m128d; 10] { // Algorithm: 5x2 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_fft_direct(values[0], values[2], values[4], values[6], values[8]); let mid1 = self .bf5 .perform_fft_direct(values[5], values[7], values[9], values[1], values[3]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = self.bf2.perform_fft_direct(mid0[0], mid1[0]); let [output2, output3] = self.bf2.perform_fft_direct(mid0[1], mid1[1]); let [output4, output5] = self.bf2.perform_fft_direct(mid0[2], mid1[2]); let [output6, output7] = self.bf2.perform_fft_direct(mid0[3], mid1[3]); let [output8, output9] = self.bf2.perform_fft_direct(mid0[4], mid1[4]); // Reorder and return [ output0, output3, output4, output7, output8, output1, output2, output5, output6, output9, ] } } // _ ____ _________ _ _ _ // / |___ \ |___ /___ \| |__ (_) |_ // | | __) | _____ |_ \ __) | '_ \| | __| // | |/ __/ |_____| ___) / __/| |_) | | |_ // |_|_____| |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly12 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: SseF32Butterfly3, bf4: SseF32Butterfly4, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly12, 12, |this: &SseF32Butterfly12<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly12, 12, |this: &SseF32Butterfly12<_>| this .direction); impl SseF32Butterfly12 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = SseF32Butterfly3::new(direction); let bf4 = SseF32Butterfly4::new(direction); Self { 
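// bf3 and bf4 are the sub-FFTs for the 4x3 Good-Thomas decomposition used by the perform_*_direct
// methods below: inputs are gathered at (3*a + 4*b) mod 12 and outputs scattered to
// (9*a + 4*b) mod 12 (a = size-4 index, b = size-3 index), so no twiddles are needed in between.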
direction, _phantom: std::marker::PhantomData, bf3, bf4, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10 }); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4,5}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22}); let values = interleave_complex_f32!(input_packed, 6, {0, 1, 2, 3, 4, 5}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8, 10}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128; 6]) -> [__m128; 6] { // Algorithm: 4x3 good-thomas // Reorder and pack let packed03 = extract_lo_hi_f32(values[0], values[1]); let packed47 = extract_lo_hi_f32(values[2], values[3]); let packed69 = extract_lo_hi_f32(values[3], values[4]); let packed101 = extract_lo_hi_f32(values[5], values[0]); let packed811 = extract_lo_hi_f32(values[4], values[5]); let packed25 = extract_lo_hi_f32(values[1], values[2]); // Size-4 FFTs down the columns of our reordered array let mid0 = self.bf4.perform_fft_direct(packed03, packed69); let mid1 = self.bf4.perform_fft_direct(packed47, packed101); let mid2 = self.bf4.perform_fft_direct(packed811, packed25); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [temp03, temp14, temp25] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [temp69, temp710, temp811] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); // Reorder and return [ extract_lo_hi_f32(temp03, temp14), extract_lo_hi_f32(temp811, temp69), extract_lo_hi_f32(temp14, temp25), extract_lo_hi_f32(temp69, temp710), extract_lo_hi_f32(temp25, temp03), extract_lo_hi_f32(temp710, temp811), ] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 12]) -> [__m128; 12] { // Algorithm: 4x3 good-thomas // Size-4 FFTs down the columns of our reordered array let mid0 = self .bf4 .perform_parallel_fft_direct(values[0], values[3], values[6], values[9]); let mid1 = self .bf4 .perform_parallel_fft_direct(values[4], values[7], values[10], values[1]); let mid2 = self .bf4 .perform_parallel_fft_direct(values[8], values[11], values[2], values[5]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self .bf3 .perform_parallel_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self .bf3 .perform_parallel_fft_direct(mid0[3], mid1[3], mid2[3]); // Reorder and return [ output0, output4, output8, output9, output1, output5, output6, output10, output2, output3, output7, output11, ] } } // _ ____ __ _ _ _ _ _ // / |___ \ / /_ | || | | |__ (_) |_ // | | __) | _____ | '_ \| || |_| '_ \| | __| // | |/ __/ |_____| | (_) |__ _| |_) | | |_ // |_|_____| \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly12 { direction: FftDirection, _phantom: 
std::marker::PhantomData, bf3: SseF64Butterfly3, bf4: SseF64Butterfly4, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly12, 12, |this: &SseF64Butterfly12<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly12, 12, |this: &SseF64Butterfly12<_>| this .direction); impl SseF64Butterfly12 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = SseF64Butterfly3::new(direction); let bf4 = SseF64Butterfly4::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, bf4, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 12]) -> [__m128d; 12] { // Algorithm: 4x3 good-thomas // Size-4 FFTs down the columns of our reordered array let mid0 = self .bf4 .perform_fft_direct(values[0], values[3], values[6], values[9]); let mid1 = self .bf4 .perform_fft_direct(values[4], values[7], values[10], values[1]); let mid2 = self .bf4 .perform_fft_direct(values[8], values[11], values[2], values[5]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self.bf3.perform_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self.bf3.perform_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self.bf3.perform_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self.bf3.perform_fft_direct(mid0[3], mid1[3], mid2[3]); [ output0, output4, output8, output9, output1, output5, output6, output10, output2, output3, output7, output11, ] } } // _ ____ _________ _ _ _ // / | ___| |___ /___ \| |__ (_) |_ // | |___ \ _____ |_ \ __) | '_ \| | __| // | |___) | |_____| ___) / __/| |_) | | |_ // |_|____/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly15 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: SseF32Butterfly3, bf5: SseF32Butterfly5, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly15, 15, |this: &SseF32Butterfly15<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly15, 15, |this: &SseF32Butterfly15<_>| this .direction); impl SseF32Butterfly15 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = SseF32Butterfly3::new(direction); let bf5 = SseF32Butterfly5::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, bf5, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { // A single Sse 15-point will need a lot of shuffling, let's just reuse the dual one let values = read_partial1_complex_to_array!(buffer, {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14}); let out = self.perform_parallel_fft_direct(values); for n in 0..15 { buffer.store_partial_lo_complex(out[n], n); } } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[7]), extract_hi_lo_f32(input_packed[0], input_packed[8]), extract_lo_hi_f32(input_packed[1], input_packed[8]), extract_hi_lo_f32(input_packed[1], 
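// 15 is odd, so the two transforms stored back to back in the buffer do not line up with the
// two-complex __m128 vectors. Each entry of `values` therefore mixes a low and a high half:
// element j of the first transform comes from input_packed[j / 2], and element j of the second
// comes from the vector 15 complexes (7.5 vectors) further on, so that each vector again holds
// the same index from both transforms before the parallel FFT runs.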
input_packed[9]), extract_lo_hi_f32(input_packed[2], input_packed[9]), extract_hi_lo_f32(input_packed[2], input_packed[10]), extract_lo_hi_f32(input_packed[3], input_packed[10]), extract_hi_lo_f32(input_packed[3], input_packed[11]), extract_lo_hi_f32(input_packed[4], input_packed[11]), extract_hi_lo_f32(input_packed[4], input_packed[12]), extract_lo_hi_f32(input_packed[5], input_packed[12]), extract_hi_lo_f32(input_packed[5], input_packed[13]), extract_lo_hi_f32(input_packed[6], input_packed[13]), extract_hi_lo_f32(input_packed[6], input_packed[14]), extract_lo_hi_f32(input_packed[7], input_packed[14]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_hi_f32(out[14], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11, 12, 13, 14}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 15]) -> [__m128; 15] { // Algorithm: 5x3 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_parallel_fft_direct(values[0], values[3], values[6], values[9], values[12]); let mid1 = self .bf5 .perform_parallel_fft_direct(values[5], values[8], values[11], values[14], values[2]); let mid2 = self .bf5 .perform_parallel_fft_direct(values[10], values[13], values[1], values[4], values[7]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self .bf3 .perform_parallel_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self .bf3 .perform_parallel_fft_direct(mid0[3], mid1[3], mid2[3]); let [output12, output13, output14] = self .bf3 .perform_parallel_fft_direct(mid0[4], mid1[4], mid2[4]); [ output0, output4, output8, output9, output13, output2, output3, output7, output11, output12, output1, output5, output6, output10, output14, ] } } // _ ____ __ _ _ _ _ _ // / | ___| / /_ | || | | |__ (_) |_ // | |___ \ _____ | '_ \| || |_| '_ \| | __| // | |___) | |_____| | (_) |__ _| |_) | | |_ // |_|____/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly15 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: SseF64Butterfly3, bf5: SseF64Butterfly5, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly15, 15, |this: &SseF64Butterfly15<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly15, 15, |this: &SseF64Butterfly15<_>| this .direction); impl SseF64Butterfly15 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = SseF64Butterfly3::new(direction); let bf5 = SseF64Butterfly5::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, bf5, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = 
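// The f64 size-15 path reads all 15 elements and hands them to perform_fft_direct below, which
// uses the same 5x3 Good-Thomas scheme as the f32 butterfly above: element j of column b is read
// from (3*j + 5*b) mod 15 (hence 5, 8, 11, 14, 2 and 10, 13, 1, 4, 7 for the second and third
// size-5 FFTs), and result (a, j) of the size-3 row FFTs goes to (6*j + 10*a) mod 15, again with
// no twiddles in between.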
read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 15]) -> [__m128d; 15] { // Algorithm: 5x3 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_fft_direct(values[0], values[3], values[6], values[9], values[12]); let mid1 = self .bf5 .perform_fft_direct(values[5], values[8], values[11], values[14], values[2]); let mid2 = self .bf5 .perform_fft_direct(values[10], values[13], values[1], values[4], values[7]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self.bf3.perform_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self.bf3.perform_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self.bf3.perform_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self.bf3.perform_fft_direct(mid0[3], mid1[3], mid2[3]); let [output12, output13, output14] = self.bf3.perform_fft_direct(mid0[4], mid1[4], mid2[4]); [ output0, output4, output8, output9, output13, output2, output3, output7, output11, output12, output1, output5, output6, output10, output14, ] } } // _ __ _________ _ _ _ // / |/ /_ |___ /___ \| |__ (_) |_ // | | '_ \ _____ |_ \ __) | '_ \| | __| // | | (_) | |_____| ___) / __/| |_) | | |_ // |_|\___/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly16 { direction: FftDirection, bf4: SseF32Butterfly4, bf8: SseF32Butterfly8, rotate90: Rotate90F32, twiddle01: __m128, twiddle23: __m128, twiddle01conj: __m128, twiddle23conj: __m128, twiddle1: __m128, twiddle2: __m128, twiddle3: __m128, twiddle1c: __m128, twiddle2c: __m128, twiddle3c: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly16, 16, |this: &SseF32Butterfly16<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly16, 16, |this: &SseF32Butterfly16<_>| this .direction); impl SseF32Butterfly16 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf8 = SseF32Butterfly8::new(direction); let bf4 = SseF32Butterfly4::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; let tw1: Complex = twiddles::compute_twiddle(1, 16, direction); let tw2: Complex = twiddles::compute_twiddle(2, 16, direction); let tw3: Complex = twiddles::compute_twiddle(3, 16, direction); let twiddle01 = unsafe { _mm_set_ps(tw1.im, tw1.re, 0.0, 1.0) }; let twiddle23 = unsafe { _mm_set_ps(tw3.im, tw3.re, tw2.im, tw2.re) }; let twiddle01conj = unsafe { _mm_set_ps(-tw1.im, tw1.re, 0.0, 1.0) }; let twiddle23conj = unsafe { _mm_set_ps(-tw3.im, tw3.re, -tw2.im, tw2.re) }; let twiddle1 = unsafe { _mm_set_ps(tw1.im, tw1.re, tw1.im, tw1.re) }; let twiddle2 = unsafe { _mm_set_ps(tw2.im, tw2.re, tw2.im, tw2.re) }; let twiddle3 = unsafe { _mm_set_ps(tw3.im, tw3.re, tw3.im, tw3.re) }; let twiddle1c = unsafe { _mm_set_ps(-tw1.im, tw1.re, -tw1.im, tw1.re) }; let twiddle2c = unsafe { _mm_set_ps(-tw2.im, tw2.re, -tw2.im, tw2.re) }; let twiddle3c = unsafe { _mm_set_ps(-tw3.im, tw3.re, -tw3.im, tw3.re) }; Self { direction, bf4, bf8, rotate90, twiddle01, twiddle23, twiddle01conj, twiddle23conj, twiddle1, twiddle2, twiddle3, twiddle1c, twiddle2c, twiddle3c, } } #[inline(always)] unsafe fn 
perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14 }); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4,5,6,7}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30}); let values = interleave_complex_f32!(input_packed, 8, {0, 1, 2, 3 ,4 ,5 ,6 ,7}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8, 10, 12, 14}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11,12,13,14, 15}); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: [__m128; 8]) -> [__m128; 8] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch let in0002 = extract_lo_lo_f32(input[0], input[1]); let in0406 = extract_lo_lo_f32(input[2], input[3]); let in0810 = extract_lo_lo_f32(input[4], input[5]); let in1214 = extract_lo_lo_f32(input[6], input[7]); let in0105 = extract_hi_hi_f32(input[0], input[2]); let in0913 = extract_hi_hi_f32(input[4], input[6]); let in1503 = extract_hi_hi_f32(input[7], input[1]); let in0711 = extract_hi_hi_f32(input[3], input[5]); let in_evens = [in0002, in0406, in0810, in1214]; // step 2: column FFTs let evens = self.bf8.perform_fft_direct(in_evens); let mut odds1 = self.bf4.perform_fft_direct(in0105, in0913); let mut odds3 = self.bf4.perform_fft_direct(in1503, in0711); // step 3: apply twiddle factors odds1[0] = mul_complex_f32(odds1[0], self.twiddle01); odds3[0] = mul_complex_f32(odds3[0], self.twiddle01conj); odds1[1] = mul_complex_f32(odds1[1], self.twiddle23); odds3[1] = mul_complex_f32(odds3[1], self.twiddle23conj); // step 4: cross FFTs let mut temp0 = parallel_fft2_interleaved_f32(odds1[0], odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); //step 5: copy/add/subtract data back to buffer [ _mm_add_ps(evens[0], temp0[0]), _mm_add_ps(evens[1], temp1[0]), _mm_add_ps(evens[2], temp0[1]), _mm_add_ps(evens[3], temp1[1]), _mm_sub_ps(evens[0], temp0[0]), _mm_sub_ps(evens[1], temp1[0]), _mm_sub_ps(evens[2], temp0[1]), _mm_sub_ps(evens[3], temp1[1]), ] } #[inline(always)] unsafe fn perform_parallel_fft_direct(&self, input: [__m128; 16]) -> [__m128; 16] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf8.perform_parallel_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], ]); let mut odds1 = self .bf4 .perform_parallel_fft_direct(input[1], input[5], input[9], input[13]); let mut odds3 = self .bf4 .perform_parallel_fft_direct(input[15], input[3], input[7], input[11]); // step 3: apply twiddle factors odds1[1] = mul_complex_f32(odds1[1], self.twiddle1); odds3[1] = mul_complex_f32(odds3[1], self.twiddle1c); odds1[2] = mul_complex_f32(odds1[2], self.twiddle2); odds3[2] = mul_complex_f32(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f32(odds1[3], self.twiddle3); odds3[3] = mul_complex_f32(odds3[3], self.twiddle3c); // step 4: cross FFTs let mut temp0 = parallel_fft2_interleaved_f32(odds1[0], 
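// Conjugate-pair split radix recombination (written here for the forward direction; the inverse
// flips the sign of i, which is what the direction-dependent rotate90 takes care of). With
// E = FFT8 of the even samples, O1 = FFT4 of x[4n+1], O3 = FFT4 of x[4n-1], and the twiddles
// already applied in step 3 (O1[k]*W^k, O3[k]*W^-k):
//   X[k]        = E[k]       + (O1[k] + O3[k])
//   X[k + N/4]  = E[k + N/4] - i*(O1[k] - O3[k])
//   X[k + N/2]  = E[k]       - (O1[k] + O3[k])
//   X[k + 3N/4] = E[k + N/4] + i*(O1[k] - O3[k])
// The fft2 calls below compute the sum and difference; rotate_both supplies the factor of i.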
odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); let mut temp2 = parallel_fft2_interleaved_f32(odds1[2], odds3[2]); let mut temp3 = parallel_fft2_interleaved_f32(odds1[3], odds3[3]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); temp2[1] = self.rotate90.rotate_both(temp2[1]); temp3[1] = self.rotate90.rotate_both(temp3[1]); //step 5: copy/add/subtract data back to buffer [ _mm_add_ps(evens[0], temp0[0]), _mm_add_ps(evens[1], temp1[0]), _mm_add_ps(evens[2], temp2[0]), _mm_add_ps(evens[3], temp3[0]), _mm_add_ps(evens[4], temp0[1]), _mm_add_ps(evens[5], temp1[1]), _mm_add_ps(evens[6], temp2[1]), _mm_add_ps(evens[7], temp3[1]), _mm_sub_ps(evens[0], temp0[0]), _mm_sub_ps(evens[1], temp1[0]), _mm_sub_ps(evens[2], temp2[0]), _mm_sub_ps(evens[3], temp3[0]), _mm_sub_ps(evens[4], temp0[1]), _mm_sub_ps(evens[5], temp1[1]), _mm_sub_ps(evens[6], temp2[1]), _mm_sub_ps(evens[7], temp3[1]), ] } } // _ __ __ _ _ _ _ _ // / |/ /_ / /_ | || | | |__ (_) |_ // | | '_ \ _____ | '_ \| || |_| '_ \| | __| // | | (_) | |_____| | (_) |__ _| |_) | | |_ // |_|\___/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly16 { direction: FftDirection, bf4: SseF64Butterfly4, bf8: SseF64Butterfly8, rotate90: Rotate90F64, twiddle1: __m128d, twiddle2: __m128d, twiddle3: __m128d, twiddle1c: __m128d, twiddle2c: __m128d, twiddle3c: __m128d, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly16, 16, |this: &SseF64Butterfly16<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly16, 16, |this: &SseF64Butterfly16<_>| this .direction); impl SseF64Butterfly16 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf8 = SseF64Butterfly8::new(direction); let bf4 = SseF64Butterfly4::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; let twiddle1 = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(1, 16, direction).re as *const f64) }; let twiddle2 = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(2, 16, direction).re as *const f64) }; let twiddle3 = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(3, 16, direction).re as *const f64) }; let twiddle1c = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(1, 16, direction).conj().re as *const f64) }; let twiddle2c = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(2, 16, direction).conj().re as *const f64) }; let twiddle3c = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(3, 16, direction).conj().re as *const f64) }; Self { direction, bf4, bf8, rotate90, twiddle1, twiddle2, twiddle3, twiddle1c, twiddle2c, twiddle3c, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: [__m128d; 16]) -> [__m128d; 16] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf8.perform_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], ]); let mut odds1 = self .bf4 .perform_fft_direct(input[1], input[5], input[9], input[13]); let mut odds3 = self .bf4 
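// odds3 gathers the samples at indices 15, 3, 7, 11, i.e. x[4n - 1] with the index taken mod 16.
// Pairing 4n+1 with 4n-1 (rather than 4n+3) is the "conjugate pair" form of split radix: this
// branch only needs the conjugates twiddle1c..twiddle3c of the odds1 twiddles, instead of a
// separate set of W^(3k) factors.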
.perform_fft_direct(input[15], input[3], input[7], input[11]); // step 3: apply twiddle factors odds1[1] = mul_complex_f64(odds1[1], self.twiddle1); odds3[1] = mul_complex_f64(odds3[1], self.twiddle1c); odds1[2] = mul_complex_f64(odds1[2], self.twiddle2); odds3[2] = mul_complex_f64(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f64(odds1[3], self.twiddle3); odds3[3] = mul_complex_f64(odds3[3], self.twiddle3c); // step 4: cross FFTs let mut temp0 = solo_fft2_f64(odds1[0], odds3[0]); let mut temp1 = solo_fft2_f64(odds1[1], odds3[1]); let mut temp2 = solo_fft2_f64(odds1[2], odds3[2]); let mut temp3 = solo_fft2_f64(odds1[3], odds3[3]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate(temp0[1]); temp1[1] = self.rotate90.rotate(temp1[1]); temp2[1] = self.rotate90.rotate(temp2[1]); temp3[1] = self.rotate90.rotate(temp3[1]); //step 5: copy/add/subtract data back to buffer [ _mm_add_pd(evens[0], temp0[0]), _mm_add_pd(evens[1], temp1[0]), _mm_add_pd(evens[2], temp2[0]), _mm_add_pd(evens[3], temp3[0]), _mm_add_pd(evens[4], temp0[1]), _mm_add_pd(evens[5], temp1[1]), _mm_add_pd(evens[6], temp2[1]), _mm_add_pd(evens[7], temp3[1]), _mm_sub_pd(evens[0], temp0[0]), _mm_sub_pd(evens[1], temp1[0]), _mm_sub_pd(evens[2], temp2[0]), _mm_sub_pd(evens[3], temp3[0]), _mm_sub_pd(evens[4], temp0[1]), _mm_sub_pd(evens[5], temp1[1]), _mm_sub_pd(evens[6], temp2[1]), _mm_sub_pd(evens[7], temp3[1]), ] } } // _________ _________ _ _ _ // |___ /___ \ |___ /___ \| |__ (_) |_ // |_ \ __) | _____ |_ \ __) | '_ \| | __| // ___) / __/ |_____| ___) / __/| |_) | | |_ // |____/_____| |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly32 { direction: FftDirection, bf8: SseF32Butterfly8, bf16: SseF32Butterfly16, rotate90: Rotate90F32, twiddle01: __m128, twiddle23: __m128, twiddle45: __m128, twiddle67: __m128, twiddle01conj: __m128, twiddle23conj: __m128, twiddle45conj: __m128, twiddle67conj: __m128, twiddle1: __m128, twiddle2: __m128, twiddle3: __m128, twiddle4: __m128, twiddle5: __m128, twiddle6: __m128, twiddle7: __m128, twiddle1c: __m128, twiddle2c: __m128, twiddle3c: __m128, twiddle4c: __m128, twiddle5c: __m128, twiddle6c: __m128, twiddle7c: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly32, 32, |this: &SseF32Butterfly32<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly32, 32, |this: &SseF32Butterfly32<_>| this .direction); impl SseF32Butterfly32 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf8 = SseF32Butterfly8::new(direction); let bf16 = SseF32Butterfly16::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; let tw1: Complex = twiddles::compute_twiddle(1, 32, direction); let tw2: Complex = twiddles::compute_twiddle(2, 32, direction); let tw3: Complex = twiddles::compute_twiddle(3, 32, direction); let tw4: Complex = twiddles::compute_twiddle(4, 32, direction); let tw5: Complex = twiddles::compute_twiddle(5, 32, direction); let tw6: Complex = twiddles::compute_twiddle(6, 32, direction); let tw7: Complex = twiddles::compute_twiddle(7, 32, direction); let twiddle01 = unsafe { _mm_set_ps(tw1.im, tw1.re, 0.0, 1.0) }; let twiddle23 = unsafe { _mm_set_ps(tw3.im, tw3.re, tw2.im, tw2.re) }; let twiddle45 = unsafe { _mm_set_ps(tw5.im, tw5.re, tw4.im, tw4.re) }; let twiddle67 = unsafe { _mm_set_ps(tw7.im, tw7.re, tw6.im, tw6.re) }; let twiddle01conj = unsafe { _mm_set_ps(-tw1.im, tw1.re, 0.0, 1.0) }; let twiddle23conj = unsafe { 
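// Two sets of twiddle constants are kept: twiddle01/twiddle23/... pack two consecutive twiddles
// (starting with W^0 = 1) into one __m128 for the single-FFT path, while twiddle1..twiddle7
// duplicate one twiddle across both lanes for the parallel two-FFT path. The *conj variants
// negate the imaginary parts, i.e. they are the W^-k factors applied to the odds3 branch of the
// split radix step.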
_mm_set_ps(-tw3.im, tw3.re, -tw2.im, tw2.re) }; let twiddle45conj = unsafe { _mm_set_ps(-tw5.im, tw5.re, -tw4.im, tw4.re) }; let twiddle67conj = unsafe { _mm_set_ps(-tw7.im, tw7.re, -tw6.im, tw6.re) }; let twiddle1 = unsafe { _mm_set_ps(tw1.im, tw1.re, tw1.im, tw1.re) }; let twiddle2 = unsafe { _mm_set_ps(tw2.im, tw2.re, tw2.im, tw2.re) }; let twiddle3 = unsafe { _mm_set_ps(tw3.im, tw3.re, tw3.im, tw3.re) }; let twiddle4 = unsafe { _mm_set_ps(tw4.im, tw4.re, tw4.im, tw4.re) }; let twiddle5 = unsafe { _mm_set_ps(tw5.im, tw5.re, tw5.im, tw5.re) }; let twiddle6 = unsafe { _mm_set_ps(tw6.im, tw6.re, tw6.im, tw6.re) }; let twiddle7 = unsafe { _mm_set_ps(tw7.im, tw7.re, tw7.im, tw7.re) }; let twiddle1c = unsafe { _mm_set_ps(-tw1.im, tw1.re, -tw1.im, tw1.re) }; let twiddle2c = unsafe { _mm_set_ps(-tw2.im, tw2.re, -tw2.im, tw2.re) }; let twiddle3c = unsafe { _mm_set_ps(-tw3.im, tw3.re, -tw3.im, tw3.re) }; let twiddle4c = unsafe { _mm_set_ps(-tw4.im, tw4.re, -tw4.im, tw4.re) }; let twiddle5c = unsafe { _mm_set_ps(-tw5.im, tw5.re, -tw5.im, tw5.re) }; let twiddle6c = unsafe { _mm_set_ps(-tw6.im, tw6.re, -tw6.im, tw6.re) }; let twiddle7c = unsafe { _mm_set_ps(-tw7.im, tw7.re, -tw7.im, tw7.re) }; Self { direction, bf8, bf16, rotate90, twiddle01, twiddle23, twiddle45, twiddle67, twiddle01conj, twiddle23conj, twiddle45conj, twiddle67conj, twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, twiddle6, twiddle7, twiddle1c, twiddle2c, twiddle3c, twiddle4c, twiddle5c, twiddle6c, twiddle7c, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 }); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62}); let values = interleave_complex_f32!(input_packed, 16, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: [__m128; 16]) -> [__m128; 16] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch let in0002 = extract_lo_lo_f32(input[0], input[1]); let in0406 = extract_lo_lo_f32(input[2], input[3]); let in0810 = extract_lo_lo_f32(input[4], input[5]); let in1214 = extract_lo_lo_f32(input[6], input[7]); let in1618 = extract_lo_lo_f32(input[8], input[9]); let in2022 = extract_lo_lo_f32(input[10], input[11]); let in2426 = extract_lo_lo_f32(input[12], input[13]); let in2830 = extract_lo_lo_f32(input[14], input[15]); let in0105 = extract_hi_hi_f32(input[0], input[2]); let in0913 = extract_hi_hi_f32(input[4], input[6]); let in1721 = extract_hi_hi_f32(input[8], input[10]); let in2529 = extract_hi_hi_f32(input[12], input[14]); let in3103 = extract_hi_hi_f32(input[15], input[1]); let in0711 = extract_hi_hi_f32(input[3], input[5]); let in1519 = extract_hi_hi_f32(input[7], 
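// Each input vector holds complexes 2i and 2i+1, so the lo halves are the even-indexed samples
// and the hi halves the odd ones. The lo_lo extracts build the size-16 even block, while the
// hi_hi extracts gather the indices that are 1 mod 4 (in0105, in0913, ...) and 3 mod 4 starting
// from 31 (in3103, ...) for the two size-8 odd sub-FFTs.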
input[9]); let in2327 = extract_hi_hi_f32(input[11], input[13]); let in_evens = [ in0002, in0406, in0810, in1214, in1618, in2022, in2426, in2830, ]; // step 2: column FFTs let evens = self.bf16.perform_fft_direct(in_evens); let mut odds1 = self .bf8 .perform_fft_direct([in0105, in0913, in1721, in2529]); let mut odds3 = self .bf8 .perform_fft_direct([in3103, in0711, in1519, in2327]); // step 3: apply twiddle factors odds1[0] = mul_complex_f32(odds1[0], self.twiddle01); odds3[0] = mul_complex_f32(odds3[0], self.twiddle01conj); odds1[1] = mul_complex_f32(odds1[1], self.twiddle23); odds3[1] = mul_complex_f32(odds3[1], self.twiddle23conj); odds1[2] = mul_complex_f32(odds1[2], self.twiddle45); odds3[2] = mul_complex_f32(odds3[2], self.twiddle45conj); odds1[3] = mul_complex_f32(odds1[3], self.twiddle67); odds3[3] = mul_complex_f32(odds3[3], self.twiddle67conj); // step 4: cross FFTs let mut temp0 = parallel_fft2_interleaved_f32(odds1[0], odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); let mut temp2 = parallel_fft2_interleaved_f32(odds1[2], odds3[2]); let mut temp3 = parallel_fft2_interleaved_f32(odds1[3], odds3[3]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); temp2[1] = self.rotate90.rotate_both(temp2[1]); temp3[1] = self.rotate90.rotate_both(temp3[1]); //step 5: copy/add/subtract data back to buffer [ _mm_add_ps(evens[0], temp0[0]), _mm_add_ps(evens[1], temp1[0]), _mm_add_ps(evens[2], temp2[0]), _mm_add_ps(evens[3], temp3[0]), _mm_add_ps(evens[4], temp0[1]), _mm_add_ps(evens[5], temp1[1]), _mm_add_ps(evens[6], temp2[1]), _mm_add_ps(evens[7], temp3[1]), _mm_sub_ps(evens[0], temp0[0]), _mm_sub_ps(evens[1], temp1[0]), _mm_sub_ps(evens[2], temp2[0]), _mm_sub_ps(evens[3], temp3[0]), _mm_sub_ps(evens[4], temp0[1]), _mm_sub_ps(evens[5], temp1[1]), _mm_sub_ps(evens[6], temp2[1]), _mm_sub_ps(evens[7], temp3[1]), ] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, input: [__m128; 32]) -> [__m128; 32] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf16.perform_parallel_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], input[16], input[18], input[20], input[22], input[24], input[26], input[28], input[30], ]); let mut odds1 = self.bf8.perform_parallel_fft_direct([ input[1], input[5], input[9], input[13], input[17], input[21], input[25], input[29], ]); let mut odds3 = self.bf8.perform_parallel_fft_direct([ input[31], input[3], input[7], input[11], input[15], input[19], input[23], input[27], ]); // step 3: apply twiddle factors odds1[1] = mul_complex_f32(odds1[1], self.twiddle1); odds3[1] = mul_complex_f32(odds3[1], self.twiddle1c); odds1[2] = mul_complex_f32(odds1[2], self.twiddle2); odds3[2] = mul_complex_f32(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f32(odds1[3], self.twiddle3); odds3[3] = mul_complex_f32(odds3[3], self.twiddle3c); odds1[4] = mul_complex_f32(odds1[4], self.twiddle4); odds3[4] = mul_complex_f32(odds3[4], self.twiddle4c); odds1[5] = mul_complex_f32(odds1[5], self.twiddle5); odds3[5] = mul_complex_f32(odds3[5], self.twiddle5c); odds1[6] = mul_complex_f32(odds1[6], self.twiddle6); odds3[6] = mul_complex_f32(odds3[6], self.twiddle6c); odds1[7] = mul_complex_f32(odds1[7], self.twiddle7); odds3[7] = mul_complex_f32(odds3[7], self.twiddle7c); // step 4: cross FFTs let 
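// As in the size-16 butterfly, the cross FFTs below are plain 2-point butterflies (sum and
// difference); the extra factor of i that the radix-4 recombination needs on the difference term
// is applied afterwards with rotate_both, whose sign was fixed from the FFT direction when
// rotate90 was constructed.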
mut temp0 = parallel_fft2_interleaved_f32(odds1[0], odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); let mut temp2 = parallel_fft2_interleaved_f32(odds1[2], odds3[2]); let mut temp3 = parallel_fft2_interleaved_f32(odds1[3], odds3[3]); let mut temp4 = parallel_fft2_interleaved_f32(odds1[4], odds3[4]); let mut temp5 = parallel_fft2_interleaved_f32(odds1[5], odds3[5]); let mut temp6 = parallel_fft2_interleaved_f32(odds1[6], odds3[6]); let mut temp7 = parallel_fft2_interleaved_f32(odds1[7], odds3[7]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); temp2[1] = self.rotate90.rotate_both(temp2[1]); temp3[1] = self.rotate90.rotate_both(temp3[1]); temp4[1] = self.rotate90.rotate_both(temp4[1]); temp5[1] = self.rotate90.rotate_both(temp5[1]); temp6[1] = self.rotate90.rotate_both(temp6[1]); temp7[1] = self.rotate90.rotate_both(temp7[1]); //step 5: copy/add/subtract data back to buffer [ _mm_add_ps(evens[0], temp0[0]), _mm_add_ps(evens[1], temp1[0]), _mm_add_ps(evens[2], temp2[0]), _mm_add_ps(evens[3], temp3[0]), _mm_add_ps(evens[4], temp4[0]), _mm_add_ps(evens[5], temp5[0]), _mm_add_ps(evens[6], temp6[0]), _mm_add_ps(evens[7], temp7[0]), _mm_add_ps(evens[8], temp0[1]), _mm_add_ps(evens[9], temp1[1]), _mm_add_ps(evens[10], temp2[1]), _mm_add_ps(evens[11], temp3[1]), _mm_add_ps(evens[12], temp4[1]), _mm_add_ps(evens[13], temp5[1]), _mm_add_ps(evens[14], temp6[1]), _mm_add_ps(evens[15], temp7[1]), _mm_sub_ps(evens[0], temp0[0]), _mm_sub_ps(evens[1], temp1[0]), _mm_sub_ps(evens[2], temp2[0]), _mm_sub_ps(evens[3], temp3[0]), _mm_sub_ps(evens[4], temp4[0]), _mm_sub_ps(evens[5], temp5[0]), _mm_sub_ps(evens[6], temp6[0]), _mm_sub_ps(evens[7], temp7[0]), _mm_sub_ps(evens[8], temp0[1]), _mm_sub_ps(evens[9], temp1[1]), _mm_sub_ps(evens[10], temp2[1]), _mm_sub_ps(evens[11], temp3[1]), _mm_sub_ps(evens[12], temp4[1]), _mm_sub_ps(evens[13], temp5[1]), _mm_sub_ps(evens[14], temp6[1]), _mm_sub_ps(evens[15], temp7[1]), ] } } // _________ __ _ _ _ _ _ // |___ /___ \ / /_ | || | | |__ (_) |_ // |_ \ __) | _____ | '_ \| || |_| '_ \| | __| // ___) / __/ |_____| | (_) |__ _| |_) | | |_ // |____/_____| \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly32 { direction: FftDirection, bf8: SseF64Butterfly8, bf16: SseF64Butterfly16, rotate90: Rotate90F64, twiddle1: __m128d, twiddle2: __m128d, twiddle3: __m128d, twiddle4: __m128d, twiddle5: __m128d, twiddle6: __m128d, twiddle7: __m128d, twiddle1c: __m128d, twiddle2c: __m128d, twiddle3c: __m128d, twiddle4c: __m128d, twiddle5c: __m128d, twiddle6c: __m128d, twiddle7c: __m128d, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly32, 32, |this: &SseF64Butterfly32<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly32, 32, |this: &SseF64Butterfly32<_>| this .direction); impl SseF64Butterfly32 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf8 = SseF64Butterfly8::new(direction); let bf16 = SseF64Butterfly16::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; let twiddle1 = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(1, 32, direction).re as *const f64) }; let twiddle2 = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(2, 32, direction).re as *const f64) }; let twiddle3 = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(3, 32, direction).re as *const f64) }; let twiddle4 = unsafe { 
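// num_complex's Complex<f64> is #[repr(C)] with re directly followed by im, so loading two f64s
// starting at &tw.re yields the same [re, im] packing that the smaller f64 butterflies build with
// _mm_set_pd(tw.im, tw.re). The .conj() loads below do the same for the W^-k twiddles used on the
// odds3 branch.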
_mm_loadu_pd(&twiddles::compute_twiddle(4, 32, direction).re as *const f64) }; let twiddle5 = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(5, 32, direction).re as *const f64) }; let twiddle6 = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(6, 32, direction).re as *const f64) }; let twiddle7 = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(7, 32, direction).re as *const f64) }; let twiddle1c = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(1, 32, direction).conj().re as *const f64) }; let twiddle2c = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(2, 32, direction).conj().re as *const f64) }; let twiddle3c = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(3, 32, direction).conj().re as *const f64) }; let twiddle4c = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(4, 32, direction).conj().re as *const f64) }; let twiddle5c = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(5, 32, direction).conj().re as *const f64) }; let twiddle6c = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(6, 32, direction).conj().re as *const f64) }; let twiddle7c = unsafe { _mm_loadu_pd(&twiddles::compute_twiddle(7, 32, direction).conj().re as *const f64) }; Self { direction, bf8, bf16, rotate90, twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, twiddle6, twiddle7, twiddle1c, twiddle2c, twiddle3c, twiddle4c, twiddle5c, twiddle6c, twiddle7c, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: [__m128d; 32]) -> [__m128d; 32] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf16.perform_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], input[16], input[18], input[20], input[22], input[24], input[26], input[28], input[30], ]); let mut odds1 = self.bf8.perform_fft_direct([ input[1], input[5], input[9], input[13], input[17], input[21], input[25], input[29], ]); let mut odds3 = self.bf8.perform_fft_direct([ input[31], input[3], input[7], input[11], input[15], input[19], input[23], input[27], ]); // step 3: apply twiddle factors odds1[1] = mul_complex_f64(odds1[1], self.twiddle1); odds3[1] = mul_complex_f64(odds3[1], self.twiddle1c); odds1[2] = mul_complex_f64(odds1[2], self.twiddle2); odds3[2] = mul_complex_f64(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f64(odds1[3], self.twiddle3); odds3[3] = mul_complex_f64(odds3[3], self.twiddle3c); odds1[4] = mul_complex_f64(odds1[4], self.twiddle4); odds3[4] = mul_complex_f64(odds3[4], self.twiddle4c); odds1[5] = mul_complex_f64(odds1[5], self.twiddle5); odds3[5] = mul_complex_f64(odds3[5], self.twiddle5c); odds1[6] = mul_complex_f64(odds1[6], self.twiddle6); odds3[6] = mul_complex_f64(odds3[6], self.twiddle6c); odds1[7] = mul_complex_f64(odds1[7], self.twiddle7); odds3[7] = mul_complex_f64(odds3[7], self.twiddle7c); // step 4: cross FFTs let mut temp0 = solo_fft2_f64(odds1[0], odds3[0]); let mut temp1 = solo_fft2_f64(odds1[1], odds3[1]); let mut temp2 = solo_fft2_f64(odds1[2], odds3[2]); let mut temp3 = solo_fft2_f64(odds1[3], odds3[3]); let mut temp4 = 
solo_fft2_f64(odds1[4], odds3[4]); let mut temp5 = solo_fft2_f64(odds1[5], odds3[5]); let mut temp6 = solo_fft2_f64(odds1[6], odds3[6]); let mut temp7 = solo_fft2_f64(odds1[7], odds3[7]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate(temp0[1]); temp1[1] = self.rotate90.rotate(temp1[1]); temp2[1] = self.rotate90.rotate(temp2[1]); temp3[1] = self.rotate90.rotate(temp3[1]); temp4[1] = self.rotate90.rotate(temp4[1]); temp5[1] = self.rotate90.rotate(temp5[1]); temp6[1] = self.rotate90.rotate(temp6[1]); temp7[1] = self.rotate90.rotate(temp7[1]); //step 5: copy/add/subtract data back to buffer [ _mm_add_pd(evens[0], temp0[0]), _mm_add_pd(evens[1], temp1[0]), _mm_add_pd(evens[2], temp2[0]), _mm_add_pd(evens[3], temp3[0]), _mm_add_pd(evens[4], temp4[0]), _mm_add_pd(evens[5], temp5[0]), _mm_add_pd(evens[6], temp6[0]), _mm_add_pd(evens[7], temp7[0]), _mm_add_pd(evens[8], temp0[1]), _mm_add_pd(evens[9], temp1[1]), _mm_add_pd(evens[10], temp2[1]), _mm_add_pd(evens[11], temp3[1]), _mm_add_pd(evens[12], temp4[1]), _mm_add_pd(evens[13], temp5[1]), _mm_add_pd(evens[14], temp6[1]), _mm_add_pd(evens[15], temp7[1]), _mm_sub_pd(evens[0], temp0[0]), _mm_sub_pd(evens[1], temp1[0]), _mm_sub_pd(evens[2], temp2[0]), _mm_sub_pd(evens[3], temp3[0]), _mm_sub_pd(evens[4], temp4[0]), _mm_sub_pd(evens[5], temp5[0]), _mm_sub_pd(evens[6], temp6[0]), _mm_sub_pd(evens[7], temp7[0]), _mm_sub_pd(evens[8], temp0[1]), _mm_sub_pd(evens[9], temp1[1]), _mm_sub_pd(evens[10], temp2[1]), _mm_sub_pd(evens[11], temp3[1]), _mm_sub_pd(evens[12], temp4[1]), _mm_sub_pd(evens[13], temp5[1]), _mm_sub_pd(evens[14], temp6[1]), _mm_sub_pd(evens[15], temp7[1]), ] } } #[cfg(test)] mod unit_tests { use super::*; use crate::algorithm::Dft; use crate::test_utils::{check_fft_algorithm, compare_vectors}; //the tests for all butterflies will be identical except for the identifiers used and size //so it's ideal for a macro macro_rules! test_butterfly_32_func { ($test_name:ident, $struct_name:ident, $size:expr) => { #[test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward); check_fft_algorithm::(&butterfly, $size, FftDirection::Forward); let butterfly_direction = $struct_name::new(FftDirection::Inverse); check_fft_algorithm::(&butterfly_direction, $size, FftDirection::Inverse); } }; } test_butterfly_32_func!(test_ssef32_butterfly1, SseF32Butterfly1, 1); test_butterfly_32_func!(test_ssef32_butterfly2, SseF32Butterfly2, 2); test_butterfly_32_func!(test_ssef32_butterfly3, SseF32Butterfly3, 3); test_butterfly_32_func!(test_ssef32_butterfly4, SseF32Butterfly4, 4); test_butterfly_32_func!(test_ssef32_butterfly5, SseF32Butterfly5, 5); test_butterfly_32_func!(test_ssef32_butterfly6, SseF32Butterfly6, 6); test_butterfly_32_func!(test_ssef32_butterfly8, SseF32Butterfly8, 8); test_butterfly_32_func!(test_ssef32_butterfly9, SseF32Butterfly9, 9); test_butterfly_32_func!(test_ssef32_butterfly10, SseF32Butterfly10, 10); test_butterfly_32_func!(test_ssef32_butterfly12, SseF32Butterfly12, 12); test_butterfly_32_func!(test_ssef32_butterfly15, SseF32Butterfly15, 15); test_butterfly_32_func!(test_ssef32_butterfly16, SseF32Butterfly16, 16); test_butterfly_32_func!(test_ssef32_butterfly32, SseF32Butterfly32, 32); //the tests for all butterflies will be identical except for the identifiers used and size //so it's ideal for a macro macro_rules! 
test_butterfly_64_func { ($test_name:ident, $struct_name:ident, $size:expr) => { #[test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward); check_fft_algorithm::(&butterfly, $size, FftDirection::Forward); let butterfly_direction = $struct_name::new(FftDirection::Inverse); check_fft_algorithm::(&butterfly_direction, $size, FftDirection::Inverse); } }; } test_butterfly_64_func!(test_ssef64_butterfly1, SseF64Butterfly1, 1); test_butterfly_64_func!(test_ssef64_butterfly2, SseF64Butterfly2, 2); test_butterfly_64_func!(test_ssef64_butterfly3, SseF64Butterfly3, 3); test_butterfly_64_func!(test_ssef64_butterfly4, SseF64Butterfly4, 4); test_butterfly_64_func!(test_ssef64_butterfly5, SseF64Butterfly5, 5); test_butterfly_64_func!(test_ssef64_butterfly6, SseF64Butterfly6, 6); test_butterfly_64_func!(test_ssef64_butterfly8, SseF64Butterfly8, 8); test_butterfly_64_func!(test_ssef64_butterfly9, SseF64Butterfly9, 9); test_butterfly_64_func!(test_ssef64_butterfly10, SseF64Butterfly10, 10); test_butterfly_64_func!(test_ssef64_butterfly12, SseF64Butterfly12, 12); test_butterfly_64_func!(test_ssef64_butterfly15, SseF64Butterfly15, 15); test_butterfly_64_func!(test_ssef64_butterfly16, SseF64Butterfly16, 16); test_butterfly_64_func!(test_ssef64_butterfly32, SseF64Butterfly32, 32); #[test] fn test_mul_complex_f64() { unsafe { let right = _mm_set_pd(1.0, 2.0); let left = _mm_set_pd(5.0, 7.0); let res = mul_complex_f64(left, right); let expected = _mm_set_pd(2.0 * 5.0 + 1.0 * 7.0, 2.0 * 7.0 - 1.0 * 5.0); assert_eq!( std::mem::transmute::<__m128d, Complex>(res), std::mem::transmute::<__m128d, Complex>(expected) ); } } #[test] fn test_mul_complex_f32() { unsafe { let val1 = Complex::::new(1.0, 2.5); let val2 = Complex::::new(3.2, 4.2); let val3 = Complex::::new(5.6, 6.2); let val4 = Complex::::new(7.4, 8.3); let nbr2 = _mm_set_ps(val4.im, val4.re, val3.im, val3.re); let nbr1 = _mm_set_ps(val2.im, val2.re, val1.im, val1.re); let res = mul_complex_f32(nbr1, nbr2); let res = std::mem::transmute::<__m128, [Complex; 2]>(res); let expected = [val1 * val3, val2 * val4]; assert_eq!(res, expected); } } #[test] fn test_parallel_fft4_32() { unsafe { let val_a1 = Complex::::new(1.0, 2.5); let val_a2 = Complex::::new(3.2, 4.2); let val_a3 = Complex::::new(5.6, 6.2); let val_a4 = Complex::::new(7.4, 8.3); let val_b1 = Complex::::new(6.0, 24.5); let val_b2 = Complex::::new(4.2, 34.2); let val_b3 = Complex::::new(9.6, 61.2); let val_b4 = Complex::::new(17.4, 81.3); let p1 = _mm_set_ps(val_b1.im, val_b1.re, val_a1.im, val_a1.re); let p2 = _mm_set_ps(val_b2.im, val_b2.re, val_a2.im, val_a2.re); let p3 = _mm_set_ps(val_b3.im, val_b3.re, val_a3.im, val_a3.re); let p4 = _mm_set_ps(val_b4.im, val_b4.re, val_a4.im, val_a4.re); let mut val_a = vec![val_a1, val_a2, val_a3, val_a4]; let mut val_b = vec![val_b1, val_b2, val_b3, val_b4]; let dft = Dft::new(4, FftDirection::Forward); let bf4 = SseF32Butterfly4::::new(FftDirection::Forward); dft.process(&mut val_a); dft.process(&mut val_b); let res_both = bf4.perform_parallel_fft_direct(p1, p2, p3, p4); let res = std::mem::transmute::<[__m128; 4], [Complex; 8]>(res_both); let sse_res_a = [res[0], res[2], res[4], res[6]]; let sse_res_b = [res[1], res[3], res[5], res[7]]; assert!(compare_vectors(&val_a, &sse_res_a)); assert!(compare_vectors(&val_b, &sse_res_b)); } } #[test] fn test_pack() { unsafe { let nbr2 = _mm_set_ps(8.0, 7.0, 6.0, 5.0); let nbr1 = _mm_set_ps(4.0, 3.0, 2.0, 1.0); let first = extract_lo_lo_f32(nbr1, nbr2); let second = extract_hi_hi_f32(nbr1, nbr2); let 
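// _mm_set_ps takes its arguments from the highest lane down, so _mm_set_ps(4.0, 3.0, 2.0, 1.0)
// is [1.0, 2.0, 3.0, 4.0] in memory order and transmutes to the complex pair [1+2i, 3+4i].
// That lane ordering is what the expected values in these tests rely on.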
first = std::mem::transmute::<__m128, [Complex; 2]>(first); let second = std::mem::transmute::<__m128, [Complex; 2]>(second); let first_expected = [Complex::new(1.0, 2.0), Complex::new(5.0, 6.0)]; let second_expected = [Complex::new(3.0, 4.0), Complex::new(7.0, 8.0)]; assert_eq!(first, first_expected); assert_eq!(second, second_expected); } } } rustfft-6.2.0/src/sse/sse_common.rs000064400000000000000000000345370072674642500154630ustar 00000000000000use std::any::TypeId; // Calculate the sum of an expression consisting of just plus and minus, like `value = a + b - c + d`. // The expression is rewritten to `value = a + (b - (c - d))` (note the flipped sign on d). // After this the `$add` and `$sub` functions are used to make the calculation. // For f32 using `_mm_add_ps` and `_mm_sub_ps`, the expression `value = a + b - c + d` becomes: // ```let value = _mm_add_ps(a, _mm_sub_ps(b, _mm_sub_ps(c, d)));``` // Only plus and minus are supported, and all the terms must be plain scalar variables. // Using array indices, like `value = temp[0] + temp[1]` is not supported. macro_rules! calc_sum { ($add:ident, $sub:ident, + $acc:tt + $($rest:tt)*)=> { $add($acc, calc_sum!($add, $sub, + $($rest)*)) }; ($add:ident, $sub:ident, + $acc:tt - $($rest:tt)*)=> { $sub($acc, calc_sum!($add, $sub, - $($rest)*)) }; ($add:ident, $sub:ident, - $acc:tt + $($rest:tt)*)=> { $sub($acc, calc_sum!($add, $sub, + $($rest)*)) }; ($add:ident, $sub:ident, - $acc:tt - $($rest:tt)*)=> { $add($acc, calc_sum!($add, $sub, - $($rest)*)) }; ($add:ident, $sub:ident, $acc:tt + $($rest:tt)*)=> { $add($acc, calc_sum!($add, $sub, + $($rest)*)) }; ($add:ident, $sub:ident, $acc:tt - $($rest:tt)*)=> { $sub($acc, calc_sum!($add, $sub, - $($rest)*)) }; ($add:ident, $sub:ident, + $val:tt) => {$val}; ($add:ident, $sub:ident, - $val:tt) => {$val}; } // Calculate the sum of an expression consisting of just plus and minus, like a + b - c + d macro_rules! calc_f32 { ($($tokens:tt)*) => { calc_sum!(_mm_add_ps, _mm_sub_ps, $($tokens)*)}; } // Calculate the sum of an expression consisting of just plus and minus, like a + b - c + d macro_rules! calc_f64 { ($($tokens:tt)*) => { calc_sum!(_mm_add_pd, _mm_sub_pd, $($tokens)*)}; } // Helper function to assert we have the right float type pub fn assert_f32() { let id_f32 = TypeId::of::(); let id_t = TypeId::of::(); assert!(id_t == id_f32, "Wrong float type, must be f32"); } // Helper function to assert we have the right float type pub fn assert_f64() { let id_f64 = TypeId::of::(); let id_t = TypeId::of::(); assert!(id_t == id_f64, "Wrong float type, must be f64"); } // Shuffle elements to interleave two contiguous sets of f32, from an array of simd vectors to a new array of simd vectors macro_rules! interleave_complex_f32 { ($input:ident, $offset:literal, { $($idx:literal),* }) => { [ $( extract_lo_lo_f32($input[$idx], $input[$idx+$offset]), extract_hi_hi_f32($input[$idx], $input[$idx+$offset]), )* ] } } // Shuffle elements to interleave two contiguous sets of f32, from an array of simd vectors to a new array of simd vectors // This statement: // ``` // let values = separate_interleaved_complex_f32!(input, {0, 2, 4}); // ``` // is equivalent to: // ``` // let values = [ // extract_lo_lo_f32(input[0], input[1]), // extract_lo_lo_f32(input[2], input[3]), // extract_lo_lo_f32(input[4], input[5]), // extract_hi_hi_f32(input[0], input[1]), // extract_hi_hi_f32(input[2], input[3]), // extract_hi_hi_f32(input[4], input[5]), // ]; macro_rules! 
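// This is the inverse of interleave_complex_f32!: it takes consecutive pairs of interleaved
// vectors and emits all the low halves first, then all the high halves, turning two transforms
// that were zipped lane by lane back into two contiguous blocks. Note that the $idx list steps
// by two (0, 2, 4, ...).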
separate_interleaved_complex_f32 { ($input:ident, { $($idx:literal),* }) => { [ $( extract_lo_lo_f32($input[$idx], $input[$idx+1]), )* $( extract_hi_hi_f32($input[$idx], $input[$idx+1]), )* ] } } macro_rules! boilerplate_fft_sse_oop { ($struct_name:ident, $len_fn:expr) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if self.len() == 0 { return; } if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let result = unsafe { array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { self.perform_fft_out_of_place(in_chunk, out_chunk, &mut []) }, ) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { if self.len() == 0 { return; } let required_scratch = self.get_inplace_scratch_len(); if scratch.len() < required_scratch || buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = unsafe { array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_out_of_place(chunk, scratch, &mut []); chunk.copy_from_slice(scratch); }) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { self.len() } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len_fn(self) } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } /* Not used now, but maybe later for the mixed radixes etc macro_rules! 
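// Compared to boilerplate_fft_sse_oop above, the variant below threads the caller's scratch
// buffer through to the FFT (its out-of-place scratch length is not hard-coded to zero) and runs
// the in-place path via perform_fft_inplace instead of computing into scratch and copying back.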
boilerplate_sse_fft { ($struct_name:ident, $len_fn:expr, $inplace_scratch_len_fn:expr, $out_of_place_scratch_len_fn:expr) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], scratch: &mut [Complex], ) { if self.len() == 0 { return; } let required_scratch = self.get_outofplace_scratch_len(); if scratch.len() < required_scratch || input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace( self.len(), input.len(), output.len(), self.get_outofplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { self.perform_fft_out_of_place(in_chunk, out_chunk, scratch) }, ); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace( self.len(), input.len(), output.len(), self.get_outofplace_scratch_len(), scratch.len(), ); } } fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { if self.len() == 0 { return; } let required_scratch = self.get_inplace_scratch_len(); if scratch.len() < required_scratch || buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_inplace(chunk, scratch) }); if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { $inplace_scratch_len_fn(self) } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { $out_of_place_scratch_len_fn(self) } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len_fn(self) } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } */ #[cfg(test)] mod unit_tests { use core::arch::x86_64::*; #[test] fn test_calc_f32() { unsafe { let a = _mm_set_ps(1.0, 1.0, 1.0, 1.0); let b = _mm_set_ps(2.0, 2.0, 2.0, 2.0); let c = _mm_set_ps(3.0, 3.0, 3.0, 3.0); let d = _mm_set_ps(4.0, 4.0, 4.0, 4.0); let e = _mm_set_ps(5.0, 5.0, 5.0, 5.0); let f = _mm_set_ps(6.0, 6.0, 6.0, 6.0); let g = _mm_set_ps(7.0, 7.0, 7.0, 7.0); let h = _mm_set_ps(8.0, 8.0, 8.0, 8.0); let i = _mm_set_ps(9.0, 9.0, 9.0, 9.0); let expected: f32 = 1.0 + 2.0 - 3.0 + 4.0 - 5.0 + 6.0 - 7.0 - 8.0 + 9.0; let res = calc_f32!(a + b - c + d - e + f - g - h + i); 
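// As noted in the calc_sum! docs, the call above expands to nested _mm_add_ps/_mm_sub_ps calls,
// beginning with _mm_add_ps(a, _mm_sub_ps(b, _mm_sub_ps(c, ...))), so every lane of `res`
// should end up equal to `expected`.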
let sum = std::mem::transmute::<__m128, [f32; 4]>(res); assert_eq!(sum[0], expected); assert_eq!(sum[1], expected); assert_eq!(sum[2], expected); assert_eq!(sum[3], expected); } } #[test] fn test_calc_f64() { unsafe { let a = _mm_set_pd(1.0, 1.0); let b = _mm_set_pd(2.0, 2.0); let c = _mm_set_pd(3.0, 3.0); let d = _mm_set_pd(4.0, 4.0); let e = _mm_set_pd(5.0, 5.0); let f = _mm_set_pd(6.0, 6.0); let g = _mm_set_pd(7.0, 7.0); let h = _mm_set_pd(8.0, 8.0); let i = _mm_set_pd(9.0, 9.0); let expected: f64 = 1.0 + 2.0 - 3.0 + 4.0 - 5.0 + 6.0 - 7.0 - 8.0 + 9.0; let res = calc_f64!(a + b - c + d - e + f - g - h + i); let sum = std::mem::transmute::<__m128d, [f64; 2]>(res); assert_eq!(sum[0], expected); assert_eq!(sum[1], expected); } } } rustfft-6.2.0/src/sse/sse_planner.rs000064400000000000000000001054050072674642500156230ustar 00000000000000use num_integer::gcd; use std::any::TypeId; use std::collections::HashMap; use std::sync::Arc; use crate::{common::FftNum, fft_cache::FftCache, FftDirection}; use crate::algorithm::*; use crate::sse::sse_butterflies::*; use crate::sse::sse_prime_butterflies::*; use crate::sse::sse_radix4::*; use crate::Fft; use crate::math_utils::{PrimeFactor, PrimeFactors}; const MIN_RADIX4_BITS: u32 = 6; // smallest size to consider radix 4 an option is 2^6 = 64 const MAX_RADER_PRIME_FACTOR: usize = 23; // don't use Raders if the inner fft length has prime factor larger than this const MIN_BLUESTEIN_MIXED_RADIX_LEN: usize = 90; // only use mixed radix for the inner fft of Bluestein if length is larger than this /// A Recipe is a structure that describes the design of a FFT, without actually creating it. /// It is used as a middle step in the planning process. #[derive(Debug, PartialEq, Clone)] pub enum Recipe { Dft(usize), MixedRadix { left_fft: Arc, right_fft: Arc, }, #[allow(dead_code)] GoodThomasAlgorithm { left_fft: Arc, right_fft: Arc, }, MixedRadixSmall { left_fft: Arc, right_fft: Arc, }, GoodThomasAlgorithmSmall { left_fft: Arc, right_fft: Arc, }, RadersAlgorithm { inner_fft: Arc, }, BluesteinsAlgorithm { len: usize, inner_fft: Arc, }, Radix4(usize), Butterfly1, Butterfly2, Butterfly3, Butterfly4, Butterfly5, Butterfly6, Butterfly7, Butterfly8, Butterfly9, Butterfly10, Butterfly11, Butterfly12, Butterfly13, Butterfly15, Butterfly16, Butterfly17, Butterfly19, Butterfly23, Butterfly29, Butterfly31, Butterfly32, } impl Recipe { pub fn len(&self) -> usize { match self { Recipe::Dft(length) => *length, Recipe::Radix4(length) => *length, Recipe::Butterfly1 => 1, Recipe::Butterfly2 => 2, Recipe::Butterfly3 => 3, Recipe::Butterfly4 => 4, Recipe::Butterfly5 => 5, Recipe::Butterfly6 => 6, Recipe::Butterfly7 => 7, Recipe::Butterfly8 => 8, Recipe::Butterfly9 => 9, Recipe::Butterfly10 => 10, Recipe::Butterfly11 => 11, Recipe::Butterfly12 => 12, Recipe::Butterfly13 => 13, Recipe::Butterfly15 => 15, Recipe::Butterfly16 => 16, Recipe::Butterfly17 => 17, Recipe::Butterfly19 => 19, Recipe::Butterfly23 => 23, Recipe::Butterfly29 => 29, Recipe::Butterfly31 => 31, Recipe::Butterfly32 => 32, Recipe::MixedRadix { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::GoodThomasAlgorithm { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::MixedRadixSmall { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::RadersAlgorithm { inner_fft } => inner_fft.len() + 1, Recipe::BluesteinsAlgorithm { len, .. 
} => *len, } } } /// The SSE FFT planner creates new FFT algorithm instances using a mix of scalar and SSE accelerated algorithms. /// It requires at least SSE4.1, which is available on all reasonably recent x86_64 cpus. /// /// RustFFT has several FFT algorithms available. For a given FFT size, the `FftPlannerSse` decides which of the /// available FFT algorithms to use and then initializes them. /// /// ~~~ /// // Perform a forward Fft of size 1234 /// use std::sync::Arc; /// use rustfft::{FftPlannerSse, num_complex::Complex}; /// /// if let Ok(mut planner) = FftPlannerSse::new() { /// let fft = planner.plan_fft_forward(1234); /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234]; /// fft.process(&mut buffer); /// /// // The FFT instance returned by the planner has the type `Arc>`, /// // where T is the numeric type, ie f32 or f64, so it's cheap to clone /// let fft_clone = Arc::clone(&fft); /// } /// ~~~ /// /// If you plan on creating multiple FFT instances, it is recommended to re-use the same planner for all of them. This /// is because the planner re-uses internal data across FFT instances wherever possible, saving memory and reducing /// setup time. (FFT instances created with one planner will never re-use data and buffers with FFT instances created /// by a different planner) /// /// Each FFT instance owns [`Arc`s](std::sync::Arc) to its internal data, rather than borrowing it from the planner, so it's perfectly /// safe to drop the planner after creating Fft instances. pub struct FftPlannerSse { algorithm_cache: FftCache, recipe_cache: HashMap>, } impl FftPlannerSse { /// Creates a new `FftPlannerSse` instance. /// /// Returns `Ok(planner_instance)` if we're compiling for X86_64, SSE support was enabled in feature flags, and the current CPU supports the `sse4.1` CPU feature. /// Returns `Err(())` if SSE support is not available. pub fn new() -> Result { if is_x86_feature_detected!("sse4.1") { // Ideally, we would implement the planner with specialization. // Specialization won't be on stable rust for a long time though, so in the meantime, we can hack around it. // // We use TypeID to determine if T is f32, f64, or neither. If neither, we don't want to do any SSE acceleration // If it's f32 or f64, then construct and return a SSE planner instance. // // All SSE accelerated algorithms come in separate versions for f32 and f64. The type is checked when a new one is created, and if it does not // match the type the FFT is meant for, it will panic. This will never be a problem if using a planner to construct the FFTs. // // An annoying snag with this setup is that we frequently have to transmute buffers from &mut [Complex] to &mut [Complex] or vice versa. // We know this is safe because we assert everywhere that Type(f32 or f64)==Type(T), so it's just a matter of "doing it right" every time. // These transmutes are required because the FFT algorithm's input will come through the FFT trait, which may only be bounded by FftNum. // So the buffers will have the type &mut [Complex]. let id_f32 = TypeId::of::(); let id_f64 = TypeId::of::(); let id_t = TypeId::of::(); if id_t == id_f32 || id_t == id_f64 { return Ok(Self { algorithm_cache: FftCache::new(), recipe_cache: HashMap::new(), }); } } Err(()) } /// Returns a `Fft` instance which uses SSE4.1 instructions to compute FFTs of size `len`. /// /// If the provided `direction` is `FftDirection::Forward`, the returned instance will compute forward FFTs. If it's `FftDirection::Inverse`, it will compute inverse FFTs. 
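///
/// A minimal usage sketch (mirroring the planner example above; the length 1024 and `f32` are
/// arbitrary choices for illustration):
/// ~~~
/// use rustfft::{num_complex::Complex, FftDirection, FftPlannerSse};
///
/// if let Ok(mut planner) = FftPlannerSse::new() {
///     // The direction is baked into the returned instance.
///     let fft = planner.plan_fft(1024, FftDirection::Inverse);
///
///     let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1024];
///     fft.process(&mut buffer);
/// }
/// ~~~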
/// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft(&mut self, len: usize, direction: FftDirection) -> Arc> { // Step 1: Create a "recipe" for this FFT, which will tell us exactly which combination of algorithms to use let recipe = self.design_fft_for_len(len); // Step 2: Use our recipe to construct a Fft trait object self.build_fft(&recipe, direction) } /// Returns a `Fft` instance which uses SSE4.1 instructions to compute forward FFTs of size `len` /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_forward(&mut self, len: usize) -> Arc> { self.plan_fft(len, FftDirection::Forward) } /// Returns a `Fft` instance which uses SSE4.1 instructions to compute inverse FFTs of size `len. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_inverse(&mut self, len: usize) -> Arc> { self.plan_fft(len, FftDirection::Inverse) } // Make a recipe for a length fn design_fft_for_len(&mut self, len: usize) -> Arc { if len < 1 { Arc::new(Recipe::Dft(len)) } else if let Some(recipe) = self.recipe_cache.get(&len) { Arc::clone(&recipe) } else { let factors = PrimeFactors::compute(len); let recipe = self.design_fft_with_factors(len, factors); self.recipe_cache.insert(len, Arc::clone(&recipe)); recipe } } // Create the fft from a recipe, take from cache if possible fn build_fft(&mut self, recipe: &Recipe, direction: FftDirection) -> Arc> { let len = recipe.len(); if let Some(instance) = self.algorithm_cache.get(len, direction) { instance } else { let fft = self.build_new_fft(recipe, direction); self.algorithm_cache.insert(&fft); fft } } // Create a new fft from a recipe fn build_new_fft(&mut self, recipe: &Recipe, direction: FftDirection) -> Arc> { let id_f32 = TypeId::of::(); let id_f64 = TypeId::of::(); let id_t = TypeId::of::(); match recipe { Recipe::Dft(len) => Arc::new(Dft::new(*len, direction)) as Arc>, Recipe::Radix4(len) => { if id_t == id_f32 { Arc::new(Sse32Radix4::new(*len, direction)) as Arc> } else if id_t == id_f64 { Arc::new(Sse64Radix4::new(*len, direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly1 => { if id_t == id_f32 { Arc::new(SseF32Butterfly1::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly1::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly2 => { if id_t == id_f32 { Arc::new(SseF32Butterfly2::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly2::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly3 => { if id_t == id_f32 { Arc::new(SseF32Butterfly3::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly3::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly4 => { if id_t == id_f32 { Arc::new(SseF32Butterfly4::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly4::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly5 => { if id_t == id_f32 { Arc::new(SseF32Butterfly5::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly5::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly6 => { if id_t == id_f32 { Arc::new(SseF32Butterfly6::new(direction)) as Arc> } 
else if id_t == id_f64 { Arc::new(SseF64Butterfly6::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly7 => { if id_t == id_f32 { Arc::new(SseF32Butterfly7::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly7::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly8 => { if id_t == id_f32 { Arc::new(SseF32Butterfly8::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly8::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly9 => { if id_t == id_f32 { Arc::new(SseF32Butterfly9::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly9::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly10 => { if id_t == id_f32 { Arc::new(SseF32Butterfly10::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly10::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly11 => { if id_t == id_f32 { Arc::new(SseF32Butterfly11::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly11::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly12 => { if id_t == id_f32 { Arc::new(SseF32Butterfly12::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly12::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly13 => { if id_t == id_f32 { Arc::new(SseF32Butterfly13::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly13::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly15 => { if id_t == id_f32 { Arc::new(SseF32Butterfly15::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly15::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly16 => { if id_t == id_f32 { Arc::new(SseF32Butterfly16::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly16::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly17 => { if id_t == id_f32 { Arc::new(SseF32Butterfly17::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly17::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly19 => { if id_t == id_f32 { Arc::new(SseF32Butterfly19::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly19::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly23 => { if id_t == id_f32 { Arc::new(SseF32Butterfly23::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly23::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly29 => { if id_t == id_f32 { Arc::new(SseF32Butterfly29::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly29::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly31 => { if id_t == id_f32 { Arc::new(SseF32Butterfly31::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly31::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly32 => { if id_t == id_f32 { Arc::new(SseF32Butterfly32::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(SseF64Butterfly32::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::MixedRadix { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(MixedRadix::new(left_fft, right_fft)) as Arc> } 
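// Like MixedRadix just above, the remaining recipes are composite algorithms: their inner FFTs
// are built recursively through build_fft, then wrapped in the corresponding algorithm from
// crate::algorithm.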
Recipe::GoodThomasAlgorithm { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(GoodThomasAlgorithm::new(left_fft, right_fft)) as Arc> } Recipe::MixedRadixSmall { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(MixedRadixSmall::new(left_fft, right_fft)) as Arc> } Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(GoodThomasAlgorithmSmall::new(left_fft, right_fft)) as Arc> } Recipe::RadersAlgorithm { inner_fft } => { let inner_fft = self.build_fft(&inner_fft, direction); Arc::new(RadersAlgorithm::new(inner_fft)) as Arc> } Recipe::BluesteinsAlgorithm { len, inner_fft } => { let inner_fft = self.build_fft(&inner_fft, direction); Arc::new(BluesteinsAlgorithm::new(*len, inner_fft)) as Arc> } } } fn design_fft_with_factors(&mut self, len: usize, factors: PrimeFactors) -> Arc { if let Some(fft_instance) = self.design_butterfly_algorithm(len) { fft_instance } else if factors.is_prime() { self.design_prime(len) } else if len.trailing_zeros() >= MIN_RADIX4_BITS { if len.is_power_of_two() { Arc::new(Recipe::Radix4(len)) } else { let non_power_of_two = factors .remove_factors(PrimeFactor { value: 2, count: len.trailing_zeros(), }) .unwrap(); let power_of_two = PrimeFactors::compute(1 << len.trailing_zeros()); self.design_mixed_radix(power_of_two, non_power_of_two) } } else { // Can we do this as a mixed radix with just two butterflies? // Loop through and find all combinations // If more than one is found, keep the one where the factors are closer together. // For example length 20 where 10x2 and 5x4 are possible, we use 5x4. let butterflies: [usize; 20] = [ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 19, 23, 29, 31, 32, ]; let mut bf_left = 0; let mut bf_right = 0; // If the length is below 14, or over 1024 we don't need to try this. if len > 13 && len <= 1024 { for (n, bf_l) in butterflies.iter().enumerate() { if len % bf_l == 0 { let bf_r = len / bf_l; if butterflies.iter().skip(n).any(|&m| m == bf_r) { bf_left = *bf_l; bf_right = bf_r; } } } if bf_left > 0 { let fact_l = PrimeFactors::compute(bf_left); let fact_r = PrimeFactors::compute(bf_right); return self.design_mixed_radix(fact_l, fact_r); } } // Not possible with just butterflies, go with the general solution. let (left_factors, right_factors) = factors.partition_factors(); self.design_mixed_radix(left_factors, right_factors) } } fn design_mixed_radix( &mut self, left_factors: PrimeFactors, right_factors: PrimeFactors, ) -> Arc { let left_len = left_factors.get_product(); let right_len = right_factors.get_product(); //neither size is a butterfly, so go with the normal algorithm let left_fft = self.design_fft_with_factors(left_len, left_factors); let right_fft = self.design_fft_with_factors(right_len, right_factors); //if both left_len and right_len are small, use algorithms optimized for small FFTs if left_len < 33 && right_len < 33 { // for small FFTs, if gcd is 1, good-thomas is faster if gcd(left_len, right_len) == 1 { Arc::new(Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, }) } else { Arc::new(Recipe::MixedRadixSmall { left_fft, right_fft, }) } } else { Arc::new(Recipe::MixedRadix { left_fft, right_fft, }) } } // Returns Some(instance) if we have a butterfly available for this size. 
Returns None if there is no butterfly available for this size fn design_butterfly_algorithm(&mut self, len: usize) -> Option> { match len { 1 => Some(Arc::new(Recipe::Butterfly1)), 2 => Some(Arc::new(Recipe::Butterfly2)), 3 => Some(Arc::new(Recipe::Butterfly3)), 4 => Some(Arc::new(Recipe::Butterfly4)), 5 => Some(Arc::new(Recipe::Butterfly5)), 6 => Some(Arc::new(Recipe::Butterfly6)), 7 => Some(Arc::new(Recipe::Butterfly7)), 8 => Some(Arc::new(Recipe::Butterfly8)), 9 => Some(Arc::new(Recipe::Butterfly9)), 10 => Some(Arc::new(Recipe::Butterfly10)), 11 => Some(Arc::new(Recipe::Butterfly11)), 12 => Some(Arc::new(Recipe::Butterfly12)), 13 => Some(Arc::new(Recipe::Butterfly13)), 15 => Some(Arc::new(Recipe::Butterfly15)), 16 => Some(Arc::new(Recipe::Butterfly16)), 17 => Some(Arc::new(Recipe::Butterfly17)), 19 => Some(Arc::new(Recipe::Butterfly19)), 23 => Some(Arc::new(Recipe::Butterfly23)), 29 => Some(Arc::new(Recipe::Butterfly29)), 31 => Some(Arc::new(Recipe::Butterfly31)), 32 => Some(Arc::new(Recipe::Butterfly32)), _ => None, } } fn design_prime(&mut self, len: usize) -> Arc { let inner_fft_len_rader = len - 1; let raders_factors = PrimeFactors::compute(inner_fft_len_rader); // If any of the prime factors is too large, Rader's gets slow and Bluestein's is the better choice if raders_factors .get_other_factors() .iter() .any(|val| val.value > MAX_RADER_PRIME_FACTOR) { let inner_fft_len_pow2 = (2 * len - 1).checked_next_power_of_two().unwrap(); // for long ffts a mixed radix inner fft is faster than a longer radix4 let min_inner_len = 2 * len - 1; let mixed_radix_len = 3 * inner_fft_len_pow2 / 4; let inner_fft = if mixed_radix_len >= min_inner_len && len >= MIN_BLUESTEIN_MIXED_RADIX_LEN { let mixed_radix_factors = PrimeFactors::compute(mixed_radix_len); self.design_fft_with_factors(mixed_radix_len, mixed_radix_factors) } else { Arc::new(Recipe::Radix4(inner_fft_len_pow2)) }; Arc::new(Recipe::BluesteinsAlgorithm { len, inner_fft }) } else { let inner_fft = self.design_fft_with_factors(inner_fft_len_rader, raders_factors); Arc::new(Recipe::RadersAlgorithm { inner_fft }) } } } #[cfg(test)] mod unit_tests { use super::*; fn is_mixedradix(plan: &Recipe) -> bool { match plan { &Recipe::MixedRadix { .. } => true, _ => false, } } fn is_mixedradixsmall(plan: &Recipe) -> bool { match plan { &Recipe::MixedRadixSmall { .. } => true, _ => false, } } fn is_goodthomassmall(plan: &Recipe) -> bool { match plan { &Recipe::GoodThomasAlgorithmSmall { .. } => true, _ => false, } } fn is_raders(plan: &Recipe) -> bool { match plan { &Recipe::RadersAlgorithm { .. } => true, _ => false, } } fn is_bluesteins(plan: &Recipe) -> bool { match plan { &Recipe::BluesteinsAlgorithm { .. 
} => true, _ => false, } } #[test] fn test_plan_sse_trivial() { // Length 0 and 1 should use Dft let mut planner = FftPlannerSse::::new().unwrap(); for len in 0..1 { let plan = planner.design_fft_for_len(len); assert_eq!(*plan, Recipe::Dft(len)); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } #[test] fn test_plan_sse_largepoweroftwo() { // Powers of 2 above 6 should use Radix4 let mut planner = FftPlannerSse::::new().unwrap(); for pow in 6..32 { let len = 1 << pow; let plan = planner.design_fft_for_len(len); assert_eq!(*plan, Recipe::Radix4(len)); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } #[test] fn test_plan_sse_butterflies() { // Check that all butterflies are used let mut planner = FftPlannerSse::::new().unwrap(); assert_eq!(*planner.design_fft_for_len(2), Recipe::Butterfly2); assert_eq!(*planner.design_fft_for_len(3), Recipe::Butterfly3); assert_eq!(*planner.design_fft_for_len(4), Recipe::Butterfly4); assert_eq!(*planner.design_fft_for_len(5), Recipe::Butterfly5); assert_eq!(*planner.design_fft_for_len(6), Recipe::Butterfly6); assert_eq!(*planner.design_fft_for_len(7), Recipe::Butterfly7); assert_eq!(*planner.design_fft_for_len(8), Recipe::Butterfly8); assert_eq!(*planner.design_fft_for_len(9), Recipe::Butterfly9); assert_eq!(*planner.design_fft_for_len(10), Recipe::Butterfly10); assert_eq!(*planner.design_fft_for_len(11), Recipe::Butterfly11); assert_eq!(*planner.design_fft_for_len(12), Recipe::Butterfly12); assert_eq!(*planner.design_fft_for_len(13), Recipe::Butterfly13); assert_eq!(*planner.design_fft_for_len(15), Recipe::Butterfly15); assert_eq!(*planner.design_fft_for_len(16), Recipe::Butterfly16); assert_eq!(*planner.design_fft_for_len(17), Recipe::Butterfly17); assert_eq!(*planner.design_fft_for_len(19), Recipe::Butterfly19); assert_eq!(*planner.design_fft_for_len(23), Recipe::Butterfly23); assert_eq!(*planner.design_fft_for_len(29), Recipe::Butterfly29); assert_eq!(*planner.design_fft_for_len(31), Recipe::Butterfly31); assert_eq!(*planner.design_fft_for_len(32), Recipe::Butterfly32); } #[test] fn test_plan_sse_mixedradix() { // Products of several different primes should become MixedRadix let mut planner = FftPlannerSse::::new().unwrap(); for pow2 in 2..5 { for pow3 in 2..5 { for pow5 in 2..5 { for pow7 in 2..5 { let len = 2usize.pow(pow2) * 3usize.pow(pow3) * 5usize.pow(pow5) * 7usize.pow(pow7); let plan = planner.design_fft_for_len(len); assert!(is_mixedradix(&plan), "Expected MixedRadix, got {:?}", plan); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } } } } #[test] fn test_plan_sse_mixedradixsmall() { // Products of two "small" lengths < 31 that have a common divisor >1, and isn't a power of 2 should be MixedRadixSmall let mut planner = FftPlannerSse::::new().unwrap(); for len in [5 * 20, 5 * 25].iter() { let plan = planner.design_fft_for_len(*len); assert!( is_mixedradixsmall(&plan), "Expected MixedRadixSmall, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[test] fn test_plan_sse_goodthomasbutterfly() { let mut planner = FftPlannerSse::::new().unwrap(); for len in [3 * 7, 5 * 7, 11 * 13, 2 * 29].iter() { let plan = planner.design_fft_for_len(*len); assert!( is_goodthomassmall(&plan), "Expected GoodThomasAlgorithmSmall, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[test] fn test_plan_sse_bluestein_vs_rader() { let difficultprimes: [usize; 11] = [59, 83, 107, 149, 167, 173, 179, 359, 719, 1439, 2879]; let easyprimes: [usize; 24] = [ 53, 61, 67, 
71, 73, 79, 89, 97, 101, 103, 109, 113, 127, 131, 137, 139, 151, 157, 163, 181, 191, 193, 197, 199, ]; let mut planner = FftPlannerSse::::new().unwrap(); for len in difficultprimes.iter() { let plan = planner.design_fft_for_len(*len); assert!( is_bluesteins(&plan), "Expected BluesteinsAlgorithm, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } for len in easyprimes.iter() { let plan = planner.design_fft_for_len(*len); assert!(is_raders(&plan), "Expected RadersAlgorithm, got {:?}", plan); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[test] fn test_sse_fft_cache() { { // Check that FFTs are reused if they're both forward let mut planner = FftPlannerSse::::new().unwrap(); let fft_a = planner.plan_fft(1234, FftDirection::Forward); let fft_b = planner.plan_fft(1234, FftDirection::Forward); assert!(Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was not reused"); } { // Check that FFTs are reused if they're both inverse let mut planner = FftPlannerSse::::new().unwrap(); let fft_a = planner.plan_fft(1234, FftDirection::Inverse); let fft_b = planner.plan_fft(1234, FftDirection::Inverse); assert!(Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was not reused"); } { // Check that FFTs are NOT resued if they don't both have the same direction let mut planner = FftPlannerSse::::new().unwrap(); let fft_a = planner.plan_fft(1234, FftDirection::Forward); let fft_b = planner.plan_fft(1234, FftDirection::Inverse); assert!( !Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was reused, even though directions don't match" ); } } #[test] fn test_sse_recipe_cache() { // Check that all butterflies are used let mut planner = FftPlannerSse::::new().unwrap(); let fft_a = planner.design_fft_for_len(1234); let fft_b = planner.design_fft_for_len(1234); assert!( Arc::ptr_eq(&fft_a, &fft_b), "Existing recipe was not reused" ); } } rustfft-6.2.0/src/sse/sse_prime_butterflies.rs000064400000000000000000012317650072674642500177220ustar 00000000000000use core::arch::x86_64::*; use num_complex::Complex; use crate::{common::FftNum, FftDirection}; use crate::array_utils; use crate::array_utils::workaround_transmute_mut; use crate::array_utils::DoubleBuf; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::twiddles; use crate::{Direction, Fft, Length}; use super::sse_common::{assert_f32, assert_f64}; use super::sse_utils::*; use super::sse_vector::{SseArrayMut}; use super::sse_butterflies::{parallel_fft2_interleaved_f32, solo_fft2_f64}; // Auto-generated prime length butterflies // The code here is mostly autogenerated by the python script tools/gen_sse_butterflies.py // // The algorithm is derived directly from the definition of the DFT, by eliminating any repeated calculations. // See the comments in src/algorithm/butterflies.rs for a detailed description. // // The script generates the code for performing a single f64 fft, as well as dual f32 fft. // It also generates the code for reading and writing the input and output. // The single 32-bit ffts reuse the dual ffts. 
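//
// As a rough scalar sketch of the pattern every generated butterfly below follows (the names
// s, d, a, b and w are just notation for this comment, not identifiers from the generated code):
// for a prime length p, pair the inputs as s[k] = x[k] + x[p-k] and d[k] = x[k] - x[p-k] for
// k in 1..=(p-1)/2, and let w be the twiddle base exp(-2*pi*i/p) (conjugated for inverse FFTs).
// Then
//
//   y[0]   = x[0] + s[1] + s[2] + ... + s[(p-1)/2]
//   a[n]   = x[0] + sum over k of Re(w^(n*k)) * s[k]
//   b[n]   =        sum over k of Im(w^(n*k)) * d[k]
//   y[n]   = a[n] + i*b[n]
//   y[p-n] = a[n] - i*b[n]
//
// In the generated code, the t_a*/t_b* variables correspond to a[n] and b[n], the rotate/rotate_both
// calls apply the multiplication by i, and the final fft2 pairs produce y[n] and y[p-n] together.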
// _____ _________ _ _ _ // |___ | |___ /___ \| |__ (_) |_ // / / _____ |_ \ __) | '_ \| | __| // / / |_____| ___) / __/| |_) | | |_ // /_/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly7 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: __m128, twiddle1im: __m128, twiddle2re: __m128, twiddle2im: __m128, twiddle3re: __m128, twiddle3im: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly7, 7, |this: &SseF32Butterfly7<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly7, 7, |this: &SseF32Butterfly7<_>| this .direction); impl SseF32Butterfly7 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 7, direction); let tw2: Complex = twiddles::compute_twiddle(2, 7, direction); let tw3: Complex = twiddles::compute_twiddle(3, 7, direction); let twiddle1re = unsafe { _mm_set_ps(tw1.re, tw1.re, tw1.re, tw1.re) }; let twiddle1im = unsafe { _mm_set_ps(tw1.im, tw1.im, tw1.im, tw1.im) }; let twiddle2re = unsafe { _mm_set_ps(tw2.re, tw2.re, tw2.re, tw2.re) }; let twiddle2im = unsafe { _mm_set_ps(tw2.im, tw2.im, tw2.im, tw2.im) }; let twiddle3re = unsafe { _mm_set_ps(tw3.re, tw3.re, tw3.re, tw3.re) }; let twiddle3im = unsafe { _mm_set_ps(tw3.im, tw3.im, tw3.im, tw3.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[3]), extract_hi_lo_f32(input_packed[0], input_packed[4]), extract_lo_hi_f32(input_packed[1], input_packed[4]), extract_hi_lo_f32(input_packed[1], input_packed[5]), extract_lo_hi_f32(input_packed[2], input_packed[5]), extract_hi_lo_f32(input_packed[2], input_packed[6]), extract_lo_hi_f32(input_packed[3], input_packed[6]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_hi_f32(out[6], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0,1,2,3,4,5,6}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 7]) -> [__m128; 7] { let [x1p6, x1m6] = parallel_fft2_interleaved_f32(values[1], values[6]); let [x2p5, x2m5] = parallel_fft2_interleaved_f32(values[2], values[5]); let [x3p4, x3m4] = parallel_fft2_interleaved_f32(values[3], values[4]); let t_a1_1 = _mm_mul_ps(self.twiddle1re, x1p6); let t_a1_2 = _mm_mul_ps(self.twiddle2re, x2p5); let t_a1_3 = _mm_mul_ps(self.twiddle3re, x3p4); let t_a2_1 = _mm_mul_ps(self.twiddle2re, x1p6); let t_a2_2 = _mm_mul_ps(self.twiddle3re, x2p5); let t_a2_3 = _mm_mul_ps(self.twiddle1re, x3p4); let t_a3_1 = _mm_mul_ps(self.twiddle3re, x1p6); let t_a3_2 = _mm_mul_ps(self.twiddle1re, x2p5); let t_a3_3 = _mm_mul_ps(self.twiddle2re, x3p4); let t_b1_1 
= _mm_mul_ps(self.twiddle1im, x1m6); let t_b1_2 = _mm_mul_ps(self.twiddle2im, x2m5); let t_b1_3 = _mm_mul_ps(self.twiddle3im, x3m4); let t_b2_1 = _mm_mul_ps(self.twiddle2im, x1m6); let t_b2_2 = _mm_mul_ps(self.twiddle3im, x2m5); let t_b2_3 = _mm_mul_ps(self.twiddle1im, x3m4); let t_b3_1 = _mm_mul_ps(self.twiddle3im, x1m6); let t_b3_2 = _mm_mul_ps(self.twiddle1im, x2m5); let t_b3_3 = _mm_mul_ps(self.twiddle2im, x3m4); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3); let t_b2 = calc_f32!(t_b2_1 - t_b2_2 - t_b2_3); let t_b3 = calc_f32!(t_b3_1 - t_b3_2 + t_b3_3); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let y0 = calc_f32!(x0 + x1p6 + x2p5 + x3p4); let [y1, y6] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y5] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y4] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); [y0, y1, y2, y3, y4, y5, y6] } } // _____ __ _ _ _ _ _ // |___ | / /_ | || | | |__ (_) |_ // / / _____ | '_ \| || |_| '_ \| | __| // / / |_____| | (_) |__ _| |_) | | |_ // /_/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly7 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: __m128d, twiddle1im: __m128d, twiddle2re: __m128d, twiddle2im: __m128d, twiddle3re: __m128d, twiddle3im: __m128d, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly7, 7, |this: &SseF64Butterfly7<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly7, 7, |this: &SseF64Butterfly7<_>| this .direction); impl SseF64Butterfly7 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 7, direction); let tw2: Complex = twiddles::compute_twiddle(2, 7, direction); let tw3: Complex = twiddles::compute_twiddle(3, 7, direction); let twiddle1re = unsafe { _mm_set_pd(tw1.re, tw1.re) }; let twiddle1im = unsafe { _mm_set_pd(tw1.im, tw1.im) }; let twiddle2re = unsafe { _mm_set_pd(tw2.re, tw2.re) }; let twiddle2im = unsafe { _mm_set_pd(tw2.im, tw2.im) }; let twiddle3re = unsafe { _mm_set_pd(tw3.re, tw3.re) }; let twiddle3im = unsafe { _mm_set_pd(tw3.im, tw3.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 7]) -> [__m128d; 7] { let [x1p6, x1m6] = solo_fft2_f64(values[1], values[6]); let [x2p5, x2m5] = solo_fft2_f64(values[2], values[5]); let [x3p4, x3m4] = solo_fft2_f64(values[3], values[4]); let t_a1_1 = _mm_mul_pd(self.twiddle1re, x1p6); let t_a1_2 = _mm_mul_pd(self.twiddle2re, x2p5); let t_a1_3 = _mm_mul_pd(self.twiddle3re, x3p4); let t_a2_1 = _mm_mul_pd(self.twiddle2re, x1p6); let t_a2_2 = _mm_mul_pd(self.twiddle3re, x2p5); let t_a2_3 = _mm_mul_pd(self.twiddle1re, x3p4); let t_a3_1 = _mm_mul_pd(self.twiddle3re, x1p6); let t_a3_2 = _mm_mul_pd(self.twiddle1re, x2p5); let t_a3_3 = _mm_mul_pd(self.twiddle2re, x3p4); let 
t_b1_1 = _mm_mul_pd(self.twiddle1im, x1m6); let t_b1_2 = _mm_mul_pd(self.twiddle2im, x2m5); let t_b1_3 = _mm_mul_pd(self.twiddle3im, x3m4); let t_b2_1 = _mm_mul_pd(self.twiddle2im, x1m6); let t_b2_2 = _mm_mul_pd(self.twiddle3im, x2m5); let t_b2_3 = _mm_mul_pd(self.twiddle1im, x3m4); let t_b3_1 = _mm_mul_pd(self.twiddle3im, x1m6); let t_b3_2 = _mm_mul_pd(self.twiddle1im, x2m5); let t_b3_3 = _mm_mul_pd(self.twiddle2im, x3m4); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3); let t_b2 = calc_f64!(t_b2_1 - t_b2_2 - t_b2_3); let t_b3 = calc_f64!(t_b3_1 - t_b3_2 + t_b3_3); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let y0 = calc_f64!(x0 + x1p6 + x2p5 + x3p4); let [y1, y6] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y5] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y4] = solo_fft2_f64(t_a3, t_b3_rot); [y0, y1, y2, y3, y4, y5, y6] } } // _ _ _________ _ _ _ // / / | |___ /___ \| |__ (_) |_ // | | | _____ |_ \ __) | '_ \| | __| // | | | |_____| ___) / __/| |_) | | |_ // |_|_| |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly11 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: __m128, twiddle1im: __m128, twiddle2re: __m128, twiddle2im: __m128, twiddle3re: __m128, twiddle3im: __m128, twiddle4re: __m128, twiddle4im: __m128, twiddle5re: __m128, twiddle5im: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly11, 11, |this: &SseF32Butterfly11<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly11, 11, |this: &SseF32Butterfly11<_>| this .direction); impl SseF32Butterfly11 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 11, direction); let tw2: Complex = twiddles::compute_twiddle(2, 11, direction); let tw3: Complex = twiddles::compute_twiddle(3, 11, direction); let tw4: Complex = twiddles::compute_twiddle(4, 11, direction); let tw5: Complex = twiddles::compute_twiddle(5, 11, direction); let twiddle1re = unsafe { _mm_load1_ps(&tw1.re) }; let twiddle1im = unsafe { _mm_load1_ps(&tw1.im) }; let twiddle2re = unsafe { _mm_load1_ps(&tw2.re) }; let twiddle2im = unsafe { _mm_load1_ps(&tw2.im) }; let twiddle3re = unsafe { _mm_load1_ps(&tw3.re) }; let twiddle3im = unsafe { _mm_load1_ps(&tw3.im) }; let twiddle4re = unsafe { _mm_load1_ps(&tw4.re) }; let twiddle4im = unsafe { _mm_load1_ps(&tw4.im) }; let twiddle5re = unsafe { _mm_load1_ps(&tw5.re) }; let twiddle5im = unsafe { _mm_load1_ps(&tw5.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[5]), 
extract_hi_lo_f32(input_packed[0], input_packed[6]), extract_lo_hi_f32(input_packed[1], input_packed[6]), extract_hi_lo_f32(input_packed[1], input_packed[7]), extract_lo_hi_f32(input_packed[2], input_packed[7]), extract_hi_lo_f32(input_packed[2], input_packed[8]), extract_lo_hi_f32(input_packed[3], input_packed[8]), extract_hi_lo_f32(input_packed[3], input_packed[9]), extract_lo_hi_f32(input_packed[4], input_packed[9]), extract_hi_lo_f32(input_packed[4], input_packed[10]), extract_lo_hi_f32(input_packed[5], input_packed[10]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_hi_f32(out[10], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 11]) -> [__m128; 11] { let [x1p10, x1m10] = parallel_fft2_interleaved_f32(values[1], values[10]); let [x2p9, x2m9] = parallel_fft2_interleaved_f32(values[2], values[9]); let [x3p8, x3m8] = parallel_fft2_interleaved_f32(values[3], values[8]); let [x4p7, x4m7] = parallel_fft2_interleaved_f32(values[4], values[7]); let [x5p6, x5m6] = parallel_fft2_interleaved_f32(values[5], values[6]); let t_a1_1 = _mm_mul_ps(self.twiddle1re, x1p10); let t_a1_2 = _mm_mul_ps(self.twiddle2re, x2p9); let t_a1_3 = _mm_mul_ps(self.twiddle3re, x3p8); let t_a1_4 = _mm_mul_ps(self.twiddle4re, x4p7); let t_a1_5 = _mm_mul_ps(self.twiddle5re, x5p6); let t_a2_1 = _mm_mul_ps(self.twiddle2re, x1p10); let t_a2_2 = _mm_mul_ps(self.twiddle4re, x2p9); let t_a2_3 = _mm_mul_ps(self.twiddle5re, x3p8); let t_a2_4 = _mm_mul_ps(self.twiddle3re, x4p7); let t_a2_5 = _mm_mul_ps(self.twiddle1re, x5p6); let t_a3_1 = _mm_mul_ps(self.twiddle3re, x1p10); let t_a3_2 = _mm_mul_ps(self.twiddle5re, x2p9); let t_a3_3 = _mm_mul_ps(self.twiddle2re, x3p8); let t_a3_4 = _mm_mul_ps(self.twiddle1re, x4p7); let t_a3_5 = _mm_mul_ps(self.twiddle4re, x5p6); let t_a4_1 = _mm_mul_ps(self.twiddle4re, x1p10); let t_a4_2 = _mm_mul_ps(self.twiddle3re, x2p9); let t_a4_3 = _mm_mul_ps(self.twiddle1re, x3p8); let t_a4_4 = _mm_mul_ps(self.twiddle5re, x4p7); let t_a4_5 = _mm_mul_ps(self.twiddle2re, x5p6); let t_a5_1 = _mm_mul_ps(self.twiddle5re, x1p10); let t_a5_2 = _mm_mul_ps(self.twiddle1re, x2p9); let t_a5_3 = _mm_mul_ps(self.twiddle4re, x3p8); let t_a5_4 = _mm_mul_ps(self.twiddle2re, x4p7); let t_a5_5 = _mm_mul_ps(self.twiddle3re, x5p6); let t_b1_1 = _mm_mul_ps(self.twiddle1im, x1m10); let t_b1_2 = _mm_mul_ps(self.twiddle2im, x2m9); let t_b1_3 = _mm_mul_ps(self.twiddle3im, x3m8); let t_b1_4 = _mm_mul_ps(self.twiddle4im, x4m7); let t_b1_5 = _mm_mul_ps(self.twiddle5im, x5m6); let t_b2_1 = _mm_mul_ps(self.twiddle2im, x1m10); let t_b2_2 = _mm_mul_ps(self.twiddle4im, x2m9); let t_b2_3 = _mm_mul_ps(self.twiddle5im, x3m8); let t_b2_4 = _mm_mul_ps(self.twiddle3im, x4m7); let t_b2_5 = _mm_mul_ps(self.twiddle1im, x5m6); let t_b3_1 = _mm_mul_ps(self.twiddle3im, x1m10); let t_b3_2 = _mm_mul_ps(self.twiddle5im, x2m9); let t_b3_3 = _mm_mul_ps(self.twiddle2im, x3m8); let t_b3_4 = _mm_mul_ps(self.twiddle1im, x4m7); let t_b3_5 = _mm_mul_ps(self.twiddle4im, x5m6); let t_b4_1 = _mm_mul_ps(self.twiddle4im, x1m10); let t_b4_2 
= _mm_mul_ps(self.twiddle3im, x2m9); let t_b4_3 = _mm_mul_ps(self.twiddle1im, x3m8); let t_b4_4 = _mm_mul_ps(self.twiddle5im, x4m7); let t_b4_5 = _mm_mul_ps(self.twiddle2im, x5m6); let t_b5_1 = _mm_mul_ps(self.twiddle5im, x1m10); let t_b5_2 = _mm_mul_ps(self.twiddle1im, x2m9); let t_b5_3 = _mm_mul_ps(self.twiddle4im, x3m8); let t_b5_4 = _mm_mul_ps(self.twiddle2im, x4m7); let t_b5_5 = _mm_mul_ps(self.twiddle3im, x5m6); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 - t_b2_3 - t_b2_4 - t_b2_5); let t_b3 = calc_f32!(t_b3_1 - t_b3_2 - t_b3_3 + t_b3_4 + t_b3_5); let t_b4 = calc_f32!(t_b4_1 - t_b4_2 + t_b4_3 + t_b4_4 - t_b4_5); let t_b5 = calc_f32!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 + t_b5_5); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let y0 = calc_f32!(x0 + x1p10 + x2p9 + x3p8 + x4p7 + x5p6); let [y1, y10] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y9] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y8] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y7] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y6] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10] } } // _ _ __ _ _ _ _ _ // / / | / /_ | || | | |__ (_) |_ // | | | _____ | '_ \| || |_| '_ \| | __| // | | | |_____| | (_) |__ _| |_) | | |_ // |_|_| \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly11 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: __m128d, twiddle1im: __m128d, twiddle2re: __m128d, twiddle2im: __m128d, twiddle3re: __m128d, twiddle3im: __m128d, twiddle4re: __m128d, twiddle4im: __m128d, twiddle5re: __m128d, twiddle5im: __m128d, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly11, 11, |this: &SseF64Butterfly11<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly11, 11, |this: &SseF64Butterfly11<_>| this .direction); impl SseF64Butterfly11 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 11, direction); let tw2: Complex = twiddles::compute_twiddle(2, 11, direction); let tw3: Complex = twiddles::compute_twiddle(3, 11, direction); let tw4: Complex = twiddles::compute_twiddle(4, 11, direction); let tw5: Complex = twiddles::compute_twiddle(5, 11, direction); let twiddle1re = unsafe { _mm_set_pd(tw1.re, tw1.re) }; let twiddle1im = unsafe { _mm_set_pd(tw1.im, tw1.im) }; let twiddle2re = unsafe { _mm_set_pd(tw2.re, tw2.re) }; let twiddle2im = unsafe { _mm_set_pd(tw2.im, tw2.im) }; let twiddle3re = unsafe { _mm_set_pd(tw3.re, tw3.re) }; let twiddle3im = unsafe { _mm_set_pd(tw3.im, tw3.im) }; let twiddle4re = unsafe { _mm_set_pd(tw4.re, tw4.re) }; let twiddle4im = unsafe { _mm_set_pd(tw4.im, tw4.im) }; let twiddle5re = unsafe { _mm_set_pd(tw5.re, tw5.re) }; let twiddle5im = unsafe { _mm_set_pd(tw5.im, tw5.im) }; Self { direction, _phantom: 
std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 11]) -> [__m128d; 11] { let [x1p10, x1m10] = solo_fft2_f64(values[1], values[10]); let [x2p9, x2m9] = solo_fft2_f64(values[2], values[9]); let [x3p8, x3m8] = solo_fft2_f64(values[3], values[8]); let [x4p7, x4m7] = solo_fft2_f64(values[4], values[7]); let [x5p6, x5m6] = solo_fft2_f64(values[5], values[6]); let t_a1_1 = _mm_mul_pd(self.twiddle1re, x1p10); let t_a1_2 = _mm_mul_pd(self.twiddle2re, x2p9); let t_a1_3 = _mm_mul_pd(self.twiddle3re, x3p8); let t_a1_4 = _mm_mul_pd(self.twiddle4re, x4p7); let t_a1_5 = _mm_mul_pd(self.twiddle5re, x5p6); let t_a2_1 = _mm_mul_pd(self.twiddle2re, x1p10); let t_a2_2 = _mm_mul_pd(self.twiddle4re, x2p9); let t_a2_3 = _mm_mul_pd(self.twiddle5re, x3p8); let t_a2_4 = _mm_mul_pd(self.twiddle3re, x4p7); let t_a2_5 = _mm_mul_pd(self.twiddle1re, x5p6); let t_a3_1 = _mm_mul_pd(self.twiddle3re, x1p10); let t_a3_2 = _mm_mul_pd(self.twiddle5re, x2p9); let t_a3_3 = _mm_mul_pd(self.twiddle2re, x3p8); let t_a3_4 = _mm_mul_pd(self.twiddle1re, x4p7); let t_a3_5 = _mm_mul_pd(self.twiddle4re, x5p6); let t_a4_1 = _mm_mul_pd(self.twiddle4re, x1p10); let t_a4_2 = _mm_mul_pd(self.twiddle3re, x2p9); let t_a4_3 = _mm_mul_pd(self.twiddle1re, x3p8); let t_a4_4 = _mm_mul_pd(self.twiddle5re, x4p7); let t_a4_5 = _mm_mul_pd(self.twiddle2re, x5p6); let t_a5_1 = _mm_mul_pd(self.twiddle5re, x1p10); let t_a5_2 = _mm_mul_pd(self.twiddle1re, x2p9); let t_a5_3 = _mm_mul_pd(self.twiddle4re, x3p8); let t_a5_4 = _mm_mul_pd(self.twiddle2re, x4p7); let t_a5_5 = _mm_mul_pd(self.twiddle3re, x5p6); let t_b1_1 = _mm_mul_pd(self.twiddle1im, x1m10); let t_b1_2 = _mm_mul_pd(self.twiddle2im, x2m9); let t_b1_3 = _mm_mul_pd(self.twiddle3im, x3m8); let t_b1_4 = _mm_mul_pd(self.twiddle4im, x4m7); let t_b1_5 = _mm_mul_pd(self.twiddle5im, x5m6); let t_b2_1 = _mm_mul_pd(self.twiddle2im, x1m10); let t_b2_2 = _mm_mul_pd(self.twiddle4im, x2m9); let t_b2_3 = _mm_mul_pd(self.twiddle5im, x3m8); let t_b2_4 = _mm_mul_pd(self.twiddle3im, x4m7); let t_b2_5 = _mm_mul_pd(self.twiddle1im, x5m6); let t_b3_1 = _mm_mul_pd(self.twiddle3im, x1m10); let t_b3_2 = _mm_mul_pd(self.twiddle5im, x2m9); let t_b3_3 = _mm_mul_pd(self.twiddle2im, x3m8); let t_b3_4 = _mm_mul_pd(self.twiddle1im, x4m7); let t_b3_5 = _mm_mul_pd(self.twiddle4im, x5m6); let t_b4_1 = _mm_mul_pd(self.twiddle4im, x1m10); let t_b4_2 = _mm_mul_pd(self.twiddle3im, x2m9); let t_b4_3 = _mm_mul_pd(self.twiddle1im, x3m8); let t_b4_4 = _mm_mul_pd(self.twiddle5im, x4m7); let t_b4_5 = _mm_mul_pd(self.twiddle2im, x5m6); let t_b5_1 = _mm_mul_pd(self.twiddle5im, x1m10); let t_b5_2 = _mm_mul_pd(self.twiddle1im, x2m9); let t_b5_3 = _mm_mul_pd(self.twiddle4im, x3m8); let t_b5_4 = _mm_mul_pd(self.twiddle2im, x4m7); let t_b5_5 = _mm_mul_pd(self.twiddle3im, x5m6); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + 
t_a4_5); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 - t_b2_3 - t_b2_4 - t_b2_5); let t_b3 = calc_f64!(t_b3_1 - t_b3_2 - t_b3_3 + t_b3_4 + t_b3_5); let t_b4 = calc_f64!(t_b4_1 - t_b4_2 + t_b4_3 + t_b4_4 - t_b4_5); let t_b5 = calc_f64!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 + t_b5_5); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let y0 = calc_f64!(x0 + x1p10 + x2p9 + x3p8 + x4p7 + x5p6); let [y1, y10] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y9] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y8] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y7] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y6] = solo_fft2_f64(t_a5, t_b5_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10] } } // _ _____ _________ _ _ _ // / |___ / |___ /___ \| |__ (_) |_ // | | |_ \ _____ |_ \ __) | '_ \| | __| // | |___) | |_____| ___) / __/| |_) | | |_ // |_|____/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly13 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: __m128, twiddle1im: __m128, twiddle2re: __m128, twiddle2im: __m128, twiddle3re: __m128, twiddle3im: __m128, twiddle4re: __m128, twiddle4im: __m128, twiddle5re: __m128, twiddle5im: __m128, twiddle6re: __m128, twiddle6im: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly13, 13, |this: &SseF32Butterfly13<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly13, 13, |this: &SseF32Butterfly13<_>| this .direction); impl SseF32Butterfly13 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 13, direction); let tw2: Complex = twiddles::compute_twiddle(2, 13, direction); let tw3: Complex = twiddles::compute_twiddle(3, 13, direction); let tw4: Complex = twiddles::compute_twiddle(4, 13, direction); let tw5: Complex = twiddles::compute_twiddle(5, 13, direction); let tw6: Complex = twiddles::compute_twiddle(6, 13, direction); let twiddle1re = unsafe { _mm_load1_ps(&tw1.re) }; let twiddle1im = unsafe { _mm_load1_ps(&tw1.im) }; let twiddle2re = unsafe { _mm_load1_ps(&tw2.re) }; let twiddle2im = unsafe { _mm_load1_ps(&tw2.im) }; let twiddle3re = unsafe { _mm_load1_ps(&tw3.re) }; let twiddle3im = unsafe { _mm_load1_ps(&tw3.im) }; let twiddle4re = unsafe { _mm_load1_ps(&tw4.re) }; let twiddle4im = unsafe { _mm_load1_ps(&tw4.im) }; let twiddle5re = unsafe { _mm_load1_ps(&tw5.re) }; let twiddle5im = unsafe { _mm_load1_ps(&tw5.im) }; let twiddle6re = unsafe { _mm_load1_ps(&tw6.re) }; let twiddle6im = unsafe { _mm_load1_ps(&tw6.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = 
read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[6]), extract_hi_lo_f32(input_packed[0], input_packed[7]), extract_lo_hi_f32(input_packed[1], input_packed[7]), extract_hi_lo_f32(input_packed[1], input_packed[8]), extract_lo_hi_f32(input_packed[2], input_packed[8]), extract_hi_lo_f32(input_packed[2], input_packed[9]), extract_lo_hi_f32(input_packed[3], input_packed[9]), extract_hi_lo_f32(input_packed[3], input_packed[10]), extract_lo_hi_f32(input_packed[4], input_packed[10]), extract_hi_lo_f32(input_packed[4], input_packed[11]), extract_lo_hi_f32(input_packed[5], input_packed[11]), extract_hi_lo_f32(input_packed[5], input_packed[12]), extract_lo_hi_f32(input_packed[6], input_packed[12]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_hi_f32(out[12], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 13]) -> [__m128; 13] { let [x1p12, x1m12] = parallel_fft2_interleaved_f32(values[1], values[12]); let [x2p11, x2m11] = parallel_fft2_interleaved_f32(values[2], values[11]); let [x3p10, x3m10] = parallel_fft2_interleaved_f32(values[3], values[10]); let [x4p9, x4m9] = parallel_fft2_interleaved_f32(values[4], values[9]); let [x5p8, x5m8] = parallel_fft2_interleaved_f32(values[5], values[8]); let [x6p7, x6m7] = parallel_fft2_interleaved_f32(values[6], values[7]); let t_a1_1 = _mm_mul_ps(self.twiddle1re, x1p12); let t_a1_2 = _mm_mul_ps(self.twiddle2re, x2p11); let t_a1_3 = _mm_mul_ps(self.twiddle3re, x3p10); let t_a1_4 = _mm_mul_ps(self.twiddle4re, x4p9); let t_a1_5 = _mm_mul_ps(self.twiddle5re, x5p8); let t_a1_6 = _mm_mul_ps(self.twiddle6re, x6p7); let t_a2_1 = _mm_mul_ps(self.twiddle2re, x1p12); let t_a2_2 = _mm_mul_ps(self.twiddle4re, x2p11); let t_a2_3 = _mm_mul_ps(self.twiddle6re, x3p10); let t_a2_4 = _mm_mul_ps(self.twiddle5re, x4p9); let t_a2_5 = _mm_mul_ps(self.twiddle3re, x5p8); let t_a2_6 = _mm_mul_ps(self.twiddle1re, x6p7); let t_a3_1 = _mm_mul_ps(self.twiddle3re, x1p12); let t_a3_2 = _mm_mul_ps(self.twiddle6re, x2p11); let t_a3_3 = _mm_mul_ps(self.twiddle4re, x3p10); let t_a3_4 = _mm_mul_ps(self.twiddle1re, x4p9); let t_a3_5 = _mm_mul_ps(self.twiddle2re, x5p8); let t_a3_6 = _mm_mul_ps(self.twiddle5re, x6p7); let t_a4_1 = _mm_mul_ps(self.twiddle4re, x1p12); let t_a4_2 = _mm_mul_ps(self.twiddle5re, x2p11); let t_a4_3 = _mm_mul_ps(self.twiddle1re, x3p10); let t_a4_4 = _mm_mul_ps(self.twiddle3re, x4p9); let t_a4_5 = _mm_mul_ps(self.twiddle6re, x5p8); let t_a4_6 = _mm_mul_ps(self.twiddle2re, x6p7); let t_a5_1 = _mm_mul_ps(self.twiddle5re, x1p12); let t_a5_2 = _mm_mul_ps(self.twiddle3re, x2p11); let t_a5_3 = _mm_mul_ps(self.twiddle2re, x3p10); let t_a5_4 = _mm_mul_ps(self.twiddle6re, x4p9); let t_a5_5 = _mm_mul_ps(self.twiddle1re, x5p8); let t_a5_6 = _mm_mul_ps(self.twiddle4re, x6p7); let t_a6_1 = _mm_mul_ps(self.twiddle6re, x1p12); let t_a6_2 = _mm_mul_ps(self.twiddle1re, x2p11); let t_a6_3 = 
_mm_mul_ps(self.twiddle5re, x3p10); let t_a6_4 = _mm_mul_ps(self.twiddle2re, x4p9); let t_a6_5 = _mm_mul_ps(self.twiddle4re, x5p8); let t_a6_6 = _mm_mul_ps(self.twiddle3re, x6p7); let t_b1_1 = _mm_mul_ps(self.twiddle1im, x1m12); let t_b1_2 = _mm_mul_ps(self.twiddle2im, x2m11); let t_b1_3 = _mm_mul_ps(self.twiddle3im, x3m10); let t_b1_4 = _mm_mul_ps(self.twiddle4im, x4m9); let t_b1_5 = _mm_mul_ps(self.twiddle5im, x5m8); let t_b1_6 = _mm_mul_ps(self.twiddle6im, x6m7); let t_b2_1 = _mm_mul_ps(self.twiddle2im, x1m12); let t_b2_2 = _mm_mul_ps(self.twiddle4im, x2m11); let t_b2_3 = _mm_mul_ps(self.twiddle6im, x3m10); let t_b2_4 = _mm_mul_ps(self.twiddle5im, x4m9); let t_b2_5 = _mm_mul_ps(self.twiddle3im, x5m8); let t_b2_6 = _mm_mul_ps(self.twiddle1im, x6m7); let t_b3_1 = _mm_mul_ps(self.twiddle3im, x1m12); let t_b3_2 = _mm_mul_ps(self.twiddle6im, x2m11); let t_b3_3 = _mm_mul_ps(self.twiddle4im, x3m10); let t_b3_4 = _mm_mul_ps(self.twiddle1im, x4m9); let t_b3_5 = _mm_mul_ps(self.twiddle2im, x5m8); let t_b3_6 = _mm_mul_ps(self.twiddle5im, x6m7); let t_b4_1 = _mm_mul_ps(self.twiddle4im, x1m12); let t_b4_2 = _mm_mul_ps(self.twiddle5im, x2m11); let t_b4_3 = _mm_mul_ps(self.twiddle1im, x3m10); let t_b4_4 = _mm_mul_ps(self.twiddle3im, x4m9); let t_b4_5 = _mm_mul_ps(self.twiddle6im, x5m8); let t_b4_6 = _mm_mul_ps(self.twiddle2im, x6m7); let t_b5_1 = _mm_mul_ps(self.twiddle5im, x1m12); let t_b5_2 = _mm_mul_ps(self.twiddle3im, x2m11); let t_b5_3 = _mm_mul_ps(self.twiddle2im, x3m10); let t_b5_4 = _mm_mul_ps(self.twiddle6im, x4m9); let t_b5_5 = _mm_mul_ps(self.twiddle1im, x5m8); let t_b5_6 = _mm_mul_ps(self.twiddle4im, x6m7); let t_b6_1 = _mm_mul_ps(self.twiddle6im, x1m12); let t_b6_2 = _mm_mul_ps(self.twiddle1im, x2m11); let t_b6_3 = _mm_mul_ps(self.twiddle5im, x3m10); let t_b6_4 = _mm_mul_ps(self.twiddle2im, x4m9); let t_b6_5 = _mm_mul_ps(self.twiddle4im, x5m8); let t_b6_6 = _mm_mul_ps(self.twiddle3im, x6m7); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 - t_b2_4 - t_b2_5 - t_b2_6); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 + t_b3_5 + t_b3_6); let t_b4 = calc_f32!(t_b4_1 - t_b4_2 - t_b4_3 + t_b4_4 - t_b4_5 - t_b4_6); let t_b5 = calc_f32!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6); let t_b6 = calc_f32!(t_b6_1 - t_b6_2 + t_b6_3 - t_b6_4 + t_b6_5 - t_b6_6); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let y0 = calc_f32!(x0 + x1p12 + x2p11 + x3p10 + x4p9 + x5p8 + x6p7); let [y1, y12] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y11] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y10] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y9] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y8] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y7] = 
parallel_fft2_interleaved_f32(t_a6, t_b6_rot);

        [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12]
    }
}

//   _ _____             __   _  _   _     _ _
//  / |___ /            / /_ | || | | |__ (_) |_
//  | | |_ \   _____   | '_ \| || |_| '_ \| | __|
//  | |___) | |_____|  | (_) |__   _| |_) | | |_
//  |_|____/            \___/   |_| |_.__/|_|\__|
//

pub struct SseF64Butterfly13<T> {
    direction: FftDirection,
    _phantom: std::marker::PhantomData<T>,
    rotate: Rotate90F64,
    twiddle1re: __m128d,
    twiddle1im: __m128d,
    twiddle2re: __m128d,
    twiddle2im: __m128d,
    twiddle3re: __m128d,
    twiddle3im: __m128d,
    twiddle4re: __m128d,
    twiddle4im: __m128d,
    twiddle5re: __m128d,
    twiddle5im: __m128d,
    twiddle6re: __m128d,
    twiddle6im: __m128d,
}

boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly13, 13, |this: &SseF64Butterfly13<_>| this
    .direction);
boilerplate_fft_sse_common_butterfly!(SseF64Butterfly13, 13, |this: &SseF64Butterfly13<_>| this
    .direction);
impl<T: FftNum> SseF64Butterfly13<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        assert_f64::<T>();
        let rotate = Rotate90F64::new(true);
        let tw1: Complex<f64> = twiddles::compute_twiddle(1, 13, direction);
        let tw2: Complex<f64> = twiddles::compute_twiddle(2, 13, direction);
        let tw3: Complex<f64> = twiddles::compute_twiddle(3, 13, direction);
        let tw4: Complex<f64> = twiddles::compute_twiddle(4, 13, direction);
        let tw5: Complex<f64> = twiddles::compute_twiddle(5, 13, direction);
        let tw6: Complex<f64> = twiddles::compute_twiddle(6, 13, direction);
        let twiddle1re = unsafe { _mm_set_pd(tw1.re, tw1.re) };
        let twiddle1im = unsafe { _mm_set_pd(tw1.im, tw1.im) };
        let twiddle2re = unsafe { _mm_set_pd(tw2.re, tw2.re) };
        let twiddle2im = unsafe { _mm_set_pd(tw2.im, tw2.im) };
        let twiddle3re = unsafe { _mm_set_pd(tw3.re, tw3.re) };
        let twiddle3im = unsafe { _mm_set_pd(tw3.im, tw3.im) };
        let twiddle4re = unsafe { _mm_set_pd(tw4.re, tw4.re) };
        let twiddle4im = unsafe { _mm_set_pd(tw4.im, tw4.im) };
        let twiddle5re = unsafe { _mm_set_pd(tw5.re, tw5.re) };
        let twiddle5im = unsafe { _mm_set_pd(tw5.im, tw5.im) };
        let twiddle6re = unsafe { _mm_set_pd(tw6.re, tw6.re) };
        let twiddle6im = unsafe { _mm_set_pd(tw6.im, tw6.im) };
        Self {
            direction,
            _phantom: std::marker::PhantomData,
            rotate,
            twiddle1re,
            twiddle1im,
            twiddle2re,
            twiddle2im,
            twiddle3re,
            twiddle3im,
            twiddle4re,
            twiddle4im,
            twiddle5re,
            twiddle5im,
            twiddle6re,
            twiddle6im,
        }
    }

    #[inline(always)]
    pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut<f64>) {
        let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12});
        let out = self.perform_fft_direct(values);
        write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12});
    }

    #[inline(always)]
    pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 13]) -> [__m128d; 13] {
        let [x1p12, x1m12] = solo_fft2_f64(values[1], values[12]);
        let [x2p11, x2m11] = solo_fft2_f64(values[2], values[11]);
        let [x3p10, x3m10] = solo_fft2_f64(values[3], values[10]);
        let [x4p9, x4m9] = solo_fft2_f64(values[4], values[9]);
        let [x5p8, x5m8] = solo_fft2_f64(values[5], values[8]);
        let [x6p7, x6m7] = solo_fft2_f64(values[6], values[7]);

        let t_a1_1 = _mm_mul_pd(self.twiddle1re, x1p12);
        let t_a1_2 = _mm_mul_pd(self.twiddle2re, x2p11);
        let t_a1_3 = _mm_mul_pd(self.twiddle3re, x3p10);
        let t_a1_4 = _mm_mul_pd(self.twiddle4re, x4p9);
        let t_a1_5 = _mm_mul_pd(self.twiddle5re, x5p8);
        let t_a1_6 = _mm_mul_pd(self.twiddle6re, x6p7);
        let t_a2_1 = _mm_mul_pd(self.twiddle2re, x1p12);
        let t_a2_2 = _mm_mul_pd(self.twiddle4re, x2p11);
        let t_a2_3 = _mm_mul_pd(self.twiddle6re, x3p10);
        let t_a2_4 = _mm_mul_pd(self.twiddle5re, x4p9);
        let t_a2_5 =
_mm_mul_pd(self.twiddle3re, x5p8); let t_a2_6 = _mm_mul_pd(self.twiddle1re, x6p7); let t_a3_1 = _mm_mul_pd(self.twiddle3re, x1p12); let t_a3_2 = _mm_mul_pd(self.twiddle6re, x2p11); let t_a3_3 = _mm_mul_pd(self.twiddle4re, x3p10); let t_a3_4 = _mm_mul_pd(self.twiddle1re, x4p9); let t_a3_5 = _mm_mul_pd(self.twiddle2re, x5p8); let t_a3_6 = _mm_mul_pd(self.twiddle5re, x6p7); let t_a4_1 = _mm_mul_pd(self.twiddle4re, x1p12); let t_a4_2 = _mm_mul_pd(self.twiddle5re, x2p11); let t_a4_3 = _mm_mul_pd(self.twiddle1re, x3p10); let t_a4_4 = _mm_mul_pd(self.twiddle3re, x4p9); let t_a4_5 = _mm_mul_pd(self.twiddle6re, x5p8); let t_a4_6 = _mm_mul_pd(self.twiddle2re, x6p7); let t_a5_1 = _mm_mul_pd(self.twiddle5re, x1p12); let t_a5_2 = _mm_mul_pd(self.twiddle3re, x2p11); let t_a5_3 = _mm_mul_pd(self.twiddle2re, x3p10); let t_a5_4 = _mm_mul_pd(self.twiddle6re, x4p9); let t_a5_5 = _mm_mul_pd(self.twiddle1re, x5p8); let t_a5_6 = _mm_mul_pd(self.twiddle4re, x6p7); let t_a6_1 = _mm_mul_pd(self.twiddle6re, x1p12); let t_a6_2 = _mm_mul_pd(self.twiddle1re, x2p11); let t_a6_3 = _mm_mul_pd(self.twiddle5re, x3p10); let t_a6_4 = _mm_mul_pd(self.twiddle2re, x4p9); let t_a6_5 = _mm_mul_pd(self.twiddle4re, x5p8); let t_a6_6 = _mm_mul_pd(self.twiddle3re, x6p7); let t_b1_1 = _mm_mul_pd(self.twiddle1im, x1m12); let t_b1_2 = _mm_mul_pd(self.twiddle2im, x2m11); let t_b1_3 = _mm_mul_pd(self.twiddle3im, x3m10); let t_b1_4 = _mm_mul_pd(self.twiddle4im, x4m9); let t_b1_5 = _mm_mul_pd(self.twiddle5im, x5m8); let t_b1_6 = _mm_mul_pd(self.twiddle6im, x6m7); let t_b2_1 = _mm_mul_pd(self.twiddle2im, x1m12); let t_b2_2 = _mm_mul_pd(self.twiddle4im, x2m11); let t_b2_3 = _mm_mul_pd(self.twiddle6im, x3m10); let t_b2_4 = _mm_mul_pd(self.twiddle5im, x4m9); let t_b2_5 = _mm_mul_pd(self.twiddle3im, x5m8); let t_b2_6 = _mm_mul_pd(self.twiddle1im, x6m7); let t_b3_1 = _mm_mul_pd(self.twiddle3im, x1m12); let t_b3_2 = _mm_mul_pd(self.twiddle6im, x2m11); let t_b3_3 = _mm_mul_pd(self.twiddle4im, x3m10); let t_b3_4 = _mm_mul_pd(self.twiddle1im, x4m9); let t_b3_5 = _mm_mul_pd(self.twiddle2im, x5m8); let t_b3_6 = _mm_mul_pd(self.twiddle5im, x6m7); let t_b4_1 = _mm_mul_pd(self.twiddle4im, x1m12); let t_b4_2 = _mm_mul_pd(self.twiddle5im, x2m11); let t_b4_3 = _mm_mul_pd(self.twiddle1im, x3m10); let t_b4_4 = _mm_mul_pd(self.twiddle3im, x4m9); let t_b4_5 = _mm_mul_pd(self.twiddle6im, x5m8); let t_b4_6 = _mm_mul_pd(self.twiddle2im, x6m7); let t_b5_1 = _mm_mul_pd(self.twiddle5im, x1m12); let t_b5_2 = _mm_mul_pd(self.twiddle3im, x2m11); let t_b5_3 = _mm_mul_pd(self.twiddle2im, x3m10); let t_b5_4 = _mm_mul_pd(self.twiddle6im, x4m9); let t_b5_5 = _mm_mul_pd(self.twiddle1im, x5m8); let t_b5_6 = _mm_mul_pd(self.twiddle4im, x6m7); let t_b6_1 = _mm_mul_pd(self.twiddle6im, x1m12); let t_b6_2 = _mm_mul_pd(self.twiddle1im, x2m11); let t_b6_3 = _mm_mul_pd(self.twiddle5im, x3m10); let t_b6_4 = _mm_mul_pd(self.twiddle2im, x4m9); let t_b6_5 = _mm_mul_pd(self.twiddle4im, x5m8); let t_b6_6 = _mm_mul_pd(self.twiddle3im, x6m7); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + 
t_b1_5 + t_b1_6); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 - t_b2_4 - t_b2_5 - t_b2_6); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 + t_b3_5 + t_b3_6); let t_b4 = calc_f64!(t_b4_1 - t_b4_2 - t_b4_3 + t_b4_4 - t_b4_5 - t_b4_6); let t_b5 = calc_f64!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6); let t_b6 = calc_f64!(t_b6_1 - t_b6_2 + t_b6_3 - t_b6_4 + t_b6_5 - t_b6_6); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let y0 = calc_f64!(x0 + x1p12 + x2p11 + x3p10 + x4p9 + x5p8 + x6p7); let [y1, y12] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y11] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y10] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y9] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y8] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y7] = solo_fft2_f64(t_a6, t_b6_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12] } } // _ _____ _________ _ _ _ // / |___ | |___ /___ \| |__ (_) |_ // | | / / _____ |_ \ __) | '_ \| | __| // | | / / |_____| ___) / __/| |_) | | |_ // |_|/_/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly17 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: __m128, twiddle1im: __m128, twiddle2re: __m128, twiddle2im: __m128, twiddle3re: __m128, twiddle3im: __m128, twiddle4re: __m128, twiddle4im: __m128, twiddle5re: __m128, twiddle5im: __m128, twiddle6re: __m128, twiddle6im: __m128, twiddle7re: __m128, twiddle7im: __m128, twiddle8re: __m128, twiddle8im: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly17, 17, |this: &SseF32Butterfly17<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly17, 17, |this: &SseF32Butterfly17<_>| this .direction); impl SseF32Butterfly17 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 17, direction); let tw2: Complex = twiddles::compute_twiddle(2, 17, direction); let tw3: Complex = twiddles::compute_twiddle(3, 17, direction); let tw4: Complex = twiddles::compute_twiddle(4, 17, direction); let tw5: Complex = twiddles::compute_twiddle(5, 17, direction); let tw6: Complex = twiddles::compute_twiddle(6, 17, direction); let tw7: Complex = twiddles::compute_twiddle(7, 17, direction); let tw8: Complex = twiddles::compute_twiddle(8, 17, direction); let twiddle1re = unsafe { _mm_load1_ps(&tw1.re) }; let twiddle1im = unsafe { _mm_load1_ps(&tw1.im) }; let twiddle2re = unsafe { _mm_load1_ps(&tw2.re) }; let twiddle2im = unsafe { _mm_load1_ps(&tw2.im) }; let twiddle3re = unsafe { _mm_load1_ps(&tw3.re) }; let twiddle3im = unsafe { _mm_load1_ps(&tw3.im) }; let twiddle4re = unsafe { _mm_load1_ps(&tw4.re) }; let twiddle4im = unsafe { _mm_load1_ps(&tw4.im) }; let twiddle5re = unsafe { _mm_load1_ps(&tw5.re) }; let twiddle5im = unsafe { _mm_load1_ps(&tw5.im) }; let twiddle6re = unsafe { _mm_load1_ps(&tw6.re) }; let twiddle6im = unsafe { _mm_load1_ps(&tw6.im) }; let twiddle7re = unsafe { _mm_load1_ps(&tw7.re) }; let twiddle7im = unsafe { _mm_load1_ps(&tw7.im) }; let twiddle8re = unsafe { _mm_load1_ps(&tw8.re) }; let twiddle8im = unsafe { _mm_load1_ps(&tw8.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, 
twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[8]), extract_hi_lo_f32(input_packed[0], input_packed[9]), extract_lo_hi_f32(input_packed[1], input_packed[9]), extract_hi_lo_f32(input_packed[1], input_packed[10]), extract_lo_hi_f32(input_packed[2], input_packed[10]), extract_hi_lo_f32(input_packed[2], input_packed[11]), extract_lo_hi_f32(input_packed[3], input_packed[11]), extract_hi_lo_f32(input_packed[3], input_packed[12]), extract_lo_hi_f32(input_packed[4], input_packed[12]), extract_hi_lo_f32(input_packed[4], input_packed[13]), extract_lo_hi_f32(input_packed[5], input_packed[13]), extract_hi_lo_f32(input_packed[5], input_packed[14]), extract_lo_hi_f32(input_packed[6], input_packed[14]), extract_hi_lo_f32(input_packed[6], input_packed[15]), extract_lo_hi_f32(input_packed[7], input_packed[15]), extract_hi_lo_f32(input_packed[7], input_packed[16]), extract_lo_hi_f32(input_packed[8], input_packed[16]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_hi_f32(out[16], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 17]) -> [__m128; 17] { let [x1p16, x1m16] = parallel_fft2_interleaved_f32(values[1], values[16]); let [x2p15, x2m15] = parallel_fft2_interleaved_f32(values[2], values[15]); let [x3p14, x3m14] = parallel_fft2_interleaved_f32(values[3], values[14]); let [x4p13, x4m13] = parallel_fft2_interleaved_f32(values[4], values[13]); let [x5p12, x5m12] = parallel_fft2_interleaved_f32(values[5], values[12]); let [x6p11, x6m11] = parallel_fft2_interleaved_f32(values[6], values[11]); let [x7p10, x7m10] = parallel_fft2_interleaved_f32(values[7], values[10]); let [x8p9, x8m9] = parallel_fft2_interleaved_f32(values[8], values[9]); let t_a1_1 = _mm_mul_ps(self.twiddle1re, x1p16); let t_a1_2 = _mm_mul_ps(self.twiddle2re, x2p15); let t_a1_3 = _mm_mul_ps(self.twiddle3re, x3p14); let t_a1_4 = _mm_mul_ps(self.twiddle4re, x4p13); let t_a1_5 = _mm_mul_ps(self.twiddle5re, x5p12); let t_a1_6 = _mm_mul_ps(self.twiddle6re, x6p11); let t_a1_7 = _mm_mul_ps(self.twiddle7re, x7p10); let t_a1_8 = _mm_mul_ps(self.twiddle8re, x8p9); let t_a2_1 = _mm_mul_ps(self.twiddle2re, x1p16); 
let t_a2_2 = _mm_mul_ps(self.twiddle4re, x2p15); let t_a2_3 = _mm_mul_ps(self.twiddle6re, x3p14); let t_a2_4 = _mm_mul_ps(self.twiddle8re, x4p13); let t_a2_5 = _mm_mul_ps(self.twiddle7re, x5p12); let t_a2_6 = _mm_mul_ps(self.twiddle5re, x6p11); let t_a2_7 = _mm_mul_ps(self.twiddle3re, x7p10); let t_a2_8 = _mm_mul_ps(self.twiddle1re, x8p9); let t_a3_1 = _mm_mul_ps(self.twiddle3re, x1p16); let t_a3_2 = _mm_mul_ps(self.twiddle6re, x2p15); let t_a3_3 = _mm_mul_ps(self.twiddle8re, x3p14); let t_a3_4 = _mm_mul_ps(self.twiddle5re, x4p13); let t_a3_5 = _mm_mul_ps(self.twiddle2re, x5p12); let t_a3_6 = _mm_mul_ps(self.twiddle1re, x6p11); let t_a3_7 = _mm_mul_ps(self.twiddle4re, x7p10); let t_a3_8 = _mm_mul_ps(self.twiddle7re, x8p9); let t_a4_1 = _mm_mul_ps(self.twiddle4re, x1p16); let t_a4_2 = _mm_mul_ps(self.twiddle8re, x2p15); let t_a4_3 = _mm_mul_ps(self.twiddle5re, x3p14); let t_a4_4 = _mm_mul_ps(self.twiddle1re, x4p13); let t_a4_5 = _mm_mul_ps(self.twiddle3re, x5p12); let t_a4_6 = _mm_mul_ps(self.twiddle7re, x6p11); let t_a4_7 = _mm_mul_ps(self.twiddle6re, x7p10); let t_a4_8 = _mm_mul_ps(self.twiddle2re, x8p9); let t_a5_1 = _mm_mul_ps(self.twiddle5re, x1p16); let t_a5_2 = _mm_mul_ps(self.twiddle7re, x2p15); let t_a5_3 = _mm_mul_ps(self.twiddle2re, x3p14); let t_a5_4 = _mm_mul_ps(self.twiddle3re, x4p13); let t_a5_5 = _mm_mul_ps(self.twiddle8re, x5p12); let t_a5_6 = _mm_mul_ps(self.twiddle4re, x6p11); let t_a5_7 = _mm_mul_ps(self.twiddle1re, x7p10); let t_a5_8 = _mm_mul_ps(self.twiddle6re, x8p9); let t_a6_1 = _mm_mul_ps(self.twiddle6re, x1p16); let t_a6_2 = _mm_mul_ps(self.twiddle5re, x2p15); let t_a6_3 = _mm_mul_ps(self.twiddle1re, x3p14); let t_a6_4 = _mm_mul_ps(self.twiddle7re, x4p13); let t_a6_5 = _mm_mul_ps(self.twiddle4re, x5p12); let t_a6_6 = _mm_mul_ps(self.twiddle2re, x6p11); let t_a6_7 = _mm_mul_ps(self.twiddle8re, x7p10); let t_a6_8 = _mm_mul_ps(self.twiddle3re, x8p9); let t_a7_1 = _mm_mul_ps(self.twiddle7re, x1p16); let t_a7_2 = _mm_mul_ps(self.twiddle3re, x2p15); let t_a7_3 = _mm_mul_ps(self.twiddle4re, x3p14); let t_a7_4 = _mm_mul_ps(self.twiddle6re, x4p13); let t_a7_5 = _mm_mul_ps(self.twiddle1re, x5p12); let t_a7_6 = _mm_mul_ps(self.twiddle8re, x6p11); let t_a7_7 = _mm_mul_ps(self.twiddle2re, x7p10); let t_a7_8 = _mm_mul_ps(self.twiddle5re, x8p9); let t_a8_1 = _mm_mul_ps(self.twiddle8re, x1p16); let t_a8_2 = _mm_mul_ps(self.twiddle1re, x2p15); let t_a8_3 = _mm_mul_ps(self.twiddle7re, x3p14); let t_a8_4 = _mm_mul_ps(self.twiddle2re, x4p13); let t_a8_5 = _mm_mul_ps(self.twiddle6re, x5p12); let t_a8_6 = _mm_mul_ps(self.twiddle3re, x6p11); let t_a8_7 = _mm_mul_ps(self.twiddle5re, x7p10); let t_a8_8 = _mm_mul_ps(self.twiddle4re, x8p9); let t_b1_1 = _mm_mul_ps(self.twiddle1im, x1m16); let t_b1_2 = _mm_mul_ps(self.twiddle2im, x2m15); let t_b1_3 = _mm_mul_ps(self.twiddle3im, x3m14); let t_b1_4 = _mm_mul_ps(self.twiddle4im, x4m13); let t_b1_5 = _mm_mul_ps(self.twiddle5im, x5m12); let t_b1_6 = _mm_mul_ps(self.twiddle6im, x6m11); let t_b1_7 = _mm_mul_ps(self.twiddle7im, x7m10); let t_b1_8 = _mm_mul_ps(self.twiddle8im, x8m9); let t_b2_1 = _mm_mul_ps(self.twiddle2im, x1m16); let t_b2_2 = _mm_mul_ps(self.twiddle4im, x2m15); let t_b2_3 = _mm_mul_ps(self.twiddle6im, x3m14); let t_b2_4 = _mm_mul_ps(self.twiddle8im, x4m13); let t_b2_5 = _mm_mul_ps(self.twiddle7im, x5m12); let t_b2_6 = _mm_mul_ps(self.twiddle5im, x6m11); let t_b2_7 = _mm_mul_ps(self.twiddle3im, x7m10); let t_b2_8 = _mm_mul_ps(self.twiddle1im, x8m9); let t_b3_1 = _mm_mul_ps(self.twiddle3im, x1m16); let t_b3_2 = 
_mm_mul_ps(self.twiddle6im, x2m15); let t_b3_3 = _mm_mul_ps(self.twiddle8im, x3m14); let t_b3_4 = _mm_mul_ps(self.twiddle5im, x4m13); let t_b3_5 = _mm_mul_ps(self.twiddle2im, x5m12); let t_b3_6 = _mm_mul_ps(self.twiddle1im, x6m11); let t_b3_7 = _mm_mul_ps(self.twiddle4im, x7m10); let t_b3_8 = _mm_mul_ps(self.twiddle7im, x8m9); let t_b4_1 = _mm_mul_ps(self.twiddle4im, x1m16); let t_b4_2 = _mm_mul_ps(self.twiddle8im, x2m15); let t_b4_3 = _mm_mul_ps(self.twiddle5im, x3m14); let t_b4_4 = _mm_mul_ps(self.twiddle1im, x4m13); let t_b4_5 = _mm_mul_ps(self.twiddle3im, x5m12); let t_b4_6 = _mm_mul_ps(self.twiddle7im, x6m11); let t_b4_7 = _mm_mul_ps(self.twiddle6im, x7m10); let t_b4_8 = _mm_mul_ps(self.twiddle2im, x8m9); let t_b5_1 = _mm_mul_ps(self.twiddle5im, x1m16); let t_b5_2 = _mm_mul_ps(self.twiddle7im, x2m15); let t_b5_3 = _mm_mul_ps(self.twiddle2im, x3m14); let t_b5_4 = _mm_mul_ps(self.twiddle3im, x4m13); let t_b5_5 = _mm_mul_ps(self.twiddle8im, x5m12); let t_b5_6 = _mm_mul_ps(self.twiddle4im, x6m11); let t_b5_7 = _mm_mul_ps(self.twiddle1im, x7m10); let t_b5_8 = _mm_mul_ps(self.twiddle6im, x8m9); let t_b6_1 = _mm_mul_ps(self.twiddle6im, x1m16); let t_b6_2 = _mm_mul_ps(self.twiddle5im, x2m15); let t_b6_3 = _mm_mul_ps(self.twiddle1im, x3m14); let t_b6_4 = _mm_mul_ps(self.twiddle7im, x4m13); let t_b6_5 = _mm_mul_ps(self.twiddle4im, x5m12); let t_b6_6 = _mm_mul_ps(self.twiddle2im, x6m11); let t_b6_7 = _mm_mul_ps(self.twiddle8im, x7m10); let t_b6_8 = _mm_mul_ps(self.twiddle3im, x8m9); let t_b7_1 = _mm_mul_ps(self.twiddle7im, x1m16); let t_b7_2 = _mm_mul_ps(self.twiddle3im, x2m15); let t_b7_3 = _mm_mul_ps(self.twiddle4im, x3m14); let t_b7_4 = _mm_mul_ps(self.twiddle6im, x4m13); let t_b7_5 = _mm_mul_ps(self.twiddle1im, x5m12); let t_b7_6 = _mm_mul_ps(self.twiddle8im, x6m11); let t_b7_7 = _mm_mul_ps(self.twiddle2im, x7m10); let t_b7_8 = _mm_mul_ps(self.twiddle5im, x8m9); let t_b8_1 = _mm_mul_ps(self.twiddle8im, x1m16); let t_b8_2 = _mm_mul_ps(self.twiddle1im, x2m15); let t_b8_3 = _mm_mul_ps(self.twiddle7im, x3m14); let t_b8_4 = _mm_mul_ps(self.twiddle2im, x4m13); let t_b8_5 = _mm_mul_ps(self.twiddle6im, x5m12); let t_b8_6 = _mm_mul_ps(self.twiddle3im, x6m11); let t_b8_7 = _mm_mul_ps(self.twiddle5im, x7m10); let t_b8_8 = _mm_mul_ps(self.twiddle4im, x8m9); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8); let t_a7 = calc_f32!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8); let t_a8 = calc_f32!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 - t_b3_5 + t_b3_6 + t_b3_7 + t_b3_8); let t_b4 = calc_f32!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 - t_b4_7 - t_b4_8); let t_b5 = calc_f32!(t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8); 
let t_b6 = calc_f32!(t_b6_1 - t_b6_2 + t_b6_3 + t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8); let t_b7 = calc_f32!(t_b7_1 - t_b7_2 + t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 + t_b7_8); let t_b8 = calc_f32!(t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 + t_b8_7 - t_b8_8); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let y0 = calc_f32!(x0 + x1p16 + x2p15 + x3p14 + x4p13 + x5p12 + x6p11 + x7p10 + x8p9); let [y1, y16] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y15] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y14] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y13] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y12] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y11] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); let [y7, y10] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot); let [y8, y9] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16] } } // _ _____ __ _ _ _ _ _ // / |___ | / /_ | || | | |__ (_) |_ // | | / / _____ | '_ \| || |_| '_ \| | __| // | | / / |_____| | (_) |__ _| |_) | | |_ // |_|/_/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly17 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: __m128d, twiddle1im: __m128d, twiddle2re: __m128d, twiddle2im: __m128d, twiddle3re: __m128d, twiddle3im: __m128d, twiddle4re: __m128d, twiddle4im: __m128d, twiddle5re: __m128d, twiddle5im: __m128d, twiddle6re: __m128d, twiddle6im: __m128d, twiddle7re: __m128d, twiddle7im: __m128d, twiddle8re: __m128d, twiddle8im: __m128d, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly17, 17, |this: &SseF64Butterfly17<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly17, 17, |this: &SseF64Butterfly17<_>| this .direction); impl SseF64Butterfly17 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 17, direction); let tw2: Complex = twiddles::compute_twiddle(2, 17, direction); let tw3: Complex = twiddles::compute_twiddle(3, 17, direction); let tw4: Complex = twiddles::compute_twiddle(4, 17, direction); let tw5: Complex = twiddles::compute_twiddle(5, 17, direction); let tw6: Complex = twiddles::compute_twiddle(6, 17, direction); let tw7: Complex = twiddles::compute_twiddle(7, 17, direction); let tw8: Complex = twiddles::compute_twiddle(8, 17, direction); let twiddle1re = unsafe { _mm_set_pd(tw1.re, tw1.re) }; let twiddle1im = unsafe { _mm_set_pd(tw1.im, tw1.im) }; let twiddle2re = unsafe { _mm_set_pd(tw2.re, tw2.re) }; let twiddle2im = unsafe { _mm_set_pd(tw2.im, tw2.im) }; let twiddle3re = unsafe { _mm_set_pd(tw3.re, tw3.re) }; let twiddle3im = unsafe { _mm_set_pd(tw3.im, tw3.im) }; let twiddle4re = unsafe { _mm_set_pd(tw4.re, tw4.re) }; let twiddle4im = unsafe { _mm_set_pd(tw4.im, tw4.im) }; let twiddle5re = unsafe { _mm_set_pd(tw5.re, tw5.re) }; let twiddle5im = unsafe { _mm_set_pd(tw5.im, tw5.im) }; let twiddle6re = unsafe { _mm_set_pd(tw6.re, tw6.re) }; let twiddle6im = unsafe { _mm_set_pd(tw6.im, tw6.im) }; let twiddle7re = unsafe { _mm_set_pd(tw7.re, tw7.re) }; let 
twiddle7im = unsafe { _mm_set_pd(tw7.im, tw7.im) }; let twiddle8re = unsafe { _mm_set_pd(tw8.re, tw8.re) }; let twiddle8im = unsafe { _mm_set_pd(tw8.im, tw8.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 17]) -> [__m128d; 17] { let [x1p16, x1m16] = solo_fft2_f64(values[1], values[16]); let [x2p15, x2m15] = solo_fft2_f64(values[2], values[15]); let [x3p14, x3m14] = solo_fft2_f64(values[3], values[14]); let [x4p13, x4m13] = solo_fft2_f64(values[4], values[13]); let [x5p12, x5m12] = solo_fft2_f64(values[5], values[12]); let [x6p11, x6m11] = solo_fft2_f64(values[6], values[11]); let [x7p10, x7m10] = solo_fft2_f64(values[7], values[10]); let [x8p9, x8m9] = solo_fft2_f64(values[8], values[9]); let t_a1_1 = _mm_mul_pd(self.twiddle1re, x1p16); let t_a1_2 = _mm_mul_pd(self.twiddle2re, x2p15); let t_a1_3 = _mm_mul_pd(self.twiddle3re, x3p14); let t_a1_4 = _mm_mul_pd(self.twiddle4re, x4p13); let t_a1_5 = _mm_mul_pd(self.twiddle5re, x5p12); let t_a1_6 = _mm_mul_pd(self.twiddle6re, x6p11); let t_a1_7 = _mm_mul_pd(self.twiddle7re, x7p10); let t_a1_8 = _mm_mul_pd(self.twiddle8re, x8p9); let t_a2_1 = _mm_mul_pd(self.twiddle2re, x1p16); let t_a2_2 = _mm_mul_pd(self.twiddle4re, x2p15); let t_a2_3 = _mm_mul_pd(self.twiddle6re, x3p14); let t_a2_4 = _mm_mul_pd(self.twiddle8re, x4p13); let t_a2_5 = _mm_mul_pd(self.twiddle7re, x5p12); let t_a2_6 = _mm_mul_pd(self.twiddle5re, x6p11); let t_a2_7 = _mm_mul_pd(self.twiddle3re, x7p10); let t_a2_8 = _mm_mul_pd(self.twiddle1re, x8p9); let t_a3_1 = _mm_mul_pd(self.twiddle3re, x1p16); let t_a3_2 = _mm_mul_pd(self.twiddle6re, x2p15); let t_a3_3 = _mm_mul_pd(self.twiddle8re, x3p14); let t_a3_4 = _mm_mul_pd(self.twiddle5re, x4p13); let t_a3_5 = _mm_mul_pd(self.twiddle2re, x5p12); let t_a3_6 = _mm_mul_pd(self.twiddle1re, x6p11); let t_a3_7 = _mm_mul_pd(self.twiddle4re, x7p10); let t_a3_8 = _mm_mul_pd(self.twiddle7re, x8p9); let t_a4_1 = _mm_mul_pd(self.twiddle4re, x1p16); let t_a4_2 = _mm_mul_pd(self.twiddle8re, x2p15); let t_a4_3 = _mm_mul_pd(self.twiddle5re, x3p14); let t_a4_4 = _mm_mul_pd(self.twiddle1re, x4p13); let t_a4_5 = _mm_mul_pd(self.twiddle3re, x5p12); let t_a4_6 = _mm_mul_pd(self.twiddle7re, x6p11); let t_a4_7 = _mm_mul_pd(self.twiddle6re, x7p10); let t_a4_8 = _mm_mul_pd(self.twiddle2re, x8p9); let t_a5_1 = _mm_mul_pd(self.twiddle5re, x1p16); let t_a5_2 = _mm_mul_pd(self.twiddle7re, x2p15); let t_a5_3 = _mm_mul_pd(self.twiddle2re, x3p14); let t_a5_4 = _mm_mul_pd(self.twiddle3re, x4p13); let t_a5_5 = _mm_mul_pd(self.twiddle8re, x5p12); let t_a5_6 = _mm_mul_pd(self.twiddle4re, x6p11); let t_a5_7 = _mm_mul_pd(self.twiddle1re, x7p10); let t_a5_8 = _mm_mul_pd(self.twiddle6re, x8p9); let t_a6_1 = _mm_mul_pd(self.twiddle6re, x1p16); let t_a6_2 = _mm_mul_pd(self.twiddle5re, x2p15); let t_a6_3 = _mm_mul_pd(self.twiddle1re, x3p14); let t_a6_4 = _mm_mul_pd(self.twiddle7re, x4p13); let t_a6_5 = _mm_mul_pd(self.twiddle4re, x5p12); let 
t_a6_6 = _mm_mul_pd(self.twiddle2re, x6p11); let t_a6_7 = _mm_mul_pd(self.twiddle8re, x7p10); let t_a6_8 = _mm_mul_pd(self.twiddle3re, x8p9); let t_a7_1 = _mm_mul_pd(self.twiddle7re, x1p16); let t_a7_2 = _mm_mul_pd(self.twiddle3re, x2p15); let t_a7_3 = _mm_mul_pd(self.twiddle4re, x3p14); let t_a7_4 = _mm_mul_pd(self.twiddle6re, x4p13); let t_a7_5 = _mm_mul_pd(self.twiddle1re, x5p12); let t_a7_6 = _mm_mul_pd(self.twiddle8re, x6p11); let t_a7_7 = _mm_mul_pd(self.twiddle2re, x7p10); let t_a7_8 = _mm_mul_pd(self.twiddle5re, x8p9); let t_a8_1 = _mm_mul_pd(self.twiddle8re, x1p16); let t_a8_2 = _mm_mul_pd(self.twiddle1re, x2p15); let t_a8_3 = _mm_mul_pd(self.twiddle7re, x3p14); let t_a8_4 = _mm_mul_pd(self.twiddle2re, x4p13); let t_a8_5 = _mm_mul_pd(self.twiddle6re, x5p12); let t_a8_6 = _mm_mul_pd(self.twiddle3re, x6p11); let t_a8_7 = _mm_mul_pd(self.twiddle5re, x7p10); let t_a8_8 = _mm_mul_pd(self.twiddle4re, x8p9); let t_b1_1 = _mm_mul_pd(self.twiddle1im, x1m16); let t_b1_2 = _mm_mul_pd(self.twiddle2im, x2m15); let t_b1_3 = _mm_mul_pd(self.twiddle3im, x3m14); let t_b1_4 = _mm_mul_pd(self.twiddle4im, x4m13); let t_b1_5 = _mm_mul_pd(self.twiddle5im, x5m12); let t_b1_6 = _mm_mul_pd(self.twiddle6im, x6m11); let t_b1_7 = _mm_mul_pd(self.twiddle7im, x7m10); let t_b1_8 = _mm_mul_pd(self.twiddle8im, x8m9); let t_b2_1 = _mm_mul_pd(self.twiddle2im, x1m16); let t_b2_2 = _mm_mul_pd(self.twiddle4im, x2m15); let t_b2_3 = _mm_mul_pd(self.twiddle6im, x3m14); let t_b2_4 = _mm_mul_pd(self.twiddle8im, x4m13); let t_b2_5 = _mm_mul_pd(self.twiddle7im, x5m12); let t_b2_6 = _mm_mul_pd(self.twiddle5im, x6m11); let t_b2_7 = _mm_mul_pd(self.twiddle3im, x7m10); let t_b2_8 = _mm_mul_pd(self.twiddle1im, x8m9); let t_b3_1 = _mm_mul_pd(self.twiddle3im, x1m16); let t_b3_2 = _mm_mul_pd(self.twiddle6im, x2m15); let t_b3_3 = _mm_mul_pd(self.twiddle8im, x3m14); let t_b3_4 = _mm_mul_pd(self.twiddle5im, x4m13); let t_b3_5 = _mm_mul_pd(self.twiddle2im, x5m12); let t_b3_6 = _mm_mul_pd(self.twiddle1im, x6m11); let t_b3_7 = _mm_mul_pd(self.twiddle4im, x7m10); let t_b3_8 = _mm_mul_pd(self.twiddle7im, x8m9); let t_b4_1 = _mm_mul_pd(self.twiddle4im, x1m16); let t_b4_2 = _mm_mul_pd(self.twiddle8im, x2m15); let t_b4_3 = _mm_mul_pd(self.twiddle5im, x3m14); let t_b4_4 = _mm_mul_pd(self.twiddle1im, x4m13); let t_b4_5 = _mm_mul_pd(self.twiddle3im, x5m12); let t_b4_6 = _mm_mul_pd(self.twiddle7im, x6m11); let t_b4_7 = _mm_mul_pd(self.twiddle6im, x7m10); let t_b4_8 = _mm_mul_pd(self.twiddle2im, x8m9); let t_b5_1 = _mm_mul_pd(self.twiddle5im, x1m16); let t_b5_2 = _mm_mul_pd(self.twiddle7im, x2m15); let t_b5_3 = _mm_mul_pd(self.twiddle2im, x3m14); let t_b5_4 = _mm_mul_pd(self.twiddle3im, x4m13); let t_b5_5 = _mm_mul_pd(self.twiddle8im, x5m12); let t_b5_6 = _mm_mul_pd(self.twiddle4im, x6m11); let t_b5_7 = _mm_mul_pd(self.twiddle1im, x7m10); let t_b5_8 = _mm_mul_pd(self.twiddle6im, x8m9); let t_b6_1 = _mm_mul_pd(self.twiddle6im, x1m16); let t_b6_2 = _mm_mul_pd(self.twiddle5im, x2m15); let t_b6_3 = _mm_mul_pd(self.twiddle1im, x3m14); let t_b6_4 = _mm_mul_pd(self.twiddle7im, x4m13); let t_b6_5 = _mm_mul_pd(self.twiddle4im, x5m12); let t_b6_6 = _mm_mul_pd(self.twiddle2im, x6m11); let t_b6_7 = _mm_mul_pd(self.twiddle8im, x7m10); let t_b6_8 = _mm_mul_pd(self.twiddle3im, x8m9); let t_b7_1 = _mm_mul_pd(self.twiddle7im, x1m16); let t_b7_2 = _mm_mul_pd(self.twiddle3im, x2m15); let t_b7_3 = _mm_mul_pd(self.twiddle4im, x3m14); let t_b7_4 = _mm_mul_pd(self.twiddle6im, x4m13); let t_b7_5 = _mm_mul_pd(self.twiddle1im, x5m12); let t_b7_6 = 
_mm_mul_pd(self.twiddle8im, x6m11); let t_b7_7 = _mm_mul_pd(self.twiddle2im, x7m10); let t_b7_8 = _mm_mul_pd(self.twiddle5im, x8m9); let t_b8_1 = _mm_mul_pd(self.twiddle8im, x1m16); let t_b8_2 = _mm_mul_pd(self.twiddle1im, x2m15); let t_b8_3 = _mm_mul_pd(self.twiddle7im, x3m14); let t_b8_4 = _mm_mul_pd(self.twiddle2im, x4m13); let t_b8_5 = _mm_mul_pd(self.twiddle6im, x5m12); let t_b8_6 = _mm_mul_pd(self.twiddle3im, x6m11); let t_b8_7 = _mm_mul_pd(self.twiddle5im, x7m10); let t_b8_8 = _mm_mul_pd(self.twiddle4im, x8m9); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8); let t_a7 = calc_f64!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8); let t_a8 = calc_f64!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 - t_b3_5 + t_b3_6 + t_b3_7 + t_b3_8); let t_b4 = calc_f64!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 - t_b4_7 - t_b4_8); let t_b5 = calc_f64!(t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8); let t_b6 = calc_f64!(t_b6_1 - t_b6_2 + t_b6_3 + t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8); let t_b7 = calc_f64!(t_b7_1 - t_b7_2 + t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 + t_b7_8); let t_b8 = calc_f64!(t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 + t_b8_7 - t_b8_8); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let y0 = calc_f64!(x0 + x1p16 + x2p15 + x3p14 + x4p13 + x5p12 + x6p11 + x7p10 + x8p9); let [y1, y16] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y15] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y14] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y13] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y12] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y11] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y10] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y9] = solo_fft2_f64(t_a8, t_b8_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16] } } // _ ___ _________ _ _ _ // / |/ _ \ |___ /___ \| |__ (_) |_ // | | (_) | _____ |_ \ __) | '_ \| | __| // | |\__, | |_____| ___) / __/| |_) | | |_ // |_| /_/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly19 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: __m128, twiddle1im: __m128, twiddle2re: __m128, twiddle2im: __m128, twiddle3re: __m128, twiddle3im: __m128, twiddle4re: __m128, twiddle4im: __m128, twiddle5re: __m128, twiddle5im: __m128, twiddle6re: __m128, twiddle6im: __m128, twiddle7re: __m128, twiddle7im: __m128, twiddle8re: __m128, twiddle8im: 
__m128, twiddle9re: __m128, twiddle9im: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly19, 19, |this: &SseF32Butterfly19<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly19, 19, |this: &SseF32Butterfly19<_>| this .direction); impl SseF32Butterfly19 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 19, direction); let tw2: Complex = twiddles::compute_twiddle(2, 19, direction); let tw3: Complex = twiddles::compute_twiddle(3, 19, direction); let tw4: Complex = twiddles::compute_twiddle(4, 19, direction); let tw5: Complex = twiddles::compute_twiddle(5, 19, direction); let tw6: Complex = twiddles::compute_twiddle(6, 19, direction); let tw7: Complex = twiddles::compute_twiddle(7, 19, direction); let tw8: Complex = twiddles::compute_twiddle(8, 19, direction); let tw9: Complex = twiddles::compute_twiddle(9, 19, direction); let twiddle1re = unsafe { _mm_load1_ps(&tw1.re) }; let twiddle1im = unsafe { _mm_load1_ps(&tw1.im) }; let twiddle2re = unsafe { _mm_load1_ps(&tw2.re) }; let twiddle2im = unsafe { _mm_load1_ps(&tw2.im) }; let twiddle3re = unsafe { _mm_load1_ps(&tw3.re) }; let twiddle3im = unsafe { _mm_load1_ps(&tw3.im) }; let twiddle4re = unsafe { _mm_load1_ps(&tw4.re) }; let twiddle4im = unsafe { _mm_load1_ps(&tw4.im) }; let twiddle5re = unsafe { _mm_load1_ps(&tw5.re) }; let twiddle5im = unsafe { _mm_load1_ps(&tw5.im) }; let twiddle6re = unsafe { _mm_load1_ps(&tw6.re) }; let twiddle6im = unsafe { _mm_load1_ps(&tw6.im) }; let twiddle7re = unsafe { _mm_load1_ps(&tw7.re) }; let twiddle7im = unsafe { _mm_load1_ps(&tw7.im) }; let twiddle8re = unsafe { _mm_load1_ps(&tw8.re) }; let twiddle8im = unsafe { _mm_load1_ps(&tw8.im) }; let twiddle9re = unsafe { _mm_load1_ps(&tw9.re) }; let twiddle9im = unsafe { _mm_load1_ps(&tw9.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[9]), extract_hi_lo_f32(input_packed[0], input_packed[10]), extract_lo_hi_f32(input_packed[1], input_packed[10]), extract_hi_lo_f32(input_packed[1], input_packed[11]), extract_lo_hi_f32(input_packed[2], input_packed[11]), extract_hi_lo_f32(input_packed[2], input_packed[12]), extract_lo_hi_f32(input_packed[3], input_packed[12]), extract_hi_lo_f32(input_packed[3], input_packed[13]), extract_lo_hi_f32(input_packed[4], input_packed[13]), extract_hi_lo_f32(input_packed[4], input_packed[14]), extract_lo_hi_f32(input_packed[5], input_packed[14]), extract_hi_lo_f32(input_packed[5], input_packed[15]), extract_lo_hi_f32(input_packed[6], input_packed[15]), 
extract_hi_lo_f32(input_packed[6], input_packed[16]), extract_lo_hi_f32(input_packed[7], input_packed[16]), extract_hi_lo_f32(input_packed[7], input_packed[17]), extract_lo_hi_f32(input_packed[8], input_packed[17]), extract_hi_lo_f32(input_packed[8], input_packed[18]), extract_lo_hi_f32(input_packed[9], input_packed[18]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_lo_f32(out[16], out[17]), extract_lo_hi_f32(out[18], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 19]) -> [__m128; 19] { let [x1p18, x1m18] = parallel_fft2_interleaved_f32(values[1], values[18]); let [x2p17, x2m17] = parallel_fft2_interleaved_f32(values[2], values[17]); let [x3p16, x3m16] = parallel_fft2_interleaved_f32(values[3], values[16]); let [x4p15, x4m15] = parallel_fft2_interleaved_f32(values[4], values[15]); let [x5p14, x5m14] = parallel_fft2_interleaved_f32(values[5], values[14]); let [x6p13, x6m13] = parallel_fft2_interleaved_f32(values[6], values[13]); let [x7p12, x7m12] = parallel_fft2_interleaved_f32(values[7], values[12]); let [x8p11, x8m11] = parallel_fft2_interleaved_f32(values[8], values[11]); let [x9p10, x9m10] = parallel_fft2_interleaved_f32(values[9], values[10]); let t_a1_1 = _mm_mul_ps(self.twiddle1re, x1p18); let t_a1_2 = _mm_mul_ps(self.twiddle2re, x2p17); let t_a1_3 = _mm_mul_ps(self.twiddle3re, x3p16); let t_a1_4 = _mm_mul_ps(self.twiddle4re, x4p15); let t_a1_5 = _mm_mul_ps(self.twiddle5re, x5p14); let t_a1_6 = _mm_mul_ps(self.twiddle6re, x6p13); let t_a1_7 = _mm_mul_ps(self.twiddle7re, x7p12); let t_a1_8 = _mm_mul_ps(self.twiddle8re, x8p11); let t_a1_9 = _mm_mul_ps(self.twiddle9re, x9p10); let t_a2_1 = _mm_mul_ps(self.twiddle2re, x1p18); let t_a2_2 = _mm_mul_ps(self.twiddle4re, x2p17); let t_a2_3 = _mm_mul_ps(self.twiddle6re, x3p16); let t_a2_4 = _mm_mul_ps(self.twiddle8re, x4p15); let t_a2_5 = _mm_mul_ps(self.twiddle9re, x5p14); let t_a2_6 = _mm_mul_ps(self.twiddle7re, x6p13); let t_a2_7 = _mm_mul_ps(self.twiddle5re, x7p12); let t_a2_8 = _mm_mul_ps(self.twiddle3re, x8p11); let t_a2_9 = _mm_mul_ps(self.twiddle1re, x9p10); let t_a3_1 = _mm_mul_ps(self.twiddle3re, x1p18); let t_a3_2 = _mm_mul_ps(self.twiddle6re, x2p17); let t_a3_3 = _mm_mul_ps(self.twiddle9re, x3p16); let t_a3_4 = _mm_mul_ps(self.twiddle7re, x4p15); let t_a3_5 = _mm_mul_ps(self.twiddle4re, x5p14); let t_a3_6 = _mm_mul_ps(self.twiddle1re, x6p13); let t_a3_7 = _mm_mul_ps(self.twiddle2re, x7p12); let t_a3_8 = _mm_mul_ps(self.twiddle5re, x8p11); let t_a3_9 = _mm_mul_ps(self.twiddle8re, x9p10); let t_a4_1 = _mm_mul_ps(self.twiddle4re, x1p18); let t_a4_2 = _mm_mul_ps(self.twiddle8re, x2p17); let t_a4_3 = _mm_mul_ps(self.twiddle7re, x3p16); let t_a4_4 = _mm_mul_ps(self.twiddle3re, x4p15); let t_a4_5 = 
_mm_mul_ps(self.twiddle1re, x5p14); let t_a4_6 = _mm_mul_ps(self.twiddle5re, x6p13); let t_a4_7 = _mm_mul_ps(self.twiddle9re, x7p12); let t_a4_8 = _mm_mul_ps(self.twiddle6re, x8p11); let t_a4_9 = _mm_mul_ps(self.twiddle2re, x9p10); let t_a5_1 = _mm_mul_ps(self.twiddle5re, x1p18); let t_a5_2 = _mm_mul_ps(self.twiddle9re, x2p17); let t_a5_3 = _mm_mul_ps(self.twiddle4re, x3p16); let t_a5_4 = _mm_mul_ps(self.twiddle1re, x4p15); let t_a5_5 = _mm_mul_ps(self.twiddle6re, x5p14); let t_a5_6 = _mm_mul_ps(self.twiddle8re, x6p13); let t_a5_7 = _mm_mul_ps(self.twiddle3re, x7p12); let t_a5_8 = _mm_mul_ps(self.twiddle2re, x8p11); let t_a5_9 = _mm_mul_ps(self.twiddle7re, x9p10); let t_a6_1 = _mm_mul_ps(self.twiddle6re, x1p18); let t_a6_2 = _mm_mul_ps(self.twiddle7re, x2p17); let t_a6_3 = _mm_mul_ps(self.twiddle1re, x3p16); let t_a6_4 = _mm_mul_ps(self.twiddle5re, x4p15); let t_a6_5 = _mm_mul_ps(self.twiddle8re, x5p14); let t_a6_6 = _mm_mul_ps(self.twiddle2re, x6p13); let t_a6_7 = _mm_mul_ps(self.twiddle4re, x7p12); let t_a6_8 = _mm_mul_ps(self.twiddle9re, x8p11); let t_a6_9 = _mm_mul_ps(self.twiddle3re, x9p10); let t_a7_1 = _mm_mul_ps(self.twiddle7re, x1p18); let t_a7_2 = _mm_mul_ps(self.twiddle5re, x2p17); let t_a7_3 = _mm_mul_ps(self.twiddle2re, x3p16); let t_a7_4 = _mm_mul_ps(self.twiddle9re, x4p15); let t_a7_5 = _mm_mul_ps(self.twiddle3re, x5p14); let t_a7_6 = _mm_mul_ps(self.twiddle4re, x6p13); let t_a7_7 = _mm_mul_ps(self.twiddle8re, x7p12); let t_a7_8 = _mm_mul_ps(self.twiddle1re, x8p11); let t_a7_9 = _mm_mul_ps(self.twiddle6re, x9p10); let t_a8_1 = _mm_mul_ps(self.twiddle8re, x1p18); let t_a8_2 = _mm_mul_ps(self.twiddle3re, x2p17); let t_a8_3 = _mm_mul_ps(self.twiddle5re, x3p16); let t_a8_4 = _mm_mul_ps(self.twiddle6re, x4p15); let t_a8_5 = _mm_mul_ps(self.twiddle2re, x5p14); let t_a8_6 = _mm_mul_ps(self.twiddle9re, x6p13); let t_a8_7 = _mm_mul_ps(self.twiddle1re, x7p12); let t_a8_8 = _mm_mul_ps(self.twiddle7re, x8p11); let t_a8_9 = _mm_mul_ps(self.twiddle4re, x9p10); let t_a9_1 = _mm_mul_ps(self.twiddle9re, x1p18); let t_a9_2 = _mm_mul_ps(self.twiddle1re, x2p17); let t_a9_3 = _mm_mul_ps(self.twiddle8re, x3p16); let t_a9_4 = _mm_mul_ps(self.twiddle2re, x4p15); let t_a9_5 = _mm_mul_ps(self.twiddle7re, x5p14); let t_a9_6 = _mm_mul_ps(self.twiddle3re, x6p13); let t_a9_7 = _mm_mul_ps(self.twiddle6re, x7p12); let t_a9_8 = _mm_mul_ps(self.twiddle4re, x8p11); let t_a9_9 = _mm_mul_ps(self.twiddle5re, x9p10); let t_b1_1 = _mm_mul_ps(self.twiddle1im, x1m18); let t_b1_2 = _mm_mul_ps(self.twiddle2im, x2m17); let t_b1_3 = _mm_mul_ps(self.twiddle3im, x3m16); let t_b1_4 = _mm_mul_ps(self.twiddle4im, x4m15); let t_b1_5 = _mm_mul_ps(self.twiddle5im, x5m14); let t_b1_6 = _mm_mul_ps(self.twiddle6im, x6m13); let t_b1_7 = _mm_mul_ps(self.twiddle7im, x7m12); let t_b1_8 = _mm_mul_ps(self.twiddle8im, x8m11); let t_b1_9 = _mm_mul_ps(self.twiddle9im, x9m10); let t_b2_1 = _mm_mul_ps(self.twiddle2im, x1m18); let t_b2_2 = _mm_mul_ps(self.twiddle4im, x2m17); let t_b2_3 = _mm_mul_ps(self.twiddle6im, x3m16); let t_b2_4 = _mm_mul_ps(self.twiddle8im, x4m15); let t_b2_5 = _mm_mul_ps(self.twiddle9im, x5m14); let t_b2_6 = _mm_mul_ps(self.twiddle7im, x6m13); let t_b2_7 = _mm_mul_ps(self.twiddle5im, x7m12); let t_b2_8 = _mm_mul_ps(self.twiddle3im, x8m11); let t_b2_9 = _mm_mul_ps(self.twiddle1im, x9m10); let t_b3_1 = _mm_mul_ps(self.twiddle3im, x1m18); let t_b3_2 = _mm_mul_ps(self.twiddle6im, x2m17); let t_b3_3 = _mm_mul_ps(self.twiddle9im, x3m16); let t_b3_4 = _mm_mul_ps(self.twiddle7im, x4m15); let t_b3_5 = 
_mm_mul_ps(self.twiddle4im, x5m14); let t_b3_6 = _mm_mul_ps(self.twiddle1im, x6m13); let t_b3_7 = _mm_mul_ps(self.twiddle2im, x7m12); let t_b3_8 = _mm_mul_ps(self.twiddle5im, x8m11); let t_b3_9 = _mm_mul_ps(self.twiddle8im, x9m10); let t_b4_1 = _mm_mul_ps(self.twiddle4im, x1m18); let t_b4_2 = _mm_mul_ps(self.twiddle8im, x2m17); let t_b4_3 = _mm_mul_ps(self.twiddle7im, x3m16); let t_b4_4 = _mm_mul_ps(self.twiddle3im, x4m15); let t_b4_5 = _mm_mul_ps(self.twiddle1im, x5m14); let t_b4_6 = _mm_mul_ps(self.twiddle5im, x6m13); let t_b4_7 = _mm_mul_ps(self.twiddle9im, x7m12); let t_b4_8 = _mm_mul_ps(self.twiddle6im, x8m11); let t_b4_9 = _mm_mul_ps(self.twiddle2im, x9m10); let t_b5_1 = _mm_mul_ps(self.twiddle5im, x1m18); let t_b5_2 = _mm_mul_ps(self.twiddle9im, x2m17); let t_b5_3 = _mm_mul_ps(self.twiddle4im, x3m16); let t_b5_4 = _mm_mul_ps(self.twiddle1im, x4m15); let t_b5_5 = _mm_mul_ps(self.twiddle6im, x5m14); let t_b5_6 = _mm_mul_ps(self.twiddle8im, x6m13); let t_b5_7 = _mm_mul_ps(self.twiddle3im, x7m12); let t_b5_8 = _mm_mul_ps(self.twiddle2im, x8m11); let t_b5_9 = _mm_mul_ps(self.twiddle7im, x9m10); let t_b6_1 = _mm_mul_ps(self.twiddle6im, x1m18); let t_b6_2 = _mm_mul_ps(self.twiddle7im, x2m17); let t_b6_3 = _mm_mul_ps(self.twiddle1im, x3m16); let t_b6_4 = _mm_mul_ps(self.twiddle5im, x4m15); let t_b6_5 = _mm_mul_ps(self.twiddle8im, x5m14); let t_b6_6 = _mm_mul_ps(self.twiddle2im, x6m13); let t_b6_7 = _mm_mul_ps(self.twiddle4im, x7m12); let t_b6_8 = _mm_mul_ps(self.twiddle9im, x8m11); let t_b6_9 = _mm_mul_ps(self.twiddle3im, x9m10); let t_b7_1 = _mm_mul_ps(self.twiddle7im, x1m18); let t_b7_2 = _mm_mul_ps(self.twiddle5im, x2m17); let t_b7_3 = _mm_mul_ps(self.twiddle2im, x3m16); let t_b7_4 = _mm_mul_ps(self.twiddle9im, x4m15); let t_b7_5 = _mm_mul_ps(self.twiddle3im, x5m14); let t_b7_6 = _mm_mul_ps(self.twiddle4im, x6m13); let t_b7_7 = _mm_mul_ps(self.twiddle8im, x7m12); let t_b7_8 = _mm_mul_ps(self.twiddle1im, x8m11); let t_b7_9 = _mm_mul_ps(self.twiddle6im, x9m10); let t_b8_1 = _mm_mul_ps(self.twiddle8im, x1m18); let t_b8_2 = _mm_mul_ps(self.twiddle3im, x2m17); let t_b8_3 = _mm_mul_ps(self.twiddle5im, x3m16); let t_b8_4 = _mm_mul_ps(self.twiddle6im, x4m15); let t_b8_5 = _mm_mul_ps(self.twiddle2im, x5m14); let t_b8_6 = _mm_mul_ps(self.twiddle9im, x6m13); let t_b8_7 = _mm_mul_ps(self.twiddle1im, x7m12); let t_b8_8 = _mm_mul_ps(self.twiddle7im, x8m11); let t_b8_9 = _mm_mul_ps(self.twiddle4im, x9m10); let t_b9_1 = _mm_mul_ps(self.twiddle9im, x1m18); let t_b9_2 = _mm_mul_ps(self.twiddle1im, x2m17); let t_b9_3 = _mm_mul_ps(self.twiddle8im, x3m16); let t_b9_4 = _mm_mul_ps(self.twiddle2im, x4m15); let t_b9_5 = _mm_mul_ps(self.twiddle7im, x5m14); let t_b9_6 = _mm_mul_ps(self.twiddle3im, x6m13); let t_b9_7 = _mm_mul_ps(self.twiddle6im, x7m12); let t_b9_8 = _mm_mul_ps(self.twiddle4im, x8m11); let t_b9_9 = _mm_mul_ps(self.twiddle5im, x9m10); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9); let t_a7 = 
calc_f32!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9); let t_a8 = calc_f32!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9); let t_a9 = calc_f32!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 + t_b3_7 + t_b3_8 + t_b3_9); let t_b4 = calc_f32!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 + t_b4_7 - t_b4_8 - t_b4_9); let t_b5 = calc_f32!(t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 - t_b5_7 + t_b5_8 + t_b5_9); let t_b6 = calc_f32!(t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 - t_b6_5 - t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9); let t_b7 = calc_f32!(t_b7_1 - t_b7_2 + t_b7_3 + t_b7_4 - t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9); let t_b8 = calc_f32!(t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 - t_b8_9); let t_b9 = calc_f32!(t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 - t_b9_8 + t_b9_9); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let t_b9_rot = self.rotate.rotate_both(t_b9); let y0 = calc_f32!(x0 + x1p18 + x2p17 + x3p16 + x4p15 + x5p14 + x6p13 + x7p12 + x8p11 + x9p10); let [y1, y18] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y17] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y16] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y15] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y14] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y13] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); let [y7, y12] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot); let [y8, y11] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot); let [y9, y10] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18] } } // _ ___ __ _ _ _ _ _ // / |/ _ \ / /_ | || | | |__ (_) |_ // | | (_) | _____ | '_ \| || |_| '_ \| | __| // | |\__, | |_____| | (_) |__ _| |_) | | |_ // |_| /_/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly19 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: __m128d, twiddle1im: __m128d, twiddle2re: __m128d, twiddle2im: __m128d, twiddle3re: __m128d, twiddle3im: __m128d, twiddle4re: __m128d, twiddle4im: __m128d, twiddle5re: __m128d, twiddle5im: __m128d, twiddle6re: __m128d, twiddle6im: __m128d, twiddle7re: __m128d, twiddle7im: __m128d, twiddle8re: __m128d, twiddle8im: __m128d, twiddle9re: __m128d, twiddle9im: __m128d, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly19, 19, |this: &SseF64Butterfly19<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly19, 19, |this: &SseF64Butterfly19<_>| this .direction); impl SseF64Butterfly19 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 19, direction); let tw2: Complex = twiddles::compute_twiddle(2, 19, 
direction); let tw3: Complex = twiddles::compute_twiddle(3, 19, direction); let tw4: Complex = twiddles::compute_twiddle(4, 19, direction); let tw5: Complex = twiddles::compute_twiddle(5, 19, direction); let tw6: Complex = twiddles::compute_twiddle(6, 19, direction); let tw7: Complex = twiddles::compute_twiddle(7, 19, direction); let tw8: Complex = twiddles::compute_twiddle(8, 19, direction); let tw9: Complex = twiddles::compute_twiddle(9, 19, direction); let twiddle1re = unsafe { _mm_set_pd(tw1.re, tw1.re) }; let twiddle1im = unsafe { _mm_set_pd(tw1.im, tw1.im) }; let twiddle2re = unsafe { _mm_set_pd(tw2.re, tw2.re) }; let twiddle2im = unsafe { _mm_set_pd(tw2.im, tw2.im) }; let twiddle3re = unsafe { _mm_set_pd(tw3.re, tw3.re) }; let twiddle3im = unsafe { _mm_set_pd(tw3.im, tw3.im) }; let twiddle4re = unsafe { _mm_set_pd(tw4.re, tw4.re) }; let twiddle4im = unsafe { _mm_set_pd(tw4.im, tw4.im) }; let twiddle5re = unsafe { _mm_set_pd(tw5.re, tw5.re) }; let twiddle5im = unsafe { _mm_set_pd(tw5.im, tw5.im) }; let twiddle6re = unsafe { _mm_set_pd(tw6.re, tw6.re) }; let twiddle6im = unsafe { _mm_set_pd(tw6.im, tw6.im) }; let twiddle7re = unsafe { _mm_set_pd(tw7.re, tw7.re) }; let twiddle7im = unsafe { _mm_set_pd(tw7.im, tw7.im) }; let twiddle8re = unsafe { _mm_set_pd(tw8.re, tw8.re) }; let twiddle8im = unsafe { _mm_set_pd(tw8.im, tw8.im) }; let twiddle9re = unsafe { _mm_set_pd(tw9.re, tw9.re) }; let twiddle9im = unsafe { _mm_set_pd(tw9.im, tw9.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 19]) -> [__m128d; 19] { let [x1p18, x1m18] = solo_fft2_f64(values[1], values[18]); let [x2p17, x2m17] = solo_fft2_f64(values[2], values[17]); let [x3p16, x3m16] = solo_fft2_f64(values[3], values[16]); let [x4p15, x4m15] = solo_fft2_f64(values[4], values[15]); let [x5p14, x5m14] = solo_fft2_f64(values[5], values[14]); let [x6p13, x6m13] = solo_fft2_f64(values[6], values[13]); let [x7p12, x7m12] = solo_fft2_f64(values[7], values[12]); let [x8p11, x8m11] = solo_fft2_f64(values[8], values[11]); let [x9p10, x9m10] = solo_fft2_f64(values[9], values[10]); let t_a1_1 = _mm_mul_pd(self.twiddle1re, x1p18); let t_a1_2 = _mm_mul_pd(self.twiddle2re, x2p17); let t_a1_3 = _mm_mul_pd(self.twiddle3re, x3p16); let t_a1_4 = _mm_mul_pd(self.twiddle4re, x4p15); let t_a1_5 = _mm_mul_pd(self.twiddle5re, x5p14); let t_a1_6 = _mm_mul_pd(self.twiddle6re, x6p13); let t_a1_7 = _mm_mul_pd(self.twiddle7re, x7p12); let t_a1_8 = _mm_mul_pd(self.twiddle8re, x8p11); let t_a1_9 = _mm_mul_pd(self.twiddle9re, x9p10); let t_a2_1 = _mm_mul_pd(self.twiddle2re, x1p18); let t_a2_2 = _mm_mul_pd(self.twiddle4re, x2p17); let t_a2_3 = _mm_mul_pd(self.twiddle6re, x3p16); let t_a2_4 = _mm_mul_pd(self.twiddle8re, x4p15); let t_a2_5 = _mm_mul_pd(self.twiddle9re, x5p14); let t_a2_6 = _mm_mul_pd(self.twiddle7re, x6p13); let t_a2_7 = _mm_mul_pd(self.twiddle5re, x7p12); let t_a2_8 = 
_mm_mul_pd(self.twiddle3re, x8p11); let t_a2_9 = _mm_mul_pd(self.twiddle1re, x9p10); let t_a3_1 = _mm_mul_pd(self.twiddle3re, x1p18); let t_a3_2 = _mm_mul_pd(self.twiddle6re, x2p17); let t_a3_3 = _mm_mul_pd(self.twiddle9re, x3p16); let t_a3_4 = _mm_mul_pd(self.twiddle7re, x4p15); let t_a3_5 = _mm_mul_pd(self.twiddle4re, x5p14); let t_a3_6 = _mm_mul_pd(self.twiddle1re, x6p13); let t_a3_7 = _mm_mul_pd(self.twiddle2re, x7p12); let t_a3_8 = _mm_mul_pd(self.twiddle5re, x8p11); let t_a3_9 = _mm_mul_pd(self.twiddle8re, x9p10); let t_a4_1 = _mm_mul_pd(self.twiddle4re, x1p18); let t_a4_2 = _mm_mul_pd(self.twiddle8re, x2p17); let t_a4_3 = _mm_mul_pd(self.twiddle7re, x3p16); let t_a4_4 = _mm_mul_pd(self.twiddle3re, x4p15); let t_a4_5 = _mm_mul_pd(self.twiddle1re, x5p14); let t_a4_6 = _mm_mul_pd(self.twiddle5re, x6p13); let t_a4_7 = _mm_mul_pd(self.twiddle9re, x7p12); let t_a4_8 = _mm_mul_pd(self.twiddle6re, x8p11); let t_a4_9 = _mm_mul_pd(self.twiddle2re, x9p10); let t_a5_1 = _mm_mul_pd(self.twiddle5re, x1p18); let t_a5_2 = _mm_mul_pd(self.twiddle9re, x2p17); let t_a5_3 = _mm_mul_pd(self.twiddle4re, x3p16); let t_a5_4 = _mm_mul_pd(self.twiddle1re, x4p15); let t_a5_5 = _mm_mul_pd(self.twiddle6re, x5p14); let t_a5_6 = _mm_mul_pd(self.twiddle8re, x6p13); let t_a5_7 = _mm_mul_pd(self.twiddle3re, x7p12); let t_a5_8 = _mm_mul_pd(self.twiddle2re, x8p11); let t_a5_9 = _mm_mul_pd(self.twiddle7re, x9p10); let t_a6_1 = _mm_mul_pd(self.twiddle6re, x1p18); let t_a6_2 = _mm_mul_pd(self.twiddle7re, x2p17); let t_a6_3 = _mm_mul_pd(self.twiddle1re, x3p16); let t_a6_4 = _mm_mul_pd(self.twiddle5re, x4p15); let t_a6_5 = _mm_mul_pd(self.twiddle8re, x5p14); let t_a6_6 = _mm_mul_pd(self.twiddle2re, x6p13); let t_a6_7 = _mm_mul_pd(self.twiddle4re, x7p12); let t_a6_8 = _mm_mul_pd(self.twiddle9re, x8p11); let t_a6_9 = _mm_mul_pd(self.twiddle3re, x9p10); let t_a7_1 = _mm_mul_pd(self.twiddle7re, x1p18); let t_a7_2 = _mm_mul_pd(self.twiddle5re, x2p17); let t_a7_3 = _mm_mul_pd(self.twiddle2re, x3p16); let t_a7_4 = _mm_mul_pd(self.twiddle9re, x4p15); let t_a7_5 = _mm_mul_pd(self.twiddle3re, x5p14); let t_a7_6 = _mm_mul_pd(self.twiddle4re, x6p13); let t_a7_7 = _mm_mul_pd(self.twiddle8re, x7p12); let t_a7_8 = _mm_mul_pd(self.twiddle1re, x8p11); let t_a7_9 = _mm_mul_pd(self.twiddle6re, x9p10); let t_a8_1 = _mm_mul_pd(self.twiddle8re, x1p18); let t_a8_2 = _mm_mul_pd(self.twiddle3re, x2p17); let t_a8_3 = _mm_mul_pd(self.twiddle5re, x3p16); let t_a8_4 = _mm_mul_pd(self.twiddle6re, x4p15); let t_a8_5 = _mm_mul_pd(self.twiddle2re, x5p14); let t_a8_6 = _mm_mul_pd(self.twiddle9re, x6p13); let t_a8_7 = _mm_mul_pd(self.twiddle1re, x7p12); let t_a8_8 = _mm_mul_pd(self.twiddle7re, x8p11); let t_a8_9 = _mm_mul_pd(self.twiddle4re, x9p10); let t_a9_1 = _mm_mul_pd(self.twiddle9re, x1p18); let t_a9_2 = _mm_mul_pd(self.twiddle1re, x2p17); let t_a9_3 = _mm_mul_pd(self.twiddle8re, x3p16); let t_a9_4 = _mm_mul_pd(self.twiddle2re, x4p15); let t_a9_5 = _mm_mul_pd(self.twiddle7re, x5p14); let t_a9_6 = _mm_mul_pd(self.twiddle3re, x6p13); let t_a9_7 = _mm_mul_pd(self.twiddle6re, x7p12); let t_a9_8 = _mm_mul_pd(self.twiddle4re, x8p11); let t_a9_9 = _mm_mul_pd(self.twiddle5re, x9p10); let t_b1_1 = _mm_mul_pd(self.twiddle1im, x1m18); let t_b1_2 = _mm_mul_pd(self.twiddle2im, x2m17); let t_b1_3 = _mm_mul_pd(self.twiddle3im, x3m16); let t_b1_4 = _mm_mul_pd(self.twiddle4im, x4m15); let t_b1_5 = _mm_mul_pd(self.twiddle5im, x5m14); let t_b1_6 = _mm_mul_pd(self.twiddle6im, x6m13); let t_b1_7 = _mm_mul_pd(self.twiddle7im, x7m12); let t_b1_8 = 
_mm_mul_pd(self.twiddle8im, x8m11); let t_b1_9 = _mm_mul_pd(self.twiddle9im, x9m10); let t_b2_1 = _mm_mul_pd(self.twiddle2im, x1m18); let t_b2_2 = _mm_mul_pd(self.twiddle4im, x2m17); let t_b2_3 = _mm_mul_pd(self.twiddle6im, x3m16); let t_b2_4 = _mm_mul_pd(self.twiddle8im, x4m15); let t_b2_5 = _mm_mul_pd(self.twiddle9im, x5m14); let t_b2_6 = _mm_mul_pd(self.twiddle7im, x6m13); let t_b2_7 = _mm_mul_pd(self.twiddle5im, x7m12); let t_b2_8 = _mm_mul_pd(self.twiddle3im, x8m11); let t_b2_9 = _mm_mul_pd(self.twiddle1im, x9m10); let t_b3_1 = _mm_mul_pd(self.twiddle3im, x1m18); let t_b3_2 = _mm_mul_pd(self.twiddle6im, x2m17); let t_b3_3 = _mm_mul_pd(self.twiddle9im, x3m16); let t_b3_4 = _mm_mul_pd(self.twiddle7im, x4m15); let t_b3_5 = _mm_mul_pd(self.twiddle4im, x5m14); let t_b3_6 = _mm_mul_pd(self.twiddle1im, x6m13); let t_b3_7 = _mm_mul_pd(self.twiddle2im, x7m12); let t_b3_8 = _mm_mul_pd(self.twiddle5im, x8m11); let t_b3_9 = _mm_mul_pd(self.twiddle8im, x9m10); let t_b4_1 = _mm_mul_pd(self.twiddle4im, x1m18); let t_b4_2 = _mm_mul_pd(self.twiddle8im, x2m17); let t_b4_3 = _mm_mul_pd(self.twiddle7im, x3m16); let t_b4_4 = _mm_mul_pd(self.twiddle3im, x4m15); let t_b4_5 = _mm_mul_pd(self.twiddle1im, x5m14); let t_b4_6 = _mm_mul_pd(self.twiddle5im, x6m13); let t_b4_7 = _mm_mul_pd(self.twiddle9im, x7m12); let t_b4_8 = _mm_mul_pd(self.twiddle6im, x8m11); let t_b4_9 = _mm_mul_pd(self.twiddle2im, x9m10); let t_b5_1 = _mm_mul_pd(self.twiddle5im, x1m18); let t_b5_2 = _mm_mul_pd(self.twiddle9im, x2m17); let t_b5_3 = _mm_mul_pd(self.twiddle4im, x3m16); let t_b5_4 = _mm_mul_pd(self.twiddle1im, x4m15); let t_b5_5 = _mm_mul_pd(self.twiddle6im, x5m14); let t_b5_6 = _mm_mul_pd(self.twiddle8im, x6m13); let t_b5_7 = _mm_mul_pd(self.twiddle3im, x7m12); let t_b5_8 = _mm_mul_pd(self.twiddle2im, x8m11); let t_b5_9 = _mm_mul_pd(self.twiddle7im, x9m10); let t_b6_1 = _mm_mul_pd(self.twiddle6im, x1m18); let t_b6_2 = _mm_mul_pd(self.twiddle7im, x2m17); let t_b6_3 = _mm_mul_pd(self.twiddle1im, x3m16); let t_b6_4 = _mm_mul_pd(self.twiddle5im, x4m15); let t_b6_5 = _mm_mul_pd(self.twiddle8im, x5m14); let t_b6_6 = _mm_mul_pd(self.twiddle2im, x6m13); let t_b6_7 = _mm_mul_pd(self.twiddle4im, x7m12); let t_b6_8 = _mm_mul_pd(self.twiddle9im, x8m11); let t_b6_9 = _mm_mul_pd(self.twiddle3im, x9m10); let t_b7_1 = _mm_mul_pd(self.twiddle7im, x1m18); let t_b7_2 = _mm_mul_pd(self.twiddle5im, x2m17); let t_b7_3 = _mm_mul_pd(self.twiddle2im, x3m16); let t_b7_4 = _mm_mul_pd(self.twiddle9im, x4m15); let t_b7_5 = _mm_mul_pd(self.twiddle3im, x5m14); let t_b7_6 = _mm_mul_pd(self.twiddle4im, x6m13); let t_b7_7 = _mm_mul_pd(self.twiddle8im, x7m12); let t_b7_8 = _mm_mul_pd(self.twiddle1im, x8m11); let t_b7_9 = _mm_mul_pd(self.twiddle6im, x9m10); let t_b8_1 = _mm_mul_pd(self.twiddle8im, x1m18); let t_b8_2 = _mm_mul_pd(self.twiddle3im, x2m17); let t_b8_3 = _mm_mul_pd(self.twiddle5im, x3m16); let t_b8_4 = _mm_mul_pd(self.twiddle6im, x4m15); let t_b8_5 = _mm_mul_pd(self.twiddle2im, x5m14); let t_b8_6 = _mm_mul_pd(self.twiddle9im, x6m13); let t_b8_7 = _mm_mul_pd(self.twiddle1im, x7m12); let t_b8_8 = _mm_mul_pd(self.twiddle7im, x8m11); let t_b8_9 = _mm_mul_pd(self.twiddle4im, x9m10); let t_b9_1 = _mm_mul_pd(self.twiddle9im, x1m18); let t_b9_2 = _mm_mul_pd(self.twiddle1im, x2m17); let t_b9_3 = _mm_mul_pd(self.twiddle8im, x3m16); let t_b9_4 = _mm_mul_pd(self.twiddle2im, x4m15); let t_b9_5 = _mm_mul_pd(self.twiddle7im, x5m14); let t_b9_6 = _mm_mul_pd(self.twiddle3im, x6m13); let t_b9_7 = _mm_mul_pd(self.twiddle6im, x7m12); let t_b9_8 = 
_mm_mul_pd(self.twiddle4im, x8m11); let t_b9_9 = _mm_mul_pd(self.twiddle5im, x9m10); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9); let t_a7 = calc_f64!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9); let t_a8 = calc_f64!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9); let t_a9 = calc_f64!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 + t_b3_7 + t_b3_8 + t_b3_9); let t_b4 = calc_f64!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 + t_b4_7 - t_b4_8 - t_b4_9); let t_b5 = calc_f64!(t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 - t_b5_7 + t_b5_8 + t_b5_9); let t_b6 = calc_f64!(t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 - t_b6_5 - t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9); let t_b7 = calc_f64!(t_b7_1 - t_b7_2 + t_b7_3 + t_b7_4 - t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9); let t_b8 = calc_f64!(t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 - t_b8_9); let t_b9 = calc_f64!(t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 - t_b9_8 + t_b9_9); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let y0 = calc_f64!(x0 + x1p18 + x2p17 + x3p16 + x4p15 + x5p14 + x6p13 + x7p12 + x8p11 + x9p10); let [y1, y18] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y17] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y16] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y15] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y14] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y13] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y12] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y11] = solo_fft2_f64(t_a8, t_b8_rot); let [y9, y10] = solo_fft2_f64(t_a9, t_b9_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18] } } // ____ _____ _________ _ _ _ // |___ \|___ / |___ /___ \| |__ (_) |_ // __) | |_ \ _____ |_ \ __) | '_ \| | __| // / __/ ___) | |_____| ___) / __/| |_) | | |_ // |_____|____/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly23 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: __m128, twiddle1im: __m128, twiddle2re: __m128, twiddle2im: __m128, twiddle3re: __m128, twiddle3im: __m128, twiddle4re: __m128, twiddle4im: __m128, twiddle5re: __m128, twiddle5im: __m128, twiddle6re: __m128, twiddle6im: __m128, twiddle7re: __m128, twiddle7im: 
__m128, twiddle8re: __m128, twiddle8im: __m128, twiddle9re: __m128, twiddle9im: __m128, twiddle10re: __m128, twiddle10im: __m128, twiddle11re: __m128, twiddle11im: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly23, 23, |this: &SseF32Butterfly23<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly23, 23, |this: &SseF32Butterfly23<_>| this .direction); impl SseF32Butterfly23 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 23, direction); let tw2: Complex = twiddles::compute_twiddle(2, 23, direction); let tw3: Complex = twiddles::compute_twiddle(3, 23, direction); let tw4: Complex = twiddles::compute_twiddle(4, 23, direction); let tw5: Complex = twiddles::compute_twiddle(5, 23, direction); let tw6: Complex = twiddles::compute_twiddle(6, 23, direction); let tw7: Complex = twiddles::compute_twiddle(7, 23, direction); let tw8: Complex = twiddles::compute_twiddle(8, 23, direction); let tw9: Complex = twiddles::compute_twiddle(9, 23, direction); let tw10: Complex = twiddles::compute_twiddle(10, 23, direction); let tw11: Complex = twiddles::compute_twiddle(11, 23, direction); let twiddle1re = unsafe { _mm_load1_ps(&tw1.re) }; let twiddle1im = unsafe { _mm_load1_ps(&tw1.im) }; let twiddle2re = unsafe { _mm_load1_ps(&tw2.re) }; let twiddle2im = unsafe { _mm_load1_ps(&tw2.im) }; let twiddle3re = unsafe { _mm_load1_ps(&tw3.re) }; let twiddle3im = unsafe { _mm_load1_ps(&tw3.im) }; let twiddle4re = unsafe { _mm_load1_ps(&tw4.re) }; let twiddle4im = unsafe { _mm_load1_ps(&tw4.im) }; let twiddle5re = unsafe { _mm_load1_ps(&tw5.re) }; let twiddle5im = unsafe { _mm_load1_ps(&tw5.im) }; let twiddle6re = unsafe { _mm_load1_ps(&tw6.re) }; let twiddle6im = unsafe { _mm_load1_ps(&tw6.im) }; let twiddle7re = unsafe { _mm_load1_ps(&tw7.re) }; let twiddle7im = unsafe { _mm_load1_ps(&tw7.im) }; let twiddle8re = unsafe { _mm_load1_ps(&tw8.re) }; let twiddle8im = unsafe { _mm_load1_ps(&tw8.im) }; let twiddle9re = unsafe { _mm_load1_ps(&tw9.re) }; let twiddle9im = unsafe { _mm_load1_ps(&tw9.im) }; let twiddle10re = unsafe { _mm_load1_ps(&tw10.re) }; let twiddle10im = unsafe { _mm_load1_ps(&tw10.im) }; let twiddle11re = unsafe { _mm_load1_ps(&tw11.re) }; let twiddle11im = unsafe { _mm_load1_ps(&tw11.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[11]), extract_hi_lo_f32(input_packed[0], input_packed[12]), extract_lo_hi_f32(input_packed[1], 
input_packed[12]), extract_hi_lo_f32(input_packed[1], input_packed[13]), extract_lo_hi_f32(input_packed[2], input_packed[13]), extract_hi_lo_f32(input_packed[2], input_packed[14]), extract_lo_hi_f32(input_packed[3], input_packed[14]), extract_hi_lo_f32(input_packed[3], input_packed[15]), extract_lo_hi_f32(input_packed[4], input_packed[15]), extract_hi_lo_f32(input_packed[4], input_packed[16]), extract_lo_hi_f32(input_packed[5], input_packed[16]), extract_hi_lo_f32(input_packed[5], input_packed[17]), extract_lo_hi_f32(input_packed[6], input_packed[17]), extract_hi_lo_f32(input_packed[6], input_packed[18]), extract_lo_hi_f32(input_packed[7], input_packed[18]), extract_hi_lo_f32(input_packed[7], input_packed[19]), extract_lo_hi_f32(input_packed[8], input_packed[19]), extract_hi_lo_f32(input_packed[8], input_packed[20]), extract_lo_hi_f32(input_packed[9], input_packed[20]), extract_hi_lo_f32(input_packed[9], input_packed[21]), extract_lo_hi_f32(input_packed[10], input_packed[21]), extract_hi_lo_f32(input_packed[10], input_packed[22]), extract_lo_hi_f32(input_packed[11], input_packed[22]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_lo_f32(out[16], out[17]), extract_lo_lo_f32(out[18], out[19]), extract_lo_lo_f32(out[20], out[21]), extract_lo_hi_f32(out[22], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), extract_hi_hi_f32(out[19], out[20]), extract_hi_hi_f32(out[21], out[22]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 23]) -> [__m128; 23] { let [x1p22, x1m22] = parallel_fft2_interleaved_f32(values[1], values[22]); let [x2p21, x2m21] = parallel_fft2_interleaved_f32(values[2], values[21]); let [x3p20, x3m20] = parallel_fft2_interleaved_f32(values[3], values[20]); let [x4p19, x4m19] = parallel_fft2_interleaved_f32(values[4], values[19]); let [x5p18, x5m18] = parallel_fft2_interleaved_f32(values[5], values[18]); let [x6p17, x6m17] = parallel_fft2_interleaved_f32(values[6], values[17]); let [x7p16, x7m16] = parallel_fft2_interleaved_f32(values[7], values[16]); let [x8p15, x8m15] = parallel_fft2_interleaved_f32(values[8], values[15]); let [x9p14, x9m14] = parallel_fft2_interleaved_f32(values[9], values[14]); let [x10p13, x10m13] = parallel_fft2_interleaved_f32(values[10], values[13]); let [x11p12, x11m12] = parallel_fft2_interleaved_f32(values[11], values[12]); let t_a1_1 = _mm_mul_ps(self.twiddle1re, x1p22); let t_a1_2 = _mm_mul_ps(self.twiddle2re, x2p21); let t_a1_3 = _mm_mul_ps(self.twiddle3re, x3p20); let t_a1_4 = _mm_mul_ps(self.twiddle4re, x4p19); let t_a1_5 = _mm_mul_ps(self.twiddle5re, x5p18); let t_a1_6 = _mm_mul_ps(self.twiddle6re, x6p17); let t_a1_7 = _mm_mul_ps(self.twiddle7re, x7p16); let t_a1_8 = _mm_mul_ps(self.twiddle8re, x8p15); let t_a1_9 = _mm_mul_ps(self.twiddle9re, x9p14); let t_a1_10 = 
_mm_mul_ps(self.twiddle10re, x10p13); let t_a1_11 = _mm_mul_ps(self.twiddle11re, x11p12); let t_a2_1 = _mm_mul_ps(self.twiddle2re, x1p22); let t_a2_2 = _mm_mul_ps(self.twiddle4re, x2p21); let t_a2_3 = _mm_mul_ps(self.twiddle6re, x3p20); let t_a2_4 = _mm_mul_ps(self.twiddle8re, x4p19); let t_a2_5 = _mm_mul_ps(self.twiddle10re, x5p18); let t_a2_6 = _mm_mul_ps(self.twiddle11re, x6p17); let t_a2_7 = _mm_mul_ps(self.twiddle9re, x7p16); let t_a2_8 = _mm_mul_ps(self.twiddle7re, x8p15); let t_a2_9 = _mm_mul_ps(self.twiddle5re, x9p14); let t_a2_10 = _mm_mul_ps(self.twiddle3re, x10p13); let t_a2_11 = _mm_mul_ps(self.twiddle1re, x11p12); let t_a3_1 = _mm_mul_ps(self.twiddle3re, x1p22); let t_a3_2 = _mm_mul_ps(self.twiddle6re, x2p21); let t_a3_3 = _mm_mul_ps(self.twiddle9re, x3p20); let t_a3_4 = _mm_mul_ps(self.twiddle11re, x4p19); let t_a3_5 = _mm_mul_ps(self.twiddle8re, x5p18); let t_a3_6 = _mm_mul_ps(self.twiddle5re, x6p17); let t_a3_7 = _mm_mul_ps(self.twiddle2re, x7p16); let t_a3_8 = _mm_mul_ps(self.twiddle1re, x8p15); let t_a3_9 = _mm_mul_ps(self.twiddle4re, x9p14); let t_a3_10 = _mm_mul_ps(self.twiddle7re, x10p13); let t_a3_11 = _mm_mul_ps(self.twiddle10re, x11p12); let t_a4_1 = _mm_mul_ps(self.twiddle4re, x1p22); let t_a4_2 = _mm_mul_ps(self.twiddle8re, x2p21); let t_a4_3 = _mm_mul_ps(self.twiddle11re, x3p20); let t_a4_4 = _mm_mul_ps(self.twiddle7re, x4p19); let t_a4_5 = _mm_mul_ps(self.twiddle3re, x5p18); let t_a4_6 = _mm_mul_ps(self.twiddle1re, x6p17); let t_a4_7 = _mm_mul_ps(self.twiddle5re, x7p16); let t_a4_8 = _mm_mul_ps(self.twiddle9re, x8p15); let t_a4_9 = _mm_mul_ps(self.twiddle10re, x9p14); let t_a4_10 = _mm_mul_ps(self.twiddle6re, x10p13); let t_a4_11 = _mm_mul_ps(self.twiddle2re, x11p12); let t_a5_1 = _mm_mul_ps(self.twiddle5re, x1p22); let t_a5_2 = _mm_mul_ps(self.twiddle10re, x2p21); let t_a5_3 = _mm_mul_ps(self.twiddle8re, x3p20); let t_a5_4 = _mm_mul_ps(self.twiddle3re, x4p19); let t_a5_5 = _mm_mul_ps(self.twiddle2re, x5p18); let t_a5_6 = _mm_mul_ps(self.twiddle7re, x6p17); let t_a5_7 = _mm_mul_ps(self.twiddle11re, x7p16); let t_a5_8 = _mm_mul_ps(self.twiddle6re, x8p15); let t_a5_9 = _mm_mul_ps(self.twiddle1re, x9p14); let t_a5_10 = _mm_mul_ps(self.twiddle4re, x10p13); let t_a5_11 = _mm_mul_ps(self.twiddle9re, x11p12); let t_a6_1 = _mm_mul_ps(self.twiddle6re, x1p22); let t_a6_2 = _mm_mul_ps(self.twiddle11re, x2p21); let t_a6_3 = _mm_mul_ps(self.twiddle5re, x3p20); let t_a6_4 = _mm_mul_ps(self.twiddle1re, x4p19); let t_a6_5 = _mm_mul_ps(self.twiddle7re, x5p18); let t_a6_6 = _mm_mul_ps(self.twiddle10re, x6p17); let t_a6_7 = _mm_mul_ps(self.twiddle4re, x7p16); let t_a6_8 = _mm_mul_ps(self.twiddle2re, x8p15); let t_a6_9 = _mm_mul_ps(self.twiddle8re, x9p14); let t_a6_10 = _mm_mul_ps(self.twiddle9re, x10p13); let t_a6_11 = _mm_mul_ps(self.twiddle3re, x11p12); let t_a7_1 = _mm_mul_ps(self.twiddle7re, x1p22); let t_a7_2 = _mm_mul_ps(self.twiddle9re, x2p21); let t_a7_3 = _mm_mul_ps(self.twiddle2re, x3p20); let t_a7_4 = _mm_mul_ps(self.twiddle5re, x4p19); let t_a7_5 = _mm_mul_ps(self.twiddle11re, x5p18); let t_a7_6 = _mm_mul_ps(self.twiddle4re, x6p17); let t_a7_7 = _mm_mul_ps(self.twiddle3re, x7p16); let t_a7_8 = _mm_mul_ps(self.twiddle10re, x8p15); let t_a7_9 = _mm_mul_ps(self.twiddle6re, x9p14); let t_a7_10 = _mm_mul_ps(self.twiddle1re, x10p13); let t_a7_11 = _mm_mul_ps(self.twiddle8re, x11p12); let t_a8_1 = _mm_mul_ps(self.twiddle8re, x1p22); let t_a8_2 = _mm_mul_ps(self.twiddle7re, x2p21); let t_a8_3 = _mm_mul_ps(self.twiddle1re, x3p20); let t_a8_4 = _mm_mul_ps(self.twiddle9re, 
x4p19); let t_a8_5 = _mm_mul_ps(self.twiddle6re, x5p18); let t_a8_6 = _mm_mul_ps(self.twiddle2re, x6p17); let t_a8_7 = _mm_mul_ps(self.twiddle10re, x7p16); let t_a8_8 = _mm_mul_ps(self.twiddle5re, x8p15); let t_a8_9 = _mm_mul_ps(self.twiddle3re, x9p14); let t_a8_10 = _mm_mul_ps(self.twiddle11re, x10p13); let t_a8_11 = _mm_mul_ps(self.twiddle4re, x11p12); let t_a9_1 = _mm_mul_ps(self.twiddle9re, x1p22); let t_a9_2 = _mm_mul_ps(self.twiddle5re, x2p21); let t_a9_3 = _mm_mul_ps(self.twiddle4re, x3p20); let t_a9_4 = _mm_mul_ps(self.twiddle10re, x4p19); let t_a9_5 = _mm_mul_ps(self.twiddle1re, x5p18); let t_a9_6 = _mm_mul_ps(self.twiddle8re, x6p17); let t_a9_7 = _mm_mul_ps(self.twiddle6re, x7p16); let t_a9_8 = _mm_mul_ps(self.twiddle3re, x8p15); let t_a9_9 = _mm_mul_ps(self.twiddle11re, x9p14); let t_a9_10 = _mm_mul_ps(self.twiddle2re, x10p13); let t_a9_11 = _mm_mul_ps(self.twiddle7re, x11p12); let t_a10_1 = _mm_mul_ps(self.twiddle10re, x1p22); let t_a10_2 = _mm_mul_ps(self.twiddle3re, x2p21); let t_a10_3 = _mm_mul_ps(self.twiddle7re, x3p20); let t_a10_4 = _mm_mul_ps(self.twiddle6re, x4p19); let t_a10_5 = _mm_mul_ps(self.twiddle4re, x5p18); let t_a10_6 = _mm_mul_ps(self.twiddle9re, x6p17); let t_a10_7 = _mm_mul_ps(self.twiddle1re, x7p16); let t_a10_8 = _mm_mul_ps(self.twiddle11re, x8p15); let t_a10_9 = _mm_mul_ps(self.twiddle2re, x9p14); let t_a10_10 = _mm_mul_ps(self.twiddle8re, x10p13); let t_a10_11 = _mm_mul_ps(self.twiddle5re, x11p12); let t_a11_1 = _mm_mul_ps(self.twiddle11re, x1p22); let t_a11_2 = _mm_mul_ps(self.twiddle1re, x2p21); let t_a11_3 = _mm_mul_ps(self.twiddle10re, x3p20); let t_a11_4 = _mm_mul_ps(self.twiddle2re, x4p19); let t_a11_5 = _mm_mul_ps(self.twiddle9re, x5p18); let t_a11_6 = _mm_mul_ps(self.twiddle3re, x6p17); let t_a11_7 = _mm_mul_ps(self.twiddle8re, x7p16); let t_a11_8 = _mm_mul_ps(self.twiddle4re, x8p15); let t_a11_9 = _mm_mul_ps(self.twiddle7re, x9p14); let t_a11_10 = _mm_mul_ps(self.twiddle5re, x10p13); let t_a11_11 = _mm_mul_ps(self.twiddle6re, x11p12); let t_b1_1 = _mm_mul_ps(self.twiddle1im, x1m22); let t_b1_2 = _mm_mul_ps(self.twiddle2im, x2m21); let t_b1_3 = _mm_mul_ps(self.twiddle3im, x3m20); let t_b1_4 = _mm_mul_ps(self.twiddle4im, x4m19); let t_b1_5 = _mm_mul_ps(self.twiddle5im, x5m18); let t_b1_6 = _mm_mul_ps(self.twiddle6im, x6m17); let t_b1_7 = _mm_mul_ps(self.twiddle7im, x7m16); let t_b1_8 = _mm_mul_ps(self.twiddle8im, x8m15); let t_b1_9 = _mm_mul_ps(self.twiddle9im, x9m14); let t_b1_10 = _mm_mul_ps(self.twiddle10im, x10m13); let t_b1_11 = _mm_mul_ps(self.twiddle11im, x11m12); let t_b2_1 = _mm_mul_ps(self.twiddle2im, x1m22); let t_b2_2 = _mm_mul_ps(self.twiddle4im, x2m21); let t_b2_3 = _mm_mul_ps(self.twiddle6im, x3m20); let t_b2_4 = _mm_mul_ps(self.twiddle8im, x4m19); let t_b2_5 = _mm_mul_ps(self.twiddle10im, x5m18); let t_b2_6 = _mm_mul_ps(self.twiddle11im, x6m17); let t_b2_7 = _mm_mul_ps(self.twiddle9im, x7m16); let t_b2_8 = _mm_mul_ps(self.twiddle7im, x8m15); let t_b2_9 = _mm_mul_ps(self.twiddle5im, x9m14); let t_b2_10 = _mm_mul_ps(self.twiddle3im, x10m13); let t_b2_11 = _mm_mul_ps(self.twiddle1im, x11m12); let t_b3_1 = _mm_mul_ps(self.twiddle3im, x1m22); let t_b3_2 = _mm_mul_ps(self.twiddle6im, x2m21); let t_b3_3 = _mm_mul_ps(self.twiddle9im, x3m20); let t_b3_4 = _mm_mul_ps(self.twiddle11im, x4m19); let t_b3_5 = _mm_mul_ps(self.twiddle8im, x5m18); let t_b3_6 = _mm_mul_ps(self.twiddle5im, x6m17); let t_b3_7 = _mm_mul_ps(self.twiddle2im, x7m16); let t_b3_8 = _mm_mul_ps(self.twiddle1im, x8m15); let t_b3_9 = _mm_mul_ps(self.twiddle4im, x9m14); let 
t_b3_10 = _mm_mul_ps(self.twiddle7im, x10m13); let t_b3_11 = _mm_mul_ps(self.twiddle10im, x11m12); let t_b4_1 = _mm_mul_ps(self.twiddle4im, x1m22); let t_b4_2 = _mm_mul_ps(self.twiddle8im, x2m21); let t_b4_3 = _mm_mul_ps(self.twiddle11im, x3m20); let t_b4_4 = _mm_mul_ps(self.twiddle7im, x4m19); let t_b4_5 = _mm_mul_ps(self.twiddle3im, x5m18); let t_b4_6 = _mm_mul_ps(self.twiddle1im, x6m17); let t_b4_7 = _mm_mul_ps(self.twiddle5im, x7m16); let t_b4_8 = _mm_mul_ps(self.twiddle9im, x8m15); let t_b4_9 = _mm_mul_ps(self.twiddle10im, x9m14); let t_b4_10 = _mm_mul_ps(self.twiddle6im, x10m13); let t_b4_11 = _mm_mul_ps(self.twiddle2im, x11m12); let t_b5_1 = _mm_mul_ps(self.twiddle5im, x1m22); let t_b5_2 = _mm_mul_ps(self.twiddle10im, x2m21); let t_b5_3 = _mm_mul_ps(self.twiddle8im, x3m20); let t_b5_4 = _mm_mul_ps(self.twiddle3im, x4m19); let t_b5_5 = _mm_mul_ps(self.twiddle2im, x5m18); let t_b5_6 = _mm_mul_ps(self.twiddle7im, x6m17); let t_b5_7 = _mm_mul_ps(self.twiddle11im, x7m16); let t_b5_8 = _mm_mul_ps(self.twiddle6im, x8m15); let t_b5_9 = _mm_mul_ps(self.twiddle1im, x9m14); let t_b5_10 = _mm_mul_ps(self.twiddle4im, x10m13); let t_b5_11 = _mm_mul_ps(self.twiddle9im, x11m12); let t_b6_1 = _mm_mul_ps(self.twiddle6im, x1m22); let t_b6_2 = _mm_mul_ps(self.twiddle11im, x2m21); let t_b6_3 = _mm_mul_ps(self.twiddle5im, x3m20); let t_b6_4 = _mm_mul_ps(self.twiddle1im, x4m19); let t_b6_5 = _mm_mul_ps(self.twiddle7im, x5m18); let t_b6_6 = _mm_mul_ps(self.twiddle10im, x6m17); let t_b6_7 = _mm_mul_ps(self.twiddle4im, x7m16); let t_b6_8 = _mm_mul_ps(self.twiddle2im, x8m15); let t_b6_9 = _mm_mul_ps(self.twiddle8im, x9m14); let t_b6_10 = _mm_mul_ps(self.twiddle9im, x10m13); let t_b6_11 = _mm_mul_ps(self.twiddle3im, x11m12); let t_b7_1 = _mm_mul_ps(self.twiddle7im, x1m22); let t_b7_2 = _mm_mul_ps(self.twiddle9im, x2m21); let t_b7_3 = _mm_mul_ps(self.twiddle2im, x3m20); let t_b7_4 = _mm_mul_ps(self.twiddle5im, x4m19); let t_b7_5 = _mm_mul_ps(self.twiddle11im, x5m18); let t_b7_6 = _mm_mul_ps(self.twiddle4im, x6m17); let t_b7_7 = _mm_mul_ps(self.twiddle3im, x7m16); let t_b7_8 = _mm_mul_ps(self.twiddle10im, x8m15); let t_b7_9 = _mm_mul_ps(self.twiddle6im, x9m14); let t_b7_10 = _mm_mul_ps(self.twiddle1im, x10m13); let t_b7_11 = _mm_mul_ps(self.twiddle8im, x11m12); let t_b8_1 = _mm_mul_ps(self.twiddle8im, x1m22); let t_b8_2 = _mm_mul_ps(self.twiddle7im, x2m21); let t_b8_3 = _mm_mul_ps(self.twiddle1im, x3m20); let t_b8_4 = _mm_mul_ps(self.twiddle9im, x4m19); let t_b8_5 = _mm_mul_ps(self.twiddle6im, x5m18); let t_b8_6 = _mm_mul_ps(self.twiddle2im, x6m17); let t_b8_7 = _mm_mul_ps(self.twiddle10im, x7m16); let t_b8_8 = _mm_mul_ps(self.twiddle5im, x8m15); let t_b8_9 = _mm_mul_ps(self.twiddle3im, x9m14); let t_b8_10 = _mm_mul_ps(self.twiddle11im, x10m13); let t_b8_11 = _mm_mul_ps(self.twiddle4im, x11m12); let t_b9_1 = _mm_mul_ps(self.twiddle9im, x1m22); let t_b9_2 = _mm_mul_ps(self.twiddle5im, x2m21); let t_b9_3 = _mm_mul_ps(self.twiddle4im, x3m20); let t_b9_4 = _mm_mul_ps(self.twiddle10im, x4m19); let t_b9_5 = _mm_mul_ps(self.twiddle1im, x5m18); let t_b9_6 = _mm_mul_ps(self.twiddle8im, x6m17); let t_b9_7 = _mm_mul_ps(self.twiddle6im, x7m16); let t_b9_8 = _mm_mul_ps(self.twiddle3im, x8m15); let t_b9_9 = _mm_mul_ps(self.twiddle11im, x9m14); let t_b9_10 = _mm_mul_ps(self.twiddle2im, x10m13); let t_b9_11 = _mm_mul_ps(self.twiddle7im, x11m12); let t_b10_1 = _mm_mul_ps(self.twiddle10im, x1m22); let t_b10_2 = _mm_mul_ps(self.twiddle3im, x2m21); let t_b10_3 = _mm_mul_ps(self.twiddle7im, x3m20); let t_b10_4 = 
_mm_mul_ps(self.twiddle6im, x4m19); let t_b10_5 = _mm_mul_ps(self.twiddle4im, x5m18); let t_b10_6 = _mm_mul_ps(self.twiddle9im, x6m17); let t_b10_7 = _mm_mul_ps(self.twiddle1im, x7m16); let t_b10_8 = _mm_mul_ps(self.twiddle11im, x8m15); let t_b10_9 = _mm_mul_ps(self.twiddle2im, x9m14); let t_b10_10 = _mm_mul_ps(self.twiddle8im, x10m13); let t_b10_11 = _mm_mul_ps(self.twiddle5im, x11m12); let t_b11_1 = _mm_mul_ps(self.twiddle11im, x1m22); let t_b11_2 = _mm_mul_ps(self.twiddle1im, x2m21); let t_b11_3 = _mm_mul_ps(self.twiddle10im, x3m20); let t_b11_4 = _mm_mul_ps(self.twiddle2im, x4m19); let t_b11_5 = _mm_mul_ps(self.twiddle9im, x5m18); let t_b11_6 = _mm_mul_ps(self.twiddle3im, x6m17); let t_b11_7 = _mm_mul_ps(self.twiddle8im, x7m16); let t_b11_8 = _mm_mul_ps(self.twiddle4im, x8m15); let t_b11_9 = _mm_mul_ps(self.twiddle7im, x9m14); let t_b11_10 = _mm_mul_ps(self.twiddle5im, x10m13); let t_b11_11 = _mm_mul_ps(self.twiddle6im, x11m12); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11); let t_a7 = calc_f32!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11); let t_a8 = calc_f32!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11); let t_a9 = calc_f32!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11); let t_a10 = calc_f32!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11); let t_a11 = calc_f32!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 + t_b3_8 + t_b3_9 + t_b3_10 + t_b3_11); let t_b4 = calc_f32!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 - t_b4_5 + t_b4_6 + t_b4_7 + t_b4_8 - t_b4_9 - t_b4_10 - t_b4_11); let t_b5 = calc_f32!(t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 + t_b5_5 + t_b5_6 - t_b5_7 - t_b5_8 - t_b5_9 + t_b5_10 + t_b5_11); let t_b6 = calc_f32!(t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 + t_b6_5 - t_b6_6 - t_b6_7 + t_b6_8 + t_b6_9 - t_b6_10 - t_b6_11); let t_b7 = calc_f32!(t_b7_1 - t_b7_2 - t_b7_3 + t_b7_4 - t_b7_5 - t_b7_6 + t_b7_7 + t_b7_8 - t_b7_9 + t_b7_10 + t_b7_11); let t_b8 = calc_f32!(t_b8_1 - t_b8_2 + t_b8_3 + t_b8_4 - t_b8_5 + t_b8_6 + t_b8_7 - t_b8_8 + t_b8_9 + t_b8_10 - t_b8_11); let t_b9 = calc_f32!(t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 - t_b9_5 + t_b9_6 - t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11); let t_b10 = calc_f32!(t_b10_1 - t_b10_2 + t_b10_3 - t_b10_4 + 
t_b10_5 - t_b10_6 + t_b10_7 + t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11); let t_b11 = calc_f32!(t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 + t_b11_5 - t_b11_6 + t_b11_7 - t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let t_b9_rot = self.rotate.rotate_both(t_b9); let t_b10_rot = self.rotate.rotate_both(t_b10); let t_b11_rot = self.rotate.rotate_both(t_b11); let y0 = calc_f32!(x0 + x1p22 + x2p21 + x3p20 + x4p19 + x5p18 + x6p17 + x7p16 + x8p15 + x9p14 + x10p13 + x11p12); let [y1, y22] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y21] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y20] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y19] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y18] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y17] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); let [y7, y16] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot); let [y8, y15] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot); let [y9, y14] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot); let [y10, y13] = parallel_fft2_interleaved_f32(t_a10, t_b10_rot); let [y11, y12] = parallel_fft2_interleaved_f32(t_a11, t_b11_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22] } } // ____ _____ __ _ _ _ _ _ // |___ \|___ / / /_ | || | | |__ (_) |_ // __) | |_ \ _____ | '_ \| || |_| '_ \| | __| // / __/ ___) | |_____| | (_) |__ _| |_) | | |_ // |_____|____/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly23 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: __m128d, twiddle1im: __m128d, twiddle2re: __m128d, twiddle2im: __m128d, twiddle3re: __m128d, twiddle3im: __m128d, twiddle4re: __m128d, twiddle4im: __m128d, twiddle5re: __m128d, twiddle5im: __m128d, twiddle6re: __m128d, twiddle6im: __m128d, twiddle7re: __m128d, twiddle7im: __m128d, twiddle8re: __m128d, twiddle8im: __m128d, twiddle9re: __m128d, twiddle9im: __m128d, twiddle10re: __m128d, twiddle10im: __m128d, twiddle11re: __m128d, twiddle11im: __m128d, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly23, 23, |this: &SseF64Butterfly23<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly23, 23, |this: &SseF64Butterfly23<_>| this .direction); impl SseF64Butterfly23 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 23, direction); let tw2: Complex = twiddles::compute_twiddle(2, 23, direction); let tw3: Complex = twiddles::compute_twiddle(3, 23, direction); let tw4: Complex = twiddles::compute_twiddle(4, 23, direction); let tw5: Complex = twiddles::compute_twiddle(5, 23, direction); let tw6: Complex = twiddles::compute_twiddle(6, 23, direction); let tw7: Complex = twiddles::compute_twiddle(7, 23, direction); let tw8: Complex = twiddles::compute_twiddle(8, 23, direction); let tw9: Complex = twiddles::compute_twiddle(9, 23, direction); let tw10: Complex = twiddles::compute_twiddle(10, 23, direction); let tw11: Complex = twiddles::compute_twiddle(11, 23, direction); let twiddle1re = unsafe { _mm_set_pd(tw1.re, tw1.re) }; let 
twiddle1im = unsafe { _mm_set_pd(tw1.im, tw1.im) }; let twiddle2re = unsafe { _mm_set_pd(tw2.re, tw2.re) }; let twiddle2im = unsafe { _mm_set_pd(tw2.im, tw2.im) }; let twiddle3re = unsafe { _mm_set_pd(tw3.re, tw3.re) }; let twiddle3im = unsafe { _mm_set_pd(tw3.im, tw3.im) }; let twiddle4re = unsafe { _mm_set_pd(tw4.re, tw4.re) }; let twiddle4im = unsafe { _mm_set_pd(tw4.im, tw4.im) }; let twiddle5re = unsafe { _mm_set_pd(tw5.re, tw5.re) }; let twiddle5im = unsafe { _mm_set_pd(tw5.im, tw5.im) }; let twiddle6re = unsafe { _mm_set_pd(tw6.re, tw6.re) }; let twiddle6im = unsafe { _mm_set_pd(tw6.im, tw6.im) }; let twiddle7re = unsafe { _mm_set_pd(tw7.re, tw7.re) }; let twiddle7im = unsafe { _mm_set_pd(tw7.im, tw7.im) }; let twiddle8re = unsafe { _mm_set_pd(tw8.re, tw8.re) }; let twiddle8im = unsafe { _mm_set_pd(tw8.im, tw8.im) }; let twiddle9re = unsafe { _mm_set_pd(tw9.re, tw9.re) }; let twiddle9im = unsafe { _mm_set_pd(tw9.im, tw9.im) }; let twiddle10re = unsafe { _mm_set_pd(tw10.re, tw10.re) }; let twiddle10im = unsafe { _mm_set_pd(tw10.im, tw10.im) }; let twiddle11re = unsafe { _mm_set_pd(tw11.re, tw11.re) }; let twiddle11im = unsafe { _mm_set_pd(tw11.im, tw11.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 23]) -> [__m128d; 23] { let [x1p22, x1m22] = solo_fft2_f64(values[1], values[22]); let [x2p21, x2m21] = solo_fft2_f64(values[2], values[21]); let [x3p20, x3m20] = solo_fft2_f64(values[3], values[20]); let [x4p19, x4m19] = solo_fft2_f64(values[4], values[19]); let [x5p18, x5m18] = solo_fft2_f64(values[5], values[18]); let [x6p17, x6m17] = solo_fft2_f64(values[6], values[17]); let [x7p16, x7m16] = solo_fft2_f64(values[7], values[16]); let [x8p15, x8m15] = solo_fft2_f64(values[8], values[15]); let [x9p14, x9m14] = solo_fft2_f64(values[9], values[14]); let [x10p13, x10m13] = solo_fft2_f64(values[10], values[13]); let [x11p12, x11m12] = solo_fft2_f64(values[11], values[12]); let t_a1_1 = _mm_mul_pd(self.twiddle1re, x1p22); let t_a1_2 = _mm_mul_pd(self.twiddle2re, x2p21); let t_a1_3 = _mm_mul_pd(self.twiddle3re, x3p20); let t_a1_4 = _mm_mul_pd(self.twiddle4re, x4p19); let t_a1_5 = _mm_mul_pd(self.twiddle5re, x5p18); let t_a1_6 = _mm_mul_pd(self.twiddle6re, x6p17); let t_a1_7 = _mm_mul_pd(self.twiddle7re, x7p16); let t_a1_8 = _mm_mul_pd(self.twiddle8re, x8p15); let t_a1_9 = _mm_mul_pd(self.twiddle9re, x9p14); let t_a1_10 = _mm_mul_pd(self.twiddle10re, x10p13); let t_a1_11 = _mm_mul_pd(self.twiddle11re, x11p12); let t_a2_1 = _mm_mul_pd(self.twiddle2re, x1p22); let t_a2_2 = _mm_mul_pd(self.twiddle4re, x2p21); let t_a2_3 = _mm_mul_pd(self.twiddle6re, x3p20); let t_a2_4 = _mm_mul_pd(self.twiddle8re, x4p19); let t_a2_5 = _mm_mul_pd(self.twiddle10re, x5p18); let t_a2_6 = _mm_mul_pd(self.twiddle11re, x6p17); let t_a2_7 = 
_mm_mul_pd(self.twiddle9re, x7p16); let t_a2_8 = _mm_mul_pd(self.twiddle7re, x8p15); let t_a2_9 = _mm_mul_pd(self.twiddle5re, x9p14); let t_a2_10 = _mm_mul_pd(self.twiddle3re, x10p13); let t_a2_11 = _mm_mul_pd(self.twiddle1re, x11p12); let t_a3_1 = _mm_mul_pd(self.twiddle3re, x1p22); let t_a3_2 = _mm_mul_pd(self.twiddle6re, x2p21); let t_a3_3 = _mm_mul_pd(self.twiddle9re, x3p20); let t_a3_4 = _mm_mul_pd(self.twiddle11re, x4p19); let t_a3_5 = _mm_mul_pd(self.twiddle8re, x5p18); let t_a3_6 = _mm_mul_pd(self.twiddle5re, x6p17); let t_a3_7 = _mm_mul_pd(self.twiddle2re, x7p16); let t_a3_8 = _mm_mul_pd(self.twiddle1re, x8p15); let t_a3_9 = _mm_mul_pd(self.twiddle4re, x9p14); let t_a3_10 = _mm_mul_pd(self.twiddle7re, x10p13); let t_a3_11 = _mm_mul_pd(self.twiddle10re, x11p12); let t_a4_1 = _mm_mul_pd(self.twiddle4re, x1p22); let t_a4_2 = _mm_mul_pd(self.twiddle8re, x2p21); let t_a4_3 = _mm_mul_pd(self.twiddle11re, x3p20); let t_a4_4 = _mm_mul_pd(self.twiddle7re, x4p19); let t_a4_5 = _mm_mul_pd(self.twiddle3re, x5p18); let t_a4_6 = _mm_mul_pd(self.twiddle1re, x6p17); let t_a4_7 = _mm_mul_pd(self.twiddle5re, x7p16); let t_a4_8 = _mm_mul_pd(self.twiddle9re, x8p15); let t_a4_9 = _mm_mul_pd(self.twiddle10re, x9p14); let t_a4_10 = _mm_mul_pd(self.twiddle6re, x10p13); let t_a4_11 = _mm_mul_pd(self.twiddle2re, x11p12); let t_a5_1 = _mm_mul_pd(self.twiddle5re, x1p22); let t_a5_2 = _mm_mul_pd(self.twiddle10re, x2p21); let t_a5_3 = _mm_mul_pd(self.twiddle8re, x3p20); let t_a5_4 = _mm_mul_pd(self.twiddle3re, x4p19); let t_a5_5 = _mm_mul_pd(self.twiddle2re, x5p18); let t_a5_6 = _mm_mul_pd(self.twiddle7re, x6p17); let t_a5_7 = _mm_mul_pd(self.twiddle11re, x7p16); let t_a5_8 = _mm_mul_pd(self.twiddle6re, x8p15); let t_a5_9 = _mm_mul_pd(self.twiddle1re, x9p14); let t_a5_10 = _mm_mul_pd(self.twiddle4re, x10p13); let t_a5_11 = _mm_mul_pd(self.twiddle9re, x11p12); let t_a6_1 = _mm_mul_pd(self.twiddle6re, x1p22); let t_a6_2 = _mm_mul_pd(self.twiddle11re, x2p21); let t_a6_3 = _mm_mul_pd(self.twiddle5re, x3p20); let t_a6_4 = _mm_mul_pd(self.twiddle1re, x4p19); let t_a6_5 = _mm_mul_pd(self.twiddle7re, x5p18); let t_a6_6 = _mm_mul_pd(self.twiddle10re, x6p17); let t_a6_7 = _mm_mul_pd(self.twiddle4re, x7p16); let t_a6_8 = _mm_mul_pd(self.twiddle2re, x8p15); let t_a6_9 = _mm_mul_pd(self.twiddle8re, x9p14); let t_a6_10 = _mm_mul_pd(self.twiddle9re, x10p13); let t_a6_11 = _mm_mul_pd(self.twiddle3re, x11p12); let t_a7_1 = _mm_mul_pd(self.twiddle7re, x1p22); let t_a7_2 = _mm_mul_pd(self.twiddle9re, x2p21); let t_a7_3 = _mm_mul_pd(self.twiddle2re, x3p20); let t_a7_4 = _mm_mul_pd(self.twiddle5re, x4p19); let t_a7_5 = _mm_mul_pd(self.twiddle11re, x5p18); let t_a7_6 = _mm_mul_pd(self.twiddle4re, x6p17); let t_a7_7 = _mm_mul_pd(self.twiddle3re, x7p16); let t_a7_8 = _mm_mul_pd(self.twiddle10re, x8p15); let t_a7_9 = _mm_mul_pd(self.twiddle6re, x9p14); let t_a7_10 = _mm_mul_pd(self.twiddle1re, x10p13); let t_a7_11 = _mm_mul_pd(self.twiddle8re, x11p12); let t_a8_1 = _mm_mul_pd(self.twiddle8re, x1p22); let t_a8_2 = _mm_mul_pd(self.twiddle7re, x2p21); let t_a8_3 = _mm_mul_pd(self.twiddle1re, x3p20); let t_a8_4 = _mm_mul_pd(self.twiddle9re, x4p19); let t_a8_5 = _mm_mul_pd(self.twiddle6re, x5p18); let t_a8_6 = _mm_mul_pd(self.twiddle2re, x6p17); let t_a8_7 = _mm_mul_pd(self.twiddle10re, x7p16); let t_a8_8 = _mm_mul_pd(self.twiddle5re, x8p15); let t_a8_9 = _mm_mul_pd(self.twiddle3re, x9p14); let t_a8_10 = _mm_mul_pd(self.twiddle11re, x10p13); let t_a8_11 = _mm_mul_pd(self.twiddle4re, x11p12); let t_a9_1 = _mm_mul_pd(self.twiddle9re, x1p22); 
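// The remaining products follow the same pattern as above: each t_a{k}_{j} multiplies a
// pairwise sum (x1p22, x2p21, ...) by the real part of twiddle (k*j mod 23), and each
// t_b{k}_{j} multiplies the matching pairwise difference (x1m22, x2m21, ...) by its
// imaginary part. Indices above 11 are folded back, since twiddle 23-i is the conjugate
// of twiddle i; the sign flips introduced by that folding show up as the +/- patterns in
// the t_b sums below.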
let t_a9_2 = _mm_mul_pd(self.twiddle5re, x2p21); let t_a9_3 = _mm_mul_pd(self.twiddle4re, x3p20); let t_a9_4 = _mm_mul_pd(self.twiddle10re, x4p19); let t_a9_5 = _mm_mul_pd(self.twiddle1re, x5p18); let t_a9_6 = _mm_mul_pd(self.twiddle8re, x6p17); let t_a9_7 = _mm_mul_pd(self.twiddle6re, x7p16); let t_a9_8 = _mm_mul_pd(self.twiddle3re, x8p15); let t_a9_9 = _mm_mul_pd(self.twiddle11re, x9p14); let t_a9_10 = _mm_mul_pd(self.twiddle2re, x10p13); let t_a9_11 = _mm_mul_pd(self.twiddle7re, x11p12); let t_a10_1 = _mm_mul_pd(self.twiddle10re, x1p22); let t_a10_2 = _mm_mul_pd(self.twiddle3re, x2p21); let t_a10_3 = _mm_mul_pd(self.twiddle7re, x3p20); let t_a10_4 = _mm_mul_pd(self.twiddle6re, x4p19); let t_a10_5 = _mm_mul_pd(self.twiddle4re, x5p18); let t_a10_6 = _mm_mul_pd(self.twiddle9re, x6p17); let t_a10_7 = _mm_mul_pd(self.twiddle1re, x7p16); let t_a10_8 = _mm_mul_pd(self.twiddle11re, x8p15); let t_a10_9 = _mm_mul_pd(self.twiddle2re, x9p14); let t_a10_10 = _mm_mul_pd(self.twiddle8re, x10p13); let t_a10_11 = _mm_mul_pd(self.twiddle5re, x11p12); let t_a11_1 = _mm_mul_pd(self.twiddle11re, x1p22); let t_a11_2 = _mm_mul_pd(self.twiddle1re, x2p21); let t_a11_3 = _mm_mul_pd(self.twiddle10re, x3p20); let t_a11_4 = _mm_mul_pd(self.twiddle2re, x4p19); let t_a11_5 = _mm_mul_pd(self.twiddle9re, x5p18); let t_a11_6 = _mm_mul_pd(self.twiddle3re, x6p17); let t_a11_7 = _mm_mul_pd(self.twiddle8re, x7p16); let t_a11_8 = _mm_mul_pd(self.twiddle4re, x8p15); let t_a11_9 = _mm_mul_pd(self.twiddle7re, x9p14); let t_a11_10 = _mm_mul_pd(self.twiddle5re, x10p13); let t_a11_11 = _mm_mul_pd(self.twiddle6re, x11p12); let t_b1_1 = _mm_mul_pd(self.twiddle1im, x1m22); let t_b1_2 = _mm_mul_pd(self.twiddle2im, x2m21); let t_b1_3 = _mm_mul_pd(self.twiddle3im, x3m20); let t_b1_4 = _mm_mul_pd(self.twiddle4im, x4m19); let t_b1_5 = _mm_mul_pd(self.twiddle5im, x5m18); let t_b1_6 = _mm_mul_pd(self.twiddle6im, x6m17); let t_b1_7 = _mm_mul_pd(self.twiddle7im, x7m16); let t_b1_8 = _mm_mul_pd(self.twiddle8im, x8m15); let t_b1_9 = _mm_mul_pd(self.twiddle9im, x9m14); let t_b1_10 = _mm_mul_pd(self.twiddle10im, x10m13); let t_b1_11 = _mm_mul_pd(self.twiddle11im, x11m12); let t_b2_1 = _mm_mul_pd(self.twiddle2im, x1m22); let t_b2_2 = _mm_mul_pd(self.twiddle4im, x2m21); let t_b2_3 = _mm_mul_pd(self.twiddle6im, x3m20); let t_b2_4 = _mm_mul_pd(self.twiddle8im, x4m19); let t_b2_5 = _mm_mul_pd(self.twiddle10im, x5m18); let t_b2_6 = _mm_mul_pd(self.twiddle11im, x6m17); let t_b2_7 = _mm_mul_pd(self.twiddle9im, x7m16); let t_b2_8 = _mm_mul_pd(self.twiddle7im, x8m15); let t_b2_9 = _mm_mul_pd(self.twiddle5im, x9m14); let t_b2_10 = _mm_mul_pd(self.twiddle3im, x10m13); let t_b2_11 = _mm_mul_pd(self.twiddle1im, x11m12); let t_b3_1 = _mm_mul_pd(self.twiddle3im, x1m22); let t_b3_2 = _mm_mul_pd(self.twiddle6im, x2m21); let t_b3_3 = _mm_mul_pd(self.twiddle9im, x3m20); let t_b3_4 = _mm_mul_pd(self.twiddle11im, x4m19); let t_b3_5 = _mm_mul_pd(self.twiddle8im, x5m18); let t_b3_6 = _mm_mul_pd(self.twiddle5im, x6m17); let t_b3_7 = _mm_mul_pd(self.twiddle2im, x7m16); let t_b3_8 = _mm_mul_pd(self.twiddle1im, x8m15); let t_b3_9 = _mm_mul_pd(self.twiddle4im, x9m14); let t_b3_10 = _mm_mul_pd(self.twiddle7im, x10m13); let t_b3_11 = _mm_mul_pd(self.twiddle10im, x11m12); let t_b4_1 = _mm_mul_pd(self.twiddle4im, x1m22); let t_b4_2 = _mm_mul_pd(self.twiddle8im, x2m21); let t_b4_3 = _mm_mul_pd(self.twiddle11im, x3m20); let t_b4_4 = _mm_mul_pd(self.twiddle7im, x4m19); let t_b4_5 = _mm_mul_pd(self.twiddle3im, x5m18); let t_b4_6 = _mm_mul_pd(self.twiddle1im, x6m17); let t_b4_7 = 
_mm_mul_pd(self.twiddle5im, x7m16); let t_b4_8 = _mm_mul_pd(self.twiddle9im, x8m15); let t_b4_9 = _mm_mul_pd(self.twiddle10im, x9m14); let t_b4_10 = _mm_mul_pd(self.twiddle6im, x10m13); let t_b4_11 = _mm_mul_pd(self.twiddle2im, x11m12); let t_b5_1 = _mm_mul_pd(self.twiddle5im, x1m22); let t_b5_2 = _mm_mul_pd(self.twiddle10im, x2m21); let t_b5_3 = _mm_mul_pd(self.twiddle8im, x3m20); let t_b5_4 = _mm_mul_pd(self.twiddle3im, x4m19); let t_b5_5 = _mm_mul_pd(self.twiddle2im, x5m18); let t_b5_6 = _mm_mul_pd(self.twiddle7im, x6m17); let t_b5_7 = _mm_mul_pd(self.twiddle11im, x7m16); let t_b5_8 = _mm_mul_pd(self.twiddle6im, x8m15); let t_b5_9 = _mm_mul_pd(self.twiddle1im, x9m14); let t_b5_10 = _mm_mul_pd(self.twiddle4im, x10m13); let t_b5_11 = _mm_mul_pd(self.twiddle9im, x11m12); let t_b6_1 = _mm_mul_pd(self.twiddle6im, x1m22); let t_b6_2 = _mm_mul_pd(self.twiddle11im, x2m21); let t_b6_3 = _mm_mul_pd(self.twiddle5im, x3m20); let t_b6_4 = _mm_mul_pd(self.twiddle1im, x4m19); let t_b6_5 = _mm_mul_pd(self.twiddle7im, x5m18); let t_b6_6 = _mm_mul_pd(self.twiddle10im, x6m17); let t_b6_7 = _mm_mul_pd(self.twiddle4im, x7m16); let t_b6_8 = _mm_mul_pd(self.twiddle2im, x8m15); let t_b6_9 = _mm_mul_pd(self.twiddle8im, x9m14); let t_b6_10 = _mm_mul_pd(self.twiddle9im, x10m13); let t_b6_11 = _mm_mul_pd(self.twiddle3im, x11m12); let t_b7_1 = _mm_mul_pd(self.twiddle7im, x1m22); let t_b7_2 = _mm_mul_pd(self.twiddle9im, x2m21); let t_b7_3 = _mm_mul_pd(self.twiddle2im, x3m20); let t_b7_4 = _mm_mul_pd(self.twiddle5im, x4m19); let t_b7_5 = _mm_mul_pd(self.twiddle11im, x5m18); let t_b7_6 = _mm_mul_pd(self.twiddle4im, x6m17); let t_b7_7 = _mm_mul_pd(self.twiddle3im, x7m16); let t_b7_8 = _mm_mul_pd(self.twiddle10im, x8m15); let t_b7_9 = _mm_mul_pd(self.twiddle6im, x9m14); let t_b7_10 = _mm_mul_pd(self.twiddle1im, x10m13); let t_b7_11 = _mm_mul_pd(self.twiddle8im, x11m12); let t_b8_1 = _mm_mul_pd(self.twiddle8im, x1m22); let t_b8_2 = _mm_mul_pd(self.twiddle7im, x2m21); let t_b8_3 = _mm_mul_pd(self.twiddle1im, x3m20); let t_b8_4 = _mm_mul_pd(self.twiddle9im, x4m19); let t_b8_5 = _mm_mul_pd(self.twiddle6im, x5m18); let t_b8_6 = _mm_mul_pd(self.twiddle2im, x6m17); let t_b8_7 = _mm_mul_pd(self.twiddle10im, x7m16); let t_b8_8 = _mm_mul_pd(self.twiddle5im, x8m15); let t_b8_9 = _mm_mul_pd(self.twiddle3im, x9m14); let t_b8_10 = _mm_mul_pd(self.twiddle11im, x10m13); let t_b8_11 = _mm_mul_pd(self.twiddle4im, x11m12); let t_b9_1 = _mm_mul_pd(self.twiddle9im, x1m22); let t_b9_2 = _mm_mul_pd(self.twiddle5im, x2m21); let t_b9_3 = _mm_mul_pd(self.twiddle4im, x3m20); let t_b9_4 = _mm_mul_pd(self.twiddle10im, x4m19); let t_b9_5 = _mm_mul_pd(self.twiddle1im, x5m18); let t_b9_6 = _mm_mul_pd(self.twiddle8im, x6m17); let t_b9_7 = _mm_mul_pd(self.twiddle6im, x7m16); let t_b9_8 = _mm_mul_pd(self.twiddle3im, x8m15); let t_b9_9 = _mm_mul_pd(self.twiddle11im, x9m14); let t_b9_10 = _mm_mul_pd(self.twiddle2im, x10m13); let t_b9_11 = _mm_mul_pd(self.twiddle7im, x11m12); let t_b10_1 = _mm_mul_pd(self.twiddle10im, x1m22); let t_b10_2 = _mm_mul_pd(self.twiddle3im, x2m21); let t_b10_3 = _mm_mul_pd(self.twiddle7im, x3m20); let t_b10_4 = _mm_mul_pd(self.twiddle6im, x4m19); let t_b10_5 = _mm_mul_pd(self.twiddle4im, x5m18); let t_b10_6 = _mm_mul_pd(self.twiddle9im, x6m17); let t_b10_7 = _mm_mul_pd(self.twiddle1im, x7m16); let t_b10_8 = _mm_mul_pd(self.twiddle11im, x8m15); let t_b10_9 = _mm_mul_pd(self.twiddle2im, x9m14); let t_b10_10 = _mm_mul_pd(self.twiddle8im, x10m13); let t_b10_11 = _mm_mul_pd(self.twiddle5im, x11m12); let t_b11_1 = 
_mm_mul_pd(self.twiddle11im, x1m22); let t_b11_2 = _mm_mul_pd(self.twiddle1im, x2m21); let t_b11_3 = _mm_mul_pd(self.twiddle10im, x3m20); let t_b11_4 = _mm_mul_pd(self.twiddle2im, x4m19); let t_b11_5 = _mm_mul_pd(self.twiddle9im, x5m18); let t_b11_6 = _mm_mul_pd(self.twiddle3im, x6m17); let t_b11_7 = _mm_mul_pd(self.twiddle8im, x7m16); let t_b11_8 = _mm_mul_pd(self.twiddle4im, x8m15); let t_b11_9 = _mm_mul_pd(self.twiddle7im, x9m14); let t_b11_10 = _mm_mul_pd(self.twiddle5im, x10m13); let t_b11_11 = _mm_mul_pd(self.twiddle6im, x11m12); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11); let t_a7 = calc_f64!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11); let t_a8 = calc_f64!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11); let t_a9 = calc_f64!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11); let t_a10 = calc_f64!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11); let t_a11 = calc_f64!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 + t_b3_8 + t_b3_9 + t_b3_10 + t_b3_11); let t_b4 = calc_f64!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 - t_b4_5 + t_b4_6 + t_b4_7 + t_b4_8 - t_b4_9 - t_b4_10 - t_b4_11); let t_b5 = calc_f64!(t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 + t_b5_5 + t_b5_6 - t_b5_7 - t_b5_8 - t_b5_9 + t_b5_10 + t_b5_11); let t_b6 = calc_f64!(t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 + t_b6_5 - t_b6_6 - t_b6_7 + t_b6_8 + t_b6_9 - t_b6_10 - t_b6_11); let t_b7 = calc_f64!(t_b7_1 - t_b7_2 - t_b7_3 + t_b7_4 - t_b7_5 - t_b7_6 + t_b7_7 + t_b7_8 - t_b7_9 + t_b7_10 + t_b7_11); let t_b8 = calc_f64!(t_b8_1 - t_b8_2 + t_b8_3 + t_b8_4 - t_b8_5 + t_b8_6 + t_b8_7 - t_b8_8 + t_b8_9 + t_b8_10 - t_b8_11); let t_b9 = calc_f64!(t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 - t_b9_5 + t_b9_6 - t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11); let t_b10 = calc_f64!(t_b10_1 - t_b10_2 + t_b10_3 - t_b10_4 + t_b10_5 - t_b10_6 + t_b10_7 + t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11); let t_b11 = calc_f64!(t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 + t_b11_5 - t_b11_6 + t_b11_7 - t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); 
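// rotate() swaps the real and imaginary lanes of each complex element with a sign
// change, i.e. multiplies by +/-i. That turns the "imaginary part" sums t_b{k} into the
// antisymmetric contributions, so the size-2 butterflies below yield the output pairs
// as y{k} = t_a{k} + t_b{k}_rot and y{23-k} = t_a{k} - t_b{k}_rot.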
let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let t_b10_rot = self.rotate.rotate(t_b10); let t_b11_rot = self.rotate.rotate(t_b11); let y0 = calc_f64!(x0 + x1p22 + x2p21 + x3p20 + x4p19 + x5p18 + x6p17 + x7p16 + x8p15 + x9p14 + x10p13 + x11p12); let [y1, y22] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y21] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y20] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y19] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y18] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y17] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y16] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y15] = solo_fft2_f64(t_a8, t_b8_rot); let [y9, y14] = solo_fft2_f64(t_a9, t_b9_rot); let [y10, y13] = solo_fft2_f64(t_a10, t_b10_rot); let [y11, y12] = solo_fft2_f64(t_a11, t_b11_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22] } } // ____ ___ _________ _ _ _ // |___ \ / _ \ |___ /___ \| |__ (_) |_ // __) | (_) | _____ |_ \ __) | '_ \| | __| // / __/ \__, | |_____| ___) / __/| |_) | | |_ // |_____| /_/ |____/_____|_.__/|_|\__| // pub struct SseF32Butterfly29 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: __m128, twiddle1im: __m128, twiddle2re: __m128, twiddle2im: __m128, twiddle3re: __m128, twiddle3im: __m128, twiddle4re: __m128, twiddle4im: __m128, twiddle5re: __m128, twiddle5im: __m128, twiddle6re: __m128, twiddle6im: __m128, twiddle7re: __m128, twiddle7im: __m128, twiddle8re: __m128, twiddle8im: __m128, twiddle9re: __m128, twiddle9im: __m128, twiddle10re: __m128, twiddle10im: __m128, twiddle11re: __m128, twiddle11im: __m128, twiddle12re: __m128, twiddle12im: __m128, twiddle13re: __m128, twiddle13im: __m128, twiddle14re: __m128, twiddle14im: __m128, } boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly29, 29, |this: &SseF32Butterfly29<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF32Butterfly29, 29, |this: &SseF32Butterfly29<_>| this .direction); impl SseF32Butterfly29 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 29, direction); let tw2: Complex = twiddles::compute_twiddle(2, 29, direction); let tw3: Complex = twiddles::compute_twiddle(3, 29, direction); let tw4: Complex = twiddles::compute_twiddle(4, 29, direction); let tw5: Complex = twiddles::compute_twiddle(5, 29, direction); let tw6: Complex = twiddles::compute_twiddle(6, 29, direction); let tw7: Complex = twiddles::compute_twiddle(7, 29, direction); let tw8: Complex = twiddles::compute_twiddle(8, 29, direction); let tw9: Complex = twiddles::compute_twiddle(9, 29, direction); let tw10: Complex = twiddles::compute_twiddle(10, 29, direction); let tw11: Complex = twiddles::compute_twiddle(11, 29, direction); let tw12: Complex = twiddles::compute_twiddle(12, 29, direction); let tw13: Complex = twiddles::compute_twiddle(13, 29, direction); let tw14: Complex = twiddles::compute_twiddle(14, 29, direction); let twiddle1re = unsafe { _mm_load1_ps(&tw1.re) }; let twiddle1im = unsafe { _mm_load1_ps(&tw1.im) }; let twiddle2re = unsafe { _mm_load1_ps(&tw2.re) }; let twiddle2im = unsafe { _mm_load1_ps(&tw2.im) }; let twiddle3re = unsafe { _mm_load1_ps(&tw3.re) }; let twiddle3im = unsafe { _mm_load1_ps(&tw3.im) }; let twiddle4re = unsafe { _mm_load1_ps(&tw4.re) }; let twiddle4im = unsafe { _mm_load1_ps(&tw4.im) }; 
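// As in the other f32 butterflies, _mm_load1_ps broadcasts each twiddle component into
// all four f32 lanes, so a single register supplies the same real (or imaginary) scalar
// to both of the interleaved complex values that perform_parallel_fft_direct processes
// per __m128.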
let twiddle5re = unsafe { _mm_load1_ps(&tw5.re) }; let twiddle5im = unsafe { _mm_load1_ps(&tw5.im) }; let twiddle6re = unsafe { _mm_load1_ps(&tw6.re) }; let twiddle6im = unsafe { _mm_load1_ps(&tw6.im) }; let twiddle7re = unsafe { _mm_load1_ps(&tw7.re) }; let twiddle7im = unsafe { _mm_load1_ps(&tw7.im) }; let twiddle8re = unsafe { _mm_load1_ps(&tw8.re) }; let twiddle8im = unsafe { _mm_load1_ps(&tw8.im) }; let twiddle9re = unsafe { _mm_load1_ps(&tw9.re) }; let twiddle9im = unsafe { _mm_load1_ps(&tw9.im) }; let twiddle10re = unsafe { _mm_load1_ps(&tw10.re) }; let twiddle10im = unsafe { _mm_load1_ps(&tw10.im) }; let twiddle11re = unsafe { _mm_load1_ps(&tw11.re) }; let twiddle11im = unsafe { _mm_load1_ps(&tw11.im) }; let twiddle12re = unsafe { _mm_load1_ps(&tw12.re) }; let twiddle12im = unsafe { _mm_load1_ps(&tw12.im) }; let twiddle13re = unsafe { _mm_load1_ps(&tw13.re) }; let twiddle13im = unsafe { _mm_load1_ps(&tw13.im) }; let twiddle14re = unsafe { _mm_load1_ps(&tw14.re) }; let twiddle14im = unsafe { _mm_load1_ps(&tw14.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, twiddle12re, twiddle12im, twiddle13re, twiddle13im, twiddle14re, twiddle14im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[14]), extract_hi_lo_f32(input_packed[0], input_packed[15]), extract_lo_hi_f32(input_packed[1], input_packed[15]), extract_hi_lo_f32(input_packed[1], input_packed[16]), extract_lo_hi_f32(input_packed[2], input_packed[16]), extract_hi_lo_f32(input_packed[2], input_packed[17]), extract_lo_hi_f32(input_packed[3], input_packed[17]), extract_hi_lo_f32(input_packed[3], input_packed[18]), extract_lo_hi_f32(input_packed[4], input_packed[18]), extract_hi_lo_f32(input_packed[4], input_packed[19]), extract_lo_hi_f32(input_packed[5], input_packed[19]), extract_hi_lo_f32(input_packed[5], input_packed[20]), extract_lo_hi_f32(input_packed[6], input_packed[20]), extract_hi_lo_f32(input_packed[6], input_packed[21]), extract_lo_hi_f32(input_packed[7], input_packed[21]), extract_hi_lo_f32(input_packed[7], input_packed[22]), extract_lo_hi_f32(input_packed[8], input_packed[22]), extract_hi_lo_f32(input_packed[8], input_packed[23]), extract_lo_hi_f32(input_packed[9], input_packed[23]), extract_hi_lo_f32(input_packed[9], input_packed[24]), extract_lo_hi_f32(input_packed[10], input_packed[24]), extract_hi_lo_f32(input_packed[10], input_packed[25]), extract_lo_hi_f32(input_packed[11], input_packed[25]), extract_hi_lo_f32(input_packed[11], input_packed[26]), extract_lo_hi_f32(input_packed[12], 
input_packed[26]), extract_hi_lo_f32(input_packed[12], input_packed[27]), extract_lo_hi_f32(input_packed[13], input_packed[27]), extract_hi_lo_f32(input_packed[13], input_packed[28]), extract_lo_hi_f32(input_packed[14], input_packed[28]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_lo_f32(out[16], out[17]), extract_lo_lo_f32(out[18], out[19]), extract_lo_lo_f32(out[20], out[21]), extract_lo_lo_f32(out[22], out[23]), extract_lo_lo_f32(out[24], out[25]), extract_lo_lo_f32(out[26], out[27]), extract_lo_hi_f32(out[28], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), extract_hi_hi_f32(out[19], out[20]), extract_hi_hi_f32(out[21], out[22]), extract_hi_hi_f32(out[23], out[24]), extract_hi_hi_f32(out[25], out[26]), extract_hi_hi_f32(out[27], out[28]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 29]) -> [__m128; 29] { let [x1p28, x1m28] = parallel_fft2_interleaved_f32(values[1], values[28]); let [x2p27, x2m27] = parallel_fft2_interleaved_f32(values[2], values[27]); let [x3p26, x3m26] = parallel_fft2_interleaved_f32(values[3], values[26]); let [x4p25, x4m25] = parallel_fft2_interleaved_f32(values[4], values[25]); let [x5p24, x5m24] = parallel_fft2_interleaved_f32(values[5], values[24]); let [x6p23, x6m23] = parallel_fft2_interleaved_f32(values[6], values[23]); let [x7p22, x7m22] = parallel_fft2_interleaved_f32(values[7], values[22]); let [x8p21, x8m21] = parallel_fft2_interleaved_f32(values[8], values[21]); let [x9p20, x9m20] = parallel_fft2_interleaved_f32(values[9], values[20]); let [x10p19, x10m19] = parallel_fft2_interleaved_f32(values[10], values[19]); let [x11p18, x11m18] = parallel_fft2_interleaved_f32(values[11], values[18]); let [x12p17, x12m17] = parallel_fft2_interleaved_f32(values[12], values[17]); let [x13p16, x13m16] = parallel_fft2_interleaved_f32(values[13], values[16]); let [x14p15, x14m15] = parallel_fft2_interleaved_f32(values[14], values[15]); let t_a1_1 = _mm_mul_ps(self.twiddle1re, x1p28); let t_a1_2 = _mm_mul_ps(self.twiddle2re, x2p27); let t_a1_3 = _mm_mul_ps(self.twiddle3re, x3p26); let t_a1_4 = _mm_mul_ps(self.twiddle4re, x4p25); let t_a1_5 = _mm_mul_ps(self.twiddle5re, x5p24); let t_a1_6 = _mm_mul_ps(self.twiddle6re, x6p23); let t_a1_7 = _mm_mul_ps(self.twiddle7re, x7p22); let t_a1_8 = _mm_mul_ps(self.twiddle8re, x8p21); let t_a1_9 = _mm_mul_ps(self.twiddle9re, x9p20); let t_a1_10 = _mm_mul_ps(self.twiddle10re, x10p19); let t_a1_11 = _mm_mul_ps(self.twiddle11re, x11p18); let t_a1_12 = _mm_mul_ps(self.twiddle12re, x12p17); let t_a1_13 = _mm_mul_ps(self.twiddle13re, x13p16); let t_a1_14 = _mm_mul_ps(self.twiddle14re, x14p15); let t_a2_1 = _mm_mul_ps(self.twiddle2re, x1p28); let t_a2_2 = _mm_mul_ps(self.twiddle4re, x2p27); let t_a2_3 = _mm_mul_ps(self.twiddle6re, 
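// Note (explanatory sketch, not generated code): the long t_a*/t_b* blocks below
// evaluate the prime-size DFT through the usual real/imaginary split. With
// w = exp(-2*pi*i/29) (forward direction) and the pairwise sums/differences
// x?p? = x[j] + x[29-j], x?m? = x[j] - x[29-j] computed above, each output pair is
//
//     y[k]    = x[0] + sum_{j=1..14} ( Re(w^(jk)) * (x[j] + x[29-j])
//                                    + i * Im(w^(jk)) * (x[j] - x[29-j]) )
//     y[29-k] = the same expression with the sign of the imaginary sum flipped.
//
// A scalar Rust sketch of the same computation for any odd length n (illustrative
// only; `prime_dft` is a hypothetical helper, not a RustFFT API):
//
//     use num_complex::Complex;
//     use std::f64::consts::PI;
//
//     fn prime_dft(x: &[Complex<f64>]) -> Vec<Complex<f64>> {
//         let n = x.len(); // assumed odd, e.g. 29 or 31
//         let h = n / 2;
//         let dc = x.iter().fold(Complex::new(0.0, 0.0), |acc, &v| acc + v);
//         let mut y = vec![dc; n]; // y[0] is the plain sum; the rest is overwritten
//         for k in 1..=h {
//             let mut t_a = x[0];
//             let mut t_b = Complex::new(0.0, 0.0);
//             for j in 1..=h {
//                 let angle = -2.0 * PI * ((j * k) % n) as f64 / n as f64;
//                 let w = Complex::from_polar(1.0, angle);
//                 t_a += (x[j] + x[n - j]) * w.re; // "t_a" terms: real parts of twiddles
//                 t_b += (x[j] - x[n - j]) * w.im; // "t_b" terms: imaginary parts
//             }
//             let t_b_rot = Complex::new(-t_b.im, t_b.re); // multiply by i ("rotate")
//             y[k] = t_a + t_b_rot;
//             y[n - k] = t_a - t_b_rot;
//         }
//         y
//     }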
x3p26); let t_a2_4 = _mm_mul_ps(self.twiddle8re, x4p25); let t_a2_5 = _mm_mul_ps(self.twiddle10re, x5p24); let t_a2_6 = _mm_mul_ps(self.twiddle12re, x6p23); let t_a2_7 = _mm_mul_ps(self.twiddle14re, x7p22); let t_a2_8 = _mm_mul_ps(self.twiddle13re, x8p21); let t_a2_9 = _mm_mul_ps(self.twiddle11re, x9p20); let t_a2_10 = _mm_mul_ps(self.twiddle9re, x10p19); let t_a2_11 = _mm_mul_ps(self.twiddle7re, x11p18); let t_a2_12 = _mm_mul_ps(self.twiddle5re, x12p17); let t_a2_13 = _mm_mul_ps(self.twiddle3re, x13p16); let t_a2_14 = _mm_mul_ps(self.twiddle1re, x14p15); let t_a3_1 = _mm_mul_ps(self.twiddle3re, x1p28); let t_a3_2 = _mm_mul_ps(self.twiddle6re, x2p27); let t_a3_3 = _mm_mul_ps(self.twiddle9re, x3p26); let t_a3_4 = _mm_mul_ps(self.twiddle12re, x4p25); let t_a3_5 = _mm_mul_ps(self.twiddle14re, x5p24); let t_a3_6 = _mm_mul_ps(self.twiddle11re, x6p23); let t_a3_7 = _mm_mul_ps(self.twiddle8re, x7p22); let t_a3_8 = _mm_mul_ps(self.twiddle5re, x8p21); let t_a3_9 = _mm_mul_ps(self.twiddle2re, x9p20); let t_a3_10 = _mm_mul_ps(self.twiddle1re, x10p19); let t_a3_11 = _mm_mul_ps(self.twiddle4re, x11p18); let t_a3_12 = _mm_mul_ps(self.twiddle7re, x12p17); let t_a3_13 = _mm_mul_ps(self.twiddle10re, x13p16); let t_a3_14 = _mm_mul_ps(self.twiddle13re, x14p15); let t_a4_1 = _mm_mul_ps(self.twiddle4re, x1p28); let t_a4_2 = _mm_mul_ps(self.twiddle8re, x2p27); let t_a4_3 = _mm_mul_ps(self.twiddle12re, x3p26); let t_a4_4 = _mm_mul_ps(self.twiddle13re, x4p25); let t_a4_5 = _mm_mul_ps(self.twiddle9re, x5p24); let t_a4_6 = _mm_mul_ps(self.twiddle5re, x6p23); let t_a4_7 = _mm_mul_ps(self.twiddle1re, x7p22); let t_a4_8 = _mm_mul_ps(self.twiddle3re, x8p21); let t_a4_9 = _mm_mul_ps(self.twiddle7re, x9p20); let t_a4_10 = _mm_mul_ps(self.twiddle11re, x10p19); let t_a4_11 = _mm_mul_ps(self.twiddle14re, x11p18); let t_a4_12 = _mm_mul_ps(self.twiddle10re, x12p17); let t_a4_13 = _mm_mul_ps(self.twiddle6re, x13p16); let t_a4_14 = _mm_mul_ps(self.twiddle2re, x14p15); let t_a5_1 = _mm_mul_ps(self.twiddle5re, x1p28); let t_a5_2 = _mm_mul_ps(self.twiddle10re, x2p27); let t_a5_3 = _mm_mul_ps(self.twiddle14re, x3p26); let t_a5_4 = _mm_mul_ps(self.twiddle9re, x4p25); let t_a5_5 = _mm_mul_ps(self.twiddle4re, x5p24); let t_a5_6 = _mm_mul_ps(self.twiddle1re, x6p23); let t_a5_7 = _mm_mul_ps(self.twiddle6re, x7p22); let t_a5_8 = _mm_mul_ps(self.twiddle11re, x8p21); let t_a5_9 = _mm_mul_ps(self.twiddle13re, x9p20); let t_a5_10 = _mm_mul_ps(self.twiddle8re, x10p19); let t_a5_11 = _mm_mul_ps(self.twiddle3re, x11p18); let t_a5_12 = _mm_mul_ps(self.twiddle2re, x12p17); let t_a5_13 = _mm_mul_ps(self.twiddle7re, x13p16); let t_a5_14 = _mm_mul_ps(self.twiddle12re, x14p15); let t_a6_1 = _mm_mul_ps(self.twiddle6re, x1p28); let t_a6_2 = _mm_mul_ps(self.twiddle12re, x2p27); let t_a6_3 = _mm_mul_ps(self.twiddle11re, x3p26); let t_a6_4 = _mm_mul_ps(self.twiddle5re, x4p25); let t_a6_5 = _mm_mul_ps(self.twiddle1re, x5p24); let t_a6_6 = _mm_mul_ps(self.twiddle7re, x6p23); let t_a6_7 = _mm_mul_ps(self.twiddle13re, x7p22); let t_a6_8 = _mm_mul_ps(self.twiddle10re, x8p21); let t_a6_9 = _mm_mul_ps(self.twiddle4re, x9p20); let t_a6_10 = _mm_mul_ps(self.twiddle2re, x10p19); let t_a6_11 = _mm_mul_ps(self.twiddle8re, x11p18); let t_a6_12 = _mm_mul_ps(self.twiddle14re, x12p17); let t_a6_13 = _mm_mul_ps(self.twiddle9re, x13p16); let t_a6_14 = _mm_mul_ps(self.twiddle3re, x14p15); let t_a7_1 = _mm_mul_ps(self.twiddle7re, x1p28); let t_a7_2 = _mm_mul_ps(self.twiddle14re, x2p27); let t_a7_3 = _mm_mul_ps(self.twiddle8re, x3p26); let t_a7_4 = _mm_mul_ps(self.twiddle1re, 
x4p25); let t_a7_5 = _mm_mul_ps(self.twiddle6re, x5p24); let t_a7_6 = _mm_mul_ps(self.twiddle13re, x6p23); let t_a7_7 = _mm_mul_ps(self.twiddle9re, x7p22); let t_a7_8 = _mm_mul_ps(self.twiddle2re, x8p21); let t_a7_9 = _mm_mul_ps(self.twiddle5re, x9p20); let t_a7_10 = _mm_mul_ps(self.twiddle12re, x10p19); let t_a7_11 = _mm_mul_ps(self.twiddle10re, x11p18); let t_a7_12 = _mm_mul_ps(self.twiddle3re, x12p17); let t_a7_13 = _mm_mul_ps(self.twiddle4re, x13p16); let t_a7_14 = _mm_mul_ps(self.twiddle11re, x14p15); let t_a8_1 = _mm_mul_ps(self.twiddle8re, x1p28); let t_a8_2 = _mm_mul_ps(self.twiddle13re, x2p27); let t_a8_3 = _mm_mul_ps(self.twiddle5re, x3p26); let t_a8_4 = _mm_mul_ps(self.twiddle3re, x4p25); let t_a8_5 = _mm_mul_ps(self.twiddle11re, x5p24); let t_a8_6 = _mm_mul_ps(self.twiddle10re, x6p23); let t_a8_7 = _mm_mul_ps(self.twiddle2re, x7p22); let t_a8_8 = _mm_mul_ps(self.twiddle6re, x8p21); let t_a8_9 = _mm_mul_ps(self.twiddle14re, x9p20); let t_a8_10 = _mm_mul_ps(self.twiddle7re, x10p19); let t_a8_11 = _mm_mul_ps(self.twiddle1re, x11p18); let t_a8_12 = _mm_mul_ps(self.twiddle9re, x12p17); let t_a8_13 = _mm_mul_ps(self.twiddle12re, x13p16); let t_a8_14 = _mm_mul_ps(self.twiddle4re, x14p15); let t_a9_1 = _mm_mul_ps(self.twiddle9re, x1p28); let t_a9_2 = _mm_mul_ps(self.twiddle11re, x2p27); let t_a9_3 = _mm_mul_ps(self.twiddle2re, x3p26); let t_a9_4 = _mm_mul_ps(self.twiddle7re, x4p25); let t_a9_5 = _mm_mul_ps(self.twiddle13re, x5p24); let t_a9_6 = _mm_mul_ps(self.twiddle4re, x6p23); let t_a9_7 = _mm_mul_ps(self.twiddle5re, x7p22); let t_a9_8 = _mm_mul_ps(self.twiddle14re, x8p21); let t_a9_9 = _mm_mul_ps(self.twiddle6re, x9p20); let t_a9_10 = _mm_mul_ps(self.twiddle3re, x10p19); let t_a9_11 = _mm_mul_ps(self.twiddle12re, x11p18); let t_a9_12 = _mm_mul_ps(self.twiddle8re, x12p17); let t_a9_13 = _mm_mul_ps(self.twiddle1re, x13p16); let t_a9_14 = _mm_mul_ps(self.twiddle10re, x14p15); let t_a10_1 = _mm_mul_ps(self.twiddle10re, x1p28); let t_a10_2 = _mm_mul_ps(self.twiddle9re, x2p27); let t_a10_3 = _mm_mul_ps(self.twiddle1re, x3p26); let t_a10_4 = _mm_mul_ps(self.twiddle11re, x4p25); let t_a10_5 = _mm_mul_ps(self.twiddle8re, x5p24); let t_a10_6 = _mm_mul_ps(self.twiddle2re, x6p23); let t_a10_7 = _mm_mul_ps(self.twiddle12re, x7p22); let t_a10_8 = _mm_mul_ps(self.twiddle7re, x8p21); let t_a10_9 = _mm_mul_ps(self.twiddle3re, x9p20); let t_a10_10 = _mm_mul_ps(self.twiddle13re, x10p19); let t_a10_11 = _mm_mul_ps(self.twiddle6re, x11p18); let t_a10_12 = _mm_mul_ps(self.twiddle4re, x12p17); let t_a10_13 = _mm_mul_ps(self.twiddle14re, x13p16); let t_a10_14 = _mm_mul_ps(self.twiddle5re, x14p15); let t_a11_1 = _mm_mul_ps(self.twiddle11re, x1p28); let t_a11_2 = _mm_mul_ps(self.twiddle7re, x2p27); let t_a11_3 = _mm_mul_ps(self.twiddle4re, x3p26); let t_a11_4 = _mm_mul_ps(self.twiddle14re, x4p25); let t_a11_5 = _mm_mul_ps(self.twiddle3re, x5p24); let t_a11_6 = _mm_mul_ps(self.twiddle8re, x6p23); let t_a11_7 = _mm_mul_ps(self.twiddle10re, x7p22); let t_a11_8 = _mm_mul_ps(self.twiddle1re, x8p21); let t_a11_9 = _mm_mul_ps(self.twiddle12re, x9p20); let t_a11_10 = _mm_mul_ps(self.twiddle6re, x10p19); let t_a11_11 = _mm_mul_ps(self.twiddle5re, x11p18); let t_a11_12 = _mm_mul_ps(self.twiddle13re, x12p17); let t_a11_13 = _mm_mul_ps(self.twiddle2re, x13p16); let t_a11_14 = _mm_mul_ps(self.twiddle9re, x14p15); let t_a12_1 = _mm_mul_ps(self.twiddle12re, x1p28); let t_a12_2 = _mm_mul_ps(self.twiddle5re, x2p27); let t_a12_3 = _mm_mul_ps(self.twiddle7re, x3p26); let t_a12_4 = _mm_mul_ps(self.twiddle10re, x4p25); let 
t_a12_5 = _mm_mul_ps(self.twiddle2re, x5p24); let t_a12_6 = _mm_mul_ps(self.twiddle14re, x6p23); let t_a12_7 = _mm_mul_ps(self.twiddle3re, x7p22); let t_a12_8 = _mm_mul_ps(self.twiddle9re, x8p21); let t_a12_9 = _mm_mul_ps(self.twiddle8re, x9p20); let t_a12_10 = _mm_mul_ps(self.twiddle4re, x10p19); let t_a12_11 = _mm_mul_ps(self.twiddle13re, x11p18); let t_a12_12 = _mm_mul_ps(self.twiddle1re, x12p17); let t_a12_13 = _mm_mul_ps(self.twiddle11re, x13p16); let t_a12_14 = _mm_mul_ps(self.twiddle6re, x14p15); let t_a13_1 = _mm_mul_ps(self.twiddle13re, x1p28); let t_a13_2 = _mm_mul_ps(self.twiddle3re, x2p27); let t_a13_3 = _mm_mul_ps(self.twiddle10re, x3p26); let t_a13_4 = _mm_mul_ps(self.twiddle6re, x4p25); let t_a13_5 = _mm_mul_ps(self.twiddle7re, x5p24); let t_a13_6 = _mm_mul_ps(self.twiddle9re, x6p23); let t_a13_7 = _mm_mul_ps(self.twiddle4re, x7p22); let t_a13_8 = _mm_mul_ps(self.twiddle12re, x8p21); let t_a13_9 = _mm_mul_ps(self.twiddle1re, x9p20); let t_a13_10 = _mm_mul_ps(self.twiddle14re, x10p19); let t_a13_11 = _mm_mul_ps(self.twiddle2re, x11p18); let t_a13_12 = _mm_mul_ps(self.twiddle11re, x12p17); let t_a13_13 = _mm_mul_ps(self.twiddle5re, x13p16); let t_a13_14 = _mm_mul_ps(self.twiddle8re, x14p15); let t_a14_1 = _mm_mul_ps(self.twiddle14re, x1p28); let t_a14_2 = _mm_mul_ps(self.twiddle1re, x2p27); let t_a14_3 = _mm_mul_ps(self.twiddle13re, x3p26); let t_a14_4 = _mm_mul_ps(self.twiddle2re, x4p25); let t_a14_5 = _mm_mul_ps(self.twiddle12re, x5p24); let t_a14_6 = _mm_mul_ps(self.twiddle3re, x6p23); let t_a14_7 = _mm_mul_ps(self.twiddle11re, x7p22); let t_a14_8 = _mm_mul_ps(self.twiddle4re, x8p21); let t_a14_9 = _mm_mul_ps(self.twiddle10re, x9p20); let t_a14_10 = _mm_mul_ps(self.twiddle5re, x10p19); let t_a14_11 = _mm_mul_ps(self.twiddle9re, x11p18); let t_a14_12 = _mm_mul_ps(self.twiddle6re, x12p17); let t_a14_13 = _mm_mul_ps(self.twiddle8re, x13p16); let t_a14_14 = _mm_mul_ps(self.twiddle7re, x14p15); let t_b1_1 = _mm_mul_ps(self.twiddle1im, x1m28); let t_b1_2 = _mm_mul_ps(self.twiddle2im, x2m27); let t_b1_3 = _mm_mul_ps(self.twiddle3im, x3m26); let t_b1_4 = _mm_mul_ps(self.twiddle4im, x4m25); let t_b1_5 = _mm_mul_ps(self.twiddle5im, x5m24); let t_b1_6 = _mm_mul_ps(self.twiddle6im, x6m23); let t_b1_7 = _mm_mul_ps(self.twiddle7im, x7m22); let t_b1_8 = _mm_mul_ps(self.twiddle8im, x8m21); let t_b1_9 = _mm_mul_ps(self.twiddle9im, x9m20); let t_b1_10 = _mm_mul_ps(self.twiddle10im, x10m19); let t_b1_11 = _mm_mul_ps(self.twiddle11im, x11m18); let t_b1_12 = _mm_mul_ps(self.twiddle12im, x12m17); let t_b1_13 = _mm_mul_ps(self.twiddle13im, x13m16); let t_b1_14 = _mm_mul_ps(self.twiddle14im, x14m15); let t_b2_1 = _mm_mul_ps(self.twiddle2im, x1m28); let t_b2_2 = _mm_mul_ps(self.twiddle4im, x2m27); let t_b2_3 = _mm_mul_ps(self.twiddle6im, x3m26); let t_b2_4 = _mm_mul_ps(self.twiddle8im, x4m25); let t_b2_5 = _mm_mul_ps(self.twiddle10im, x5m24); let t_b2_6 = _mm_mul_ps(self.twiddle12im, x6m23); let t_b2_7 = _mm_mul_ps(self.twiddle14im, x7m22); let t_b2_8 = _mm_mul_ps(self.twiddle13im, x8m21); let t_b2_9 = _mm_mul_ps(self.twiddle11im, x9m20); let t_b2_10 = _mm_mul_ps(self.twiddle9im, x10m19); let t_b2_11 = _mm_mul_ps(self.twiddle7im, x11m18); let t_b2_12 = _mm_mul_ps(self.twiddle5im, x12m17); let t_b2_13 = _mm_mul_ps(self.twiddle3im, x13m16); let t_b2_14 = _mm_mul_ps(self.twiddle1im, x14m15); let t_b3_1 = _mm_mul_ps(self.twiddle3im, x1m28); let t_b3_2 = _mm_mul_ps(self.twiddle6im, x2m27); let t_b3_3 = _mm_mul_ps(self.twiddle9im, x3m26); let t_b3_4 = _mm_mul_ps(self.twiddle12im, x4m25); let t_b3_5 = 
_mm_mul_ps(self.twiddle14im, x5m24); let t_b3_6 = _mm_mul_ps(self.twiddle11im, x6m23); let t_b3_7 = _mm_mul_ps(self.twiddle8im, x7m22); let t_b3_8 = _mm_mul_ps(self.twiddle5im, x8m21); let t_b3_9 = _mm_mul_ps(self.twiddle2im, x9m20); let t_b3_10 = _mm_mul_ps(self.twiddle1im, x10m19); let t_b3_11 = _mm_mul_ps(self.twiddle4im, x11m18); let t_b3_12 = _mm_mul_ps(self.twiddle7im, x12m17); let t_b3_13 = _mm_mul_ps(self.twiddle10im, x13m16); let t_b3_14 = _mm_mul_ps(self.twiddle13im, x14m15); let t_b4_1 = _mm_mul_ps(self.twiddle4im, x1m28); let t_b4_2 = _mm_mul_ps(self.twiddle8im, x2m27); let t_b4_3 = _mm_mul_ps(self.twiddle12im, x3m26); let t_b4_4 = _mm_mul_ps(self.twiddle13im, x4m25); let t_b4_5 = _mm_mul_ps(self.twiddle9im, x5m24); let t_b4_6 = _mm_mul_ps(self.twiddle5im, x6m23); let t_b4_7 = _mm_mul_ps(self.twiddle1im, x7m22); let t_b4_8 = _mm_mul_ps(self.twiddle3im, x8m21); let t_b4_9 = _mm_mul_ps(self.twiddle7im, x9m20); let t_b4_10 = _mm_mul_ps(self.twiddle11im, x10m19); let t_b4_11 = _mm_mul_ps(self.twiddle14im, x11m18); let t_b4_12 = _mm_mul_ps(self.twiddle10im, x12m17); let t_b4_13 = _mm_mul_ps(self.twiddle6im, x13m16); let t_b4_14 = _mm_mul_ps(self.twiddle2im, x14m15); let t_b5_1 = _mm_mul_ps(self.twiddle5im, x1m28); let t_b5_2 = _mm_mul_ps(self.twiddle10im, x2m27); let t_b5_3 = _mm_mul_ps(self.twiddle14im, x3m26); let t_b5_4 = _mm_mul_ps(self.twiddle9im, x4m25); let t_b5_5 = _mm_mul_ps(self.twiddle4im, x5m24); let t_b5_6 = _mm_mul_ps(self.twiddle1im, x6m23); let t_b5_7 = _mm_mul_ps(self.twiddle6im, x7m22); let t_b5_8 = _mm_mul_ps(self.twiddle11im, x8m21); let t_b5_9 = _mm_mul_ps(self.twiddle13im, x9m20); let t_b5_10 = _mm_mul_ps(self.twiddle8im, x10m19); let t_b5_11 = _mm_mul_ps(self.twiddle3im, x11m18); let t_b5_12 = _mm_mul_ps(self.twiddle2im, x12m17); let t_b5_13 = _mm_mul_ps(self.twiddle7im, x13m16); let t_b5_14 = _mm_mul_ps(self.twiddle12im, x14m15); let t_b6_1 = _mm_mul_ps(self.twiddle6im, x1m28); let t_b6_2 = _mm_mul_ps(self.twiddle12im, x2m27); let t_b6_3 = _mm_mul_ps(self.twiddle11im, x3m26); let t_b6_4 = _mm_mul_ps(self.twiddle5im, x4m25); let t_b6_5 = _mm_mul_ps(self.twiddle1im, x5m24); let t_b6_6 = _mm_mul_ps(self.twiddle7im, x6m23); let t_b6_7 = _mm_mul_ps(self.twiddle13im, x7m22); let t_b6_8 = _mm_mul_ps(self.twiddle10im, x8m21); let t_b6_9 = _mm_mul_ps(self.twiddle4im, x9m20); let t_b6_10 = _mm_mul_ps(self.twiddle2im, x10m19); let t_b6_11 = _mm_mul_ps(self.twiddle8im, x11m18); let t_b6_12 = _mm_mul_ps(self.twiddle14im, x12m17); let t_b6_13 = _mm_mul_ps(self.twiddle9im, x13m16); let t_b6_14 = _mm_mul_ps(self.twiddle3im, x14m15); let t_b7_1 = _mm_mul_ps(self.twiddle7im, x1m28); let t_b7_2 = _mm_mul_ps(self.twiddle14im, x2m27); let t_b7_3 = _mm_mul_ps(self.twiddle8im, x3m26); let t_b7_4 = _mm_mul_ps(self.twiddle1im, x4m25); let t_b7_5 = _mm_mul_ps(self.twiddle6im, x5m24); let t_b7_6 = _mm_mul_ps(self.twiddle13im, x6m23); let t_b7_7 = _mm_mul_ps(self.twiddle9im, x7m22); let t_b7_8 = _mm_mul_ps(self.twiddle2im, x8m21); let t_b7_9 = _mm_mul_ps(self.twiddle5im, x9m20); let t_b7_10 = _mm_mul_ps(self.twiddle12im, x10m19); let t_b7_11 = _mm_mul_ps(self.twiddle10im, x11m18); let t_b7_12 = _mm_mul_ps(self.twiddle3im, x12m17); let t_b7_13 = _mm_mul_ps(self.twiddle4im, x13m16); let t_b7_14 = _mm_mul_ps(self.twiddle11im, x14m15); let t_b8_1 = _mm_mul_ps(self.twiddle8im, x1m28); let t_b8_2 = _mm_mul_ps(self.twiddle13im, x2m27); let t_b8_3 = _mm_mul_ps(self.twiddle5im, x3m26); let t_b8_4 = _mm_mul_ps(self.twiddle3im, x4m25); let t_b8_5 = _mm_mul_ps(self.twiddle11im, x5m24); let t_b8_6 = 
_mm_mul_ps(self.twiddle10im, x6m23); let t_b8_7 = _mm_mul_ps(self.twiddle2im, x7m22); let t_b8_8 = _mm_mul_ps(self.twiddle6im, x8m21); let t_b8_9 = _mm_mul_ps(self.twiddle14im, x9m20); let t_b8_10 = _mm_mul_ps(self.twiddle7im, x10m19); let t_b8_11 = _mm_mul_ps(self.twiddle1im, x11m18); let t_b8_12 = _mm_mul_ps(self.twiddle9im, x12m17); let t_b8_13 = _mm_mul_ps(self.twiddle12im, x13m16); let t_b8_14 = _mm_mul_ps(self.twiddle4im, x14m15); let t_b9_1 = _mm_mul_ps(self.twiddle9im, x1m28); let t_b9_2 = _mm_mul_ps(self.twiddle11im, x2m27); let t_b9_3 = _mm_mul_ps(self.twiddle2im, x3m26); let t_b9_4 = _mm_mul_ps(self.twiddle7im, x4m25); let t_b9_5 = _mm_mul_ps(self.twiddle13im, x5m24); let t_b9_6 = _mm_mul_ps(self.twiddle4im, x6m23); let t_b9_7 = _mm_mul_ps(self.twiddle5im, x7m22); let t_b9_8 = _mm_mul_ps(self.twiddle14im, x8m21); let t_b9_9 = _mm_mul_ps(self.twiddle6im, x9m20); let t_b9_10 = _mm_mul_ps(self.twiddle3im, x10m19); let t_b9_11 = _mm_mul_ps(self.twiddle12im, x11m18); let t_b9_12 = _mm_mul_ps(self.twiddle8im, x12m17); let t_b9_13 = _mm_mul_ps(self.twiddle1im, x13m16); let t_b9_14 = _mm_mul_ps(self.twiddle10im, x14m15); let t_b10_1 = _mm_mul_ps(self.twiddle10im, x1m28); let t_b10_2 = _mm_mul_ps(self.twiddle9im, x2m27); let t_b10_3 = _mm_mul_ps(self.twiddle1im, x3m26); let t_b10_4 = _mm_mul_ps(self.twiddle11im, x4m25); let t_b10_5 = _mm_mul_ps(self.twiddle8im, x5m24); let t_b10_6 = _mm_mul_ps(self.twiddle2im, x6m23); let t_b10_7 = _mm_mul_ps(self.twiddle12im, x7m22); let t_b10_8 = _mm_mul_ps(self.twiddle7im, x8m21); let t_b10_9 = _mm_mul_ps(self.twiddle3im, x9m20); let t_b10_10 = _mm_mul_ps(self.twiddle13im, x10m19); let t_b10_11 = _mm_mul_ps(self.twiddle6im, x11m18); let t_b10_12 = _mm_mul_ps(self.twiddle4im, x12m17); let t_b10_13 = _mm_mul_ps(self.twiddle14im, x13m16); let t_b10_14 = _mm_mul_ps(self.twiddle5im, x14m15); let t_b11_1 = _mm_mul_ps(self.twiddle11im, x1m28); let t_b11_2 = _mm_mul_ps(self.twiddle7im, x2m27); let t_b11_3 = _mm_mul_ps(self.twiddle4im, x3m26); let t_b11_4 = _mm_mul_ps(self.twiddle14im, x4m25); let t_b11_5 = _mm_mul_ps(self.twiddle3im, x5m24); let t_b11_6 = _mm_mul_ps(self.twiddle8im, x6m23); let t_b11_7 = _mm_mul_ps(self.twiddle10im, x7m22); let t_b11_8 = _mm_mul_ps(self.twiddle1im, x8m21); let t_b11_9 = _mm_mul_ps(self.twiddle12im, x9m20); let t_b11_10 = _mm_mul_ps(self.twiddle6im, x10m19); let t_b11_11 = _mm_mul_ps(self.twiddle5im, x11m18); let t_b11_12 = _mm_mul_ps(self.twiddle13im, x12m17); let t_b11_13 = _mm_mul_ps(self.twiddle2im, x13m16); let t_b11_14 = _mm_mul_ps(self.twiddle9im, x14m15); let t_b12_1 = _mm_mul_ps(self.twiddle12im, x1m28); let t_b12_2 = _mm_mul_ps(self.twiddle5im, x2m27); let t_b12_3 = _mm_mul_ps(self.twiddle7im, x3m26); let t_b12_4 = _mm_mul_ps(self.twiddle10im, x4m25); let t_b12_5 = _mm_mul_ps(self.twiddle2im, x5m24); let t_b12_6 = _mm_mul_ps(self.twiddle14im, x6m23); let t_b12_7 = _mm_mul_ps(self.twiddle3im, x7m22); let t_b12_8 = _mm_mul_ps(self.twiddle9im, x8m21); let t_b12_9 = _mm_mul_ps(self.twiddle8im, x9m20); let t_b12_10 = _mm_mul_ps(self.twiddle4im, x10m19); let t_b12_11 = _mm_mul_ps(self.twiddle13im, x11m18); let t_b12_12 = _mm_mul_ps(self.twiddle1im, x12m17); let t_b12_13 = _mm_mul_ps(self.twiddle11im, x13m16); let t_b12_14 = _mm_mul_ps(self.twiddle6im, x14m15); let t_b13_1 = _mm_mul_ps(self.twiddle13im, x1m28); let t_b13_2 = _mm_mul_ps(self.twiddle3im, x2m27); let t_b13_3 = _mm_mul_ps(self.twiddle10im, x3m26); let t_b13_4 = _mm_mul_ps(self.twiddle6im, x4m25); let t_b13_5 = _mm_mul_ps(self.twiddle7im, x5m24); let t_b13_6 = 
_mm_mul_ps(self.twiddle9im, x6m23); let t_b13_7 = _mm_mul_ps(self.twiddle4im, x7m22); let t_b13_8 = _mm_mul_ps(self.twiddle12im, x8m21); let t_b13_9 = _mm_mul_ps(self.twiddle1im, x9m20); let t_b13_10 = _mm_mul_ps(self.twiddle14im, x10m19); let t_b13_11 = _mm_mul_ps(self.twiddle2im, x11m18); let t_b13_12 = _mm_mul_ps(self.twiddle11im, x12m17); let t_b13_13 = _mm_mul_ps(self.twiddle5im, x13m16); let t_b13_14 = _mm_mul_ps(self.twiddle8im, x14m15); let t_b14_1 = _mm_mul_ps(self.twiddle14im, x1m28); let t_b14_2 = _mm_mul_ps(self.twiddle1im, x2m27); let t_b14_3 = _mm_mul_ps(self.twiddle13im, x3m26); let t_b14_4 = _mm_mul_ps(self.twiddle2im, x4m25); let t_b14_5 = _mm_mul_ps(self.twiddle12im, x5m24); let t_b14_6 = _mm_mul_ps(self.twiddle3im, x6m23); let t_b14_7 = _mm_mul_ps(self.twiddle11im, x7m22); let t_b14_8 = _mm_mul_ps(self.twiddle4im, x8m21); let t_b14_9 = _mm_mul_ps(self.twiddle10im, x9m20); let t_b14_10 = _mm_mul_ps(self.twiddle5im, x10m19); let t_b14_11 = _mm_mul_ps(self.twiddle9im, x11m18); let t_b14_12 = _mm_mul_ps(self.twiddle6im, x12m17); let t_b14_13 = _mm_mul_ps(self.twiddle8im, x13m16); let t_b14_14 = _mm_mul_ps(self.twiddle7im, x14m15); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14); let t_a7 = calc_f32!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14); let t_a8 = calc_f32!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14); let t_a9 = calc_f32!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14); let t_a10 = calc_f32!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14); let t_a11 = calc_f32!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14); let t_a12 = calc_f32!(x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14); let t_a13 = calc_f32!(x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14); let t_a14 = calc_f32!(x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + 
t_b1_9 + t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 + t_b3_3 + t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 + t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14); let t_b4 = calc_f32!(t_b4_1 + t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + t_b4_9 + t_b4_10 - t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14); let t_b5 = calc_f32!(t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6 + t_b5_7 + t_b5_8 - t_b5_9 - t_b5_10 - t_b5_11 + t_b5_12 + t_b5_13 + t_b5_14); let t_b6 = calc_f32!(t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 + t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 + t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14); let t_b7 = calc_f32!(t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 - t_b7_11 - t_b7_12 + t_b7_13 + t_b7_14); let t_b8 = calc_f32!(t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 + t_b8_11 + t_b8_12 - t_b8_13 - t_b8_14); let t_b9 = calc_f32!(t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 - t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 + t_b9_10 + t_b9_11 - t_b9_12 + t_b9_13 + t_b9_14); let t_b10 = calc_f32!(t_b10_1 - t_b10_2 + t_b10_3 + t_b10_4 - t_b10_5 + t_b10_6 + t_b10_7 - t_b10_8 + t_b10_9 + t_b10_10 - t_b10_11 + t_b10_12 + t_b10_13 - t_b10_14); let t_b11 = calc_f32!(t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 - t_b11_5 + t_b11_6 - t_b11_7 + t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11 - t_b11_12 - t_b11_13 + t_b11_14); let t_b12 = calc_f32!(t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 + t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 - t_b12_9 + t_b12_10 - t_b12_11 - t_b12_12 + t_b12_13 - t_b12_14); let t_b13 = calc_f32!(t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 + t_b13_7 - t_b13_8 + t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 - t_b13_13 + t_b13_14); let t_b14 = calc_f32!(t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 + t_b14_11 - t_b14_12 + t_b14_13 - t_b14_14); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let t_b9_rot = self.rotate.rotate_both(t_b9); let t_b10_rot = self.rotate.rotate_both(t_b10); let t_b11_rot = self.rotate.rotate_both(t_b11); let t_b12_rot = self.rotate.rotate_both(t_b12); let t_b13_rot = self.rotate.rotate_both(t_b13); let t_b14_rot = self.rotate.rotate_both(t_b14); let y0 = calc_f32!(x0 + x1p28 + x2p27 + x3p26 + x4p25 + x5p24 + x6p23 + x7p22 + x8p21 + x9p20 + x10p19 + x11p18 + x12p17 + x13p16 + x14p15); let [y1, y28] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y27] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y26] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y25] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y24] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y23] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); let [y7, y22] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot); let [y8, y21] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot); let [y9, y20] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot); let [y10, y19] = parallel_fft2_interleaved_f32(t_a10, t_b10_rot); let [y11, y18] = 
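// Note (explanatory comment, not generated code): `rotate_both` multiplies both packed
// complex values by +/-i (it swaps the real and imaginary lanes and negates one of
// them), so each `t_b*_rot` is the imaginary-axis rotation of its `t_b*` sum. The
// 2-point butterflies here then produce the mirrored output pair in one step:
//
//     [ y[k], y[29 - k] ] = [ t_a_k + i * t_b_k, t_a_k - i * t_b_k ]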
parallel_fft2_interleaved_f32(t_a11, t_b11_rot); let [y12, y17] = parallel_fft2_interleaved_f32(t_a12, t_b12_rot); let [y13, y16] = parallel_fft2_interleaved_f32(t_a13, t_b13_rot); let [y14, y15] = parallel_fft2_interleaved_f32(t_a14, t_b14_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, y23, y24, y25, y26, y27, y28] } } // ____ ___ __ _ _ _ _ _ // |___ \ / _ \ / /_ | || | | |__ (_) |_ // __) | (_) | _____ | '_ \| || |_| '_ \| | __| // / __/ \__, | |_____| | (_) |__ _| |_) | | |_ // |_____| /_/ \___/ |_| |_.__/|_|\__| // pub struct SseF64Butterfly29 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: __m128d, twiddle1im: __m128d, twiddle2re: __m128d, twiddle2im: __m128d, twiddle3re: __m128d, twiddle3im: __m128d, twiddle4re: __m128d, twiddle4im: __m128d, twiddle5re: __m128d, twiddle5im: __m128d, twiddle6re: __m128d, twiddle6im: __m128d, twiddle7re: __m128d, twiddle7im: __m128d, twiddle8re: __m128d, twiddle8im: __m128d, twiddle9re: __m128d, twiddle9im: __m128d, twiddle10re: __m128d, twiddle10im: __m128d, twiddle11re: __m128d, twiddle11im: __m128d, twiddle12re: __m128d, twiddle12im: __m128d, twiddle13re: __m128d, twiddle13im: __m128d, twiddle14re: __m128d, twiddle14im: __m128d, } boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly29, 29, |this: &SseF64Butterfly29<_>| this .direction); boilerplate_fft_sse_common_butterfly!(SseF64Butterfly29, 29, |this: &SseF64Butterfly29<_>| this .direction); impl SseF64Butterfly29 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 29, direction); let tw2: Complex = twiddles::compute_twiddle(2, 29, direction); let tw3: Complex = twiddles::compute_twiddle(3, 29, direction); let tw4: Complex = twiddles::compute_twiddle(4, 29, direction); let tw5: Complex = twiddles::compute_twiddle(5, 29, direction); let tw6: Complex = twiddles::compute_twiddle(6, 29, direction); let tw7: Complex = twiddles::compute_twiddle(7, 29, direction); let tw8: Complex = twiddles::compute_twiddle(8, 29, direction); let tw9: Complex = twiddles::compute_twiddle(9, 29, direction); let tw10: Complex = twiddles::compute_twiddle(10, 29, direction); let tw11: Complex = twiddles::compute_twiddle(11, 29, direction); let tw12: Complex = twiddles::compute_twiddle(12, 29, direction); let tw13: Complex = twiddles::compute_twiddle(13, 29, direction); let tw14: Complex = twiddles::compute_twiddle(14, 29, direction); let twiddle1re = unsafe { _mm_set_pd(tw1.re, tw1.re) }; let twiddle1im = unsafe { _mm_set_pd(tw1.im, tw1.im) }; let twiddle2re = unsafe { _mm_set_pd(tw2.re, tw2.re) }; let twiddle2im = unsafe { _mm_set_pd(tw2.im, tw2.im) }; let twiddle3re = unsafe { _mm_set_pd(tw3.re, tw3.re) }; let twiddle3im = unsafe { _mm_set_pd(tw3.im, tw3.im) }; let twiddle4re = unsafe { _mm_set_pd(tw4.re, tw4.re) }; let twiddle4im = unsafe { _mm_set_pd(tw4.im, tw4.im) }; let twiddle5re = unsafe { _mm_set_pd(tw5.re, tw5.re) }; let twiddle5im = unsafe { _mm_set_pd(tw5.im, tw5.im) }; let twiddle6re = unsafe { _mm_set_pd(tw6.re, tw6.re) }; let twiddle6im = unsafe { _mm_set_pd(tw6.im, tw6.im) }; let twiddle7re = unsafe { _mm_set_pd(tw7.re, tw7.re) }; let twiddle7im = unsafe { _mm_set_pd(tw7.im, tw7.im) }; let twiddle8re = unsafe { _mm_set_pd(tw8.re, tw8.re) }; let twiddle8im = unsafe { _mm_set_pd(tw8.im, tw8.im) }; let twiddle9re = unsafe { _mm_set_pd(tw9.re, tw9.re) }; let twiddle9im = unsafe { 
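// Note (explanatory comment, not generated code): the f64 butterfly mirrors the f32
// version above, but an __m128d only holds one Complex<f64>, so `perform_fft_direct`
// processes a single transform at a time (hence solo_fft2_f64) instead of two packed
// ones. The `_mm_set_pd(x, x)` calls here broadcast each twiddle component into both
// f64 lanes; a minimal equivalent sketch, assuming core::arch::x86_64:
//
//     use core::arch::x86_64::{__m128d, _mm_set1_pd};
//
//     // Hypothetical helper: _mm_set1_pd(x) produces the same register as _mm_set_pd(x, x).
//     unsafe fn splat_f64(x: f64) -> __m128d {
//         _mm_set1_pd(x)
//     }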
_mm_set_pd(tw9.im, tw9.im) }; let twiddle10re = unsafe { _mm_set_pd(tw10.re, tw10.re) }; let twiddle10im = unsafe { _mm_set_pd(tw10.im, tw10.im) }; let twiddle11re = unsafe { _mm_set_pd(tw11.re, tw11.re) }; let twiddle11im = unsafe { _mm_set_pd(tw11.im, tw11.im) }; let twiddle12re = unsafe { _mm_set_pd(tw12.re, tw12.re) }; let twiddle12im = unsafe { _mm_set_pd(tw12.im, tw12.im) }; let twiddle13re = unsafe { _mm_set_pd(tw13.re, tw13.re) }; let twiddle13im = unsafe { _mm_set_pd(tw13.im, tw13.im) }; let twiddle14re = unsafe { _mm_set_pd(tw14.re, tw14.re) }; let twiddle14im = unsafe { _mm_set_pd(tw14.im, tw14.im) }; Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, twiddle12re, twiddle12im, twiddle13re, twiddle13im, twiddle14re, twiddle14im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 29]) -> [__m128d; 29] { let [x1p28, x1m28] = solo_fft2_f64(values[1], values[28]); let [x2p27, x2m27] = solo_fft2_f64(values[2], values[27]); let [x3p26, x3m26] = solo_fft2_f64(values[3], values[26]); let [x4p25, x4m25] = solo_fft2_f64(values[4], values[25]); let [x5p24, x5m24] = solo_fft2_f64(values[5], values[24]); let [x6p23, x6m23] = solo_fft2_f64(values[6], values[23]); let [x7p22, x7m22] = solo_fft2_f64(values[7], values[22]); let [x8p21, x8m21] = solo_fft2_f64(values[8], values[21]); let [x9p20, x9m20] = solo_fft2_f64(values[9], values[20]); let [x10p19, x10m19] = solo_fft2_f64(values[10], values[19]); let [x11p18, x11m18] = solo_fft2_f64(values[11], values[18]); let [x12p17, x12m17] = solo_fft2_f64(values[12], values[17]); let [x13p16, x13m16] = solo_fft2_f64(values[13], values[16]); let [x14p15, x14m15] = solo_fft2_f64(values[14], values[15]); let t_a1_1 = _mm_mul_pd(self.twiddle1re, x1p28); let t_a1_2 = _mm_mul_pd(self.twiddle2re, x2p27); let t_a1_3 = _mm_mul_pd(self.twiddle3re, x3p26); let t_a1_4 = _mm_mul_pd(self.twiddle4re, x4p25); let t_a1_5 = _mm_mul_pd(self.twiddle5re, x5p24); let t_a1_6 = _mm_mul_pd(self.twiddle6re, x6p23); let t_a1_7 = _mm_mul_pd(self.twiddle7re, x7p22); let t_a1_8 = _mm_mul_pd(self.twiddle8re, x8p21); let t_a1_9 = _mm_mul_pd(self.twiddle9re, x9p20); let t_a1_10 = _mm_mul_pd(self.twiddle10re, x10p19); let t_a1_11 = _mm_mul_pd(self.twiddle11re, x11p18); let t_a1_12 = _mm_mul_pd(self.twiddle12re, x12p17); let t_a1_13 = _mm_mul_pd(self.twiddle13re, x13p16); let t_a1_14 = _mm_mul_pd(self.twiddle14re, x14p15); let t_a2_1 = _mm_mul_pd(self.twiddle2re, x1p28); let t_a2_2 = _mm_mul_pd(self.twiddle4re, x2p27); let t_a2_3 = _mm_mul_pd(self.twiddle6re, x3p26); let t_a2_4 = _mm_mul_pd(self.twiddle8re, x4p25); let t_a2_5 = _mm_mul_pd(self.twiddle10re, x5p24); let t_a2_6 = _mm_mul_pd(self.twiddle12re, x6p23); let t_a2_7 = _mm_mul_pd(self.twiddle14re, x7p22); let t_a2_8 = _mm_mul_pd(self.twiddle13re, x8p21); let t_a2_9 = 
_mm_mul_pd(self.twiddle11re, x9p20); let t_a2_10 = _mm_mul_pd(self.twiddle9re, x10p19); let t_a2_11 = _mm_mul_pd(self.twiddle7re, x11p18); let t_a2_12 = _mm_mul_pd(self.twiddle5re, x12p17); let t_a2_13 = _mm_mul_pd(self.twiddle3re, x13p16); let t_a2_14 = _mm_mul_pd(self.twiddle1re, x14p15); let t_a3_1 = _mm_mul_pd(self.twiddle3re, x1p28); let t_a3_2 = _mm_mul_pd(self.twiddle6re, x2p27); let t_a3_3 = _mm_mul_pd(self.twiddle9re, x3p26); let t_a3_4 = _mm_mul_pd(self.twiddle12re, x4p25); let t_a3_5 = _mm_mul_pd(self.twiddle14re, x5p24); let t_a3_6 = _mm_mul_pd(self.twiddle11re, x6p23); let t_a3_7 = _mm_mul_pd(self.twiddle8re, x7p22); let t_a3_8 = _mm_mul_pd(self.twiddle5re, x8p21); let t_a3_9 = _mm_mul_pd(self.twiddle2re, x9p20); let t_a3_10 = _mm_mul_pd(self.twiddle1re, x10p19); let t_a3_11 = _mm_mul_pd(self.twiddle4re, x11p18); let t_a3_12 = _mm_mul_pd(self.twiddle7re, x12p17); let t_a3_13 = _mm_mul_pd(self.twiddle10re, x13p16); let t_a3_14 = _mm_mul_pd(self.twiddle13re, x14p15); let t_a4_1 = _mm_mul_pd(self.twiddle4re, x1p28); let t_a4_2 = _mm_mul_pd(self.twiddle8re, x2p27); let t_a4_3 = _mm_mul_pd(self.twiddle12re, x3p26); let t_a4_4 = _mm_mul_pd(self.twiddle13re, x4p25); let t_a4_5 = _mm_mul_pd(self.twiddle9re, x5p24); let t_a4_6 = _mm_mul_pd(self.twiddle5re, x6p23); let t_a4_7 = _mm_mul_pd(self.twiddle1re, x7p22); let t_a4_8 = _mm_mul_pd(self.twiddle3re, x8p21); let t_a4_9 = _mm_mul_pd(self.twiddle7re, x9p20); let t_a4_10 = _mm_mul_pd(self.twiddle11re, x10p19); let t_a4_11 = _mm_mul_pd(self.twiddle14re, x11p18); let t_a4_12 = _mm_mul_pd(self.twiddle10re, x12p17); let t_a4_13 = _mm_mul_pd(self.twiddle6re, x13p16); let t_a4_14 = _mm_mul_pd(self.twiddle2re, x14p15); let t_a5_1 = _mm_mul_pd(self.twiddle5re, x1p28); let t_a5_2 = _mm_mul_pd(self.twiddle10re, x2p27); let t_a5_3 = _mm_mul_pd(self.twiddle14re, x3p26); let t_a5_4 = _mm_mul_pd(self.twiddle9re, x4p25); let t_a5_5 = _mm_mul_pd(self.twiddle4re, x5p24); let t_a5_6 = _mm_mul_pd(self.twiddle1re, x6p23); let t_a5_7 = _mm_mul_pd(self.twiddle6re, x7p22); let t_a5_8 = _mm_mul_pd(self.twiddle11re, x8p21); let t_a5_9 = _mm_mul_pd(self.twiddle13re, x9p20); let t_a5_10 = _mm_mul_pd(self.twiddle8re, x10p19); let t_a5_11 = _mm_mul_pd(self.twiddle3re, x11p18); let t_a5_12 = _mm_mul_pd(self.twiddle2re, x12p17); let t_a5_13 = _mm_mul_pd(self.twiddle7re, x13p16); let t_a5_14 = _mm_mul_pd(self.twiddle12re, x14p15); let t_a6_1 = _mm_mul_pd(self.twiddle6re, x1p28); let t_a6_2 = _mm_mul_pd(self.twiddle12re, x2p27); let t_a6_3 = _mm_mul_pd(self.twiddle11re, x3p26); let t_a6_4 = _mm_mul_pd(self.twiddle5re, x4p25); let t_a6_5 = _mm_mul_pd(self.twiddle1re, x5p24); let t_a6_6 = _mm_mul_pd(self.twiddle7re, x6p23); let t_a6_7 = _mm_mul_pd(self.twiddle13re, x7p22); let t_a6_8 = _mm_mul_pd(self.twiddle10re, x8p21); let t_a6_9 = _mm_mul_pd(self.twiddle4re, x9p20); let t_a6_10 = _mm_mul_pd(self.twiddle2re, x10p19); let t_a6_11 = _mm_mul_pd(self.twiddle8re, x11p18); let t_a6_12 = _mm_mul_pd(self.twiddle14re, x12p17); let t_a6_13 = _mm_mul_pd(self.twiddle9re, x13p16); let t_a6_14 = _mm_mul_pd(self.twiddle3re, x14p15); let t_a7_1 = _mm_mul_pd(self.twiddle7re, x1p28); let t_a7_2 = _mm_mul_pd(self.twiddle14re, x2p27); let t_a7_3 = _mm_mul_pd(self.twiddle8re, x3p26); let t_a7_4 = _mm_mul_pd(self.twiddle1re, x4p25); let t_a7_5 = _mm_mul_pd(self.twiddle6re, x5p24); let t_a7_6 = _mm_mul_pd(self.twiddle13re, x6p23); let t_a7_7 = _mm_mul_pd(self.twiddle9re, x7p22); let t_a7_8 = _mm_mul_pd(self.twiddle2re, x8p21); let t_a7_9 = _mm_mul_pd(self.twiddle5re, x9p20); let t_a7_10 = 
_mm_mul_pd(self.twiddle12re, x10p19); let t_a7_11 = _mm_mul_pd(self.twiddle10re, x11p18); let t_a7_12 = _mm_mul_pd(self.twiddle3re, x12p17); let t_a7_13 = _mm_mul_pd(self.twiddle4re, x13p16); let t_a7_14 = _mm_mul_pd(self.twiddle11re, x14p15); let t_a8_1 = _mm_mul_pd(self.twiddle8re, x1p28); let t_a8_2 = _mm_mul_pd(self.twiddle13re, x2p27); let t_a8_3 = _mm_mul_pd(self.twiddle5re, x3p26); let t_a8_4 = _mm_mul_pd(self.twiddle3re, x4p25); let t_a8_5 = _mm_mul_pd(self.twiddle11re, x5p24); let t_a8_6 = _mm_mul_pd(self.twiddle10re, x6p23); let t_a8_7 = _mm_mul_pd(self.twiddle2re, x7p22); let t_a8_8 = _mm_mul_pd(self.twiddle6re, x8p21); let t_a8_9 = _mm_mul_pd(self.twiddle14re, x9p20); let t_a8_10 = _mm_mul_pd(self.twiddle7re, x10p19); let t_a8_11 = _mm_mul_pd(self.twiddle1re, x11p18); let t_a8_12 = _mm_mul_pd(self.twiddle9re, x12p17); let t_a8_13 = _mm_mul_pd(self.twiddle12re, x13p16); let t_a8_14 = _mm_mul_pd(self.twiddle4re, x14p15); let t_a9_1 = _mm_mul_pd(self.twiddle9re, x1p28); let t_a9_2 = _mm_mul_pd(self.twiddle11re, x2p27); let t_a9_3 = _mm_mul_pd(self.twiddle2re, x3p26); let t_a9_4 = _mm_mul_pd(self.twiddle7re, x4p25); let t_a9_5 = _mm_mul_pd(self.twiddle13re, x5p24); let t_a9_6 = _mm_mul_pd(self.twiddle4re, x6p23); let t_a9_7 = _mm_mul_pd(self.twiddle5re, x7p22); let t_a9_8 = _mm_mul_pd(self.twiddle14re, x8p21); let t_a9_9 = _mm_mul_pd(self.twiddle6re, x9p20); let t_a9_10 = _mm_mul_pd(self.twiddle3re, x10p19); let t_a9_11 = _mm_mul_pd(self.twiddle12re, x11p18); let t_a9_12 = _mm_mul_pd(self.twiddle8re, x12p17); let t_a9_13 = _mm_mul_pd(self.twiddle1re, x13p16); let t_a9_14 = _mm_mul_pd(self.twiddle10re, x14p15); let t_a10_1 = _mm_mul_pd(self.twiddle10re, x1p28); let t_a10_2 = _mm_mul_pd(self.twiddle9re, x2p27); let t_a10_3 = _mm_mul_pd(self.twiddle1re, x3p26); let t_a10_4 = _mm_mul_pd(self.twiddle11re, x4p25); let t_a10_5 = _mm_mul_pd(self.twiddle8re, x5p24); let t_a10_6 = _mm_mul_pd(self.twiddle2re, x6p23); let t_a10_7 = _mm_mul_pd(self.twiddle12re, x7p22); let t_a10_8 = _mm_mul_pd(self.twiddle7re, x8p21); let t_a10_9 = _mm_mul_pd(self.twiddle3re, x9p20); let t_a10_10 = _mm_mul_pd(self.twiddle13re, x10p19); let t_a10_11 = _mm_mul_pd(self.twiddle6re, x11p18); let t_a10_12 = _mm_mul_pd(self.twiddle4re, x12p17); let t_a10_13 = _mm_mul_pd(self.twiddle14re, x13p16); let t_a10_14 = _mm_mul_pd(self.twiddle5re, x14p15); let t_a11_1 = _mm_mul_pd(self.twiddle11re, x1p28); let t_a11_2 = _mm_mul_pd(self.twiddle7re, x2p27); let t_a11_3 = _mm_mul_pd(self.twiddle4re, x3p26); let t_a11_4 = _mm_mul_pd(self.twiddle14re, x4p25); let t_a11_5 = _mm_mul_pd(self.twiddle3re, x5p24); let t_a11_6 = _mm_mul_pd(self.twiddle8re, x6p23); let t_a11_7 = _mm_mul_pd(self.twiddle10re, x7p22); let t_a11_8 = _mm_mul_pd(self.twiddle1re, x8p21); let t_a11_9 = _mm_mul_pd(self.twiddle12re, x9p20); let t_a11_10 = _mm_mul_pd(self.twiddle6re, x10p19); let t_a11_11 = _mm_mul_pd(self.twiddle5re, x11p18); let t_a11_12 = _mm_mul_pd(self.twiddle13re, x12p17); let t_a11_13 = _mm_mul_pd(self.twiddle2re, x13p16); let t_a11_14 = _mm_mul_pd(self.twiddle9re, x14p15); let t_a12_1 = _mm_mul_pd(self.twiddle12re, x1p28); let t_a12_2 = _mm_mul_pd(self.twiddle5re, x2p27); let t_a12_3 = _mm_mul_pd(self.twiddle7re, x3p26); let t_a12_4 = _mm_mul_pd(self.twiddle10re, x4p25); let t_a12_5 = _mm_mul_pd(self.twiddle2re, x5p24); let t_a12_6 = _mm_mul_pd(self.twiddle14re, x6p23); let t_a12_7 = _mm_mul_pd(self.twiddle3re, x7p22); let t_a12_8 = _mm_mul_pd(self.twiddle9re, x8p21); let t_a12_9 = _mm_mul_pd(self.twiddle8re, x9p20); let t_a12_10 = 
_mm_mul_pd(self.twiddle4re, x10p19); let t_a12_11 = _mm_mul_pd(self.twiddle13re, x11p18); let t_a12_12 = _mm_mul_pd(self.twiddle1re, x12p17); let t_a12_13 = _mm_mul_pd(self.twiddle11re, x13p16); let t_a12_14 = _mm_mul_pd(self.twiddle6re, x14p15); let t_a13_1 = _mm_mul_pd(self.twiddle13re, x1p28); let t_a13_2 = _mm_mul_pd(self.twiddle3re, x2p27); let t_a13_3 = _mm_mul_pd(self.twiddle10re, x3p26); let t_a13_4 = _mm_mul_pd(self.twiddle6re, x4p25); let t_a13_5 = _mm_mul_pd(self.twiddle7re, x5p24); let t_a13_6 = _mm_mul_pd(self.twiddle9re, x6p23); let t_a13_7 = _mm_mul_pd(self.twiddle4re, x7p22); let t_a13_8 = _mm_mul_pd(self.twiddle12re, x8p21); let t_a13_9 = _mm_mul_pd(self.twiddle1re, x9p20); let t_a13_10 = _mm_mul_pd(self.twiddle14re, x10p19); let t_a13_11 = _mm_mul_pd(self.twiddle2re, x11p18); let t_a13_12 = _mm_mul_pd(self.twiddle11re, x12p17); let t_a13_13 = _mm_mul_pd(self.twiddle5re, x13p16); let t_a13_14 = _mm_mul_pd(self.twiddle8re, x14p15); let t_a14_1 = _mm_mul_pd(self.twiddle14re, x1p28); let t_a14_2 = _mm_mul_pd(self.twiddle1re, x2p27); let t_a14_3 = _mm_mul_pd(self.twiddle13re, x3p26); let t_a14_4 = _mm_mul_pd(self.twiddle2re, x4p25); let t_a14_5 = _mm_mul_pd(self.twiddle12re, x5p24); let t_a14_6 = _mm_mul_pd(self.twiddle3re, x6p23); let t_a14_7 = _mm_mul_pd(self.twiddle11re, x7p22); let t_a14_8 = _mm_mul_pd(self.twiddle4re, x8p21); let t_a14_9 = _mm_mul_pd(self.twiddle10re, x9p20); let t_a14_10 = _mm_mul_pd(self.twiddle5re, x10p19); let t_a14_11 = _mm_mul_pd(self.twiddle9re, x11p18); let t_a14_12 = _mm_mul_pd(self.twiddle6re, x12p17); let t_a14_13 = _mm_mul_pd(self.twiddle8re, x13p16); let t_a14_14 = _mm_mul_pd(self.twiddle7re, x14p15); let t_b1_1 = _mm_mul_pd(self.twiddle1im, x1m28); let t_b1_2 = _mm_mul_pd(self.twiddle2im, x2m27); let t_b1_3 = _mm_mul_pd(self.twiddle3im, x3m26); let t_b1_4 = _mm_mul_pd(self.twiddle4im, x4m25); let t_b1_5 = _mm_mul_pd(self.twiddle5im, x5m24); let t_b1_6 = _mm_mul_pd(self.twiddle6im, x6m23); let t_b1_7 = _mm_mul_pd(self.twiddle7im, x7m22); let t_b1_8 = _mm_mul_pd(self.twiddle8im, x8m21); let t_b1_9 = _mm_mul_pd(self.twiddle9im, x9m20); let t_b1_10 = _mm_mul_pd(self.twiddle10im, x10m19); let t_b1_11 = _mm_mul_pd(self.twiddle11im, x11m18); let t_b1_12 = _mm_mul_pd(self.twiddle12im, x12m17); let t_b1_13 = _mm_mul_pd(self.twiddle13im, x13m16); let t_b1_14 = _mm_mul_pd(self.twiddle14im, x14m15); let t_b2_1 = _mm_mul_pd(self.twiddle2im, x1m28); let t_b2_2 = _mm_mul_pd(self.twiddle4im, x2m27); let t_b2_3 = _mm_mul_pd(self.twiddle6im, x3m26); let t_b2_4 = _mm_mul_pd(self.twiddle8im, x4m25); let t_b2_5 = _mm_mul_pd(self.twiddle10im, x5m24); let t_b2_6 = _mm_mul_pd(self.twiddle12im, x6m23); let t_b2_7 = _mm_mul_pd(self.twiddle14im, x7m22); let t_b2_8 = _mm_mul_pd(self.twiddle13im, x8m21); let t_b2_9 = _mm_mul_pd(self.twiddle11im, x9m20); let t_b2_10 = _mm_mul_pd(self.twiddle9im, x10m19); let t_b2_11 = _mm_mul_pd(self.twiddle7im, x11m18); let t_b2_12 = _mm_mul_pd(self.twiddle5im, x12m17); let t_b2_13 = _mm_mul_pd(self.twiddle3im, x13m16); let t_b2_14 = _mm_mul_pd(self.twiddle1im, x14m15); let t_b3_1 = _mm_mul_pd(self.twiddle3im, x1m28); let t_b3_2 = _mm_mul_pd(self.twiddle6im, x2m27); let t_b3_3 = _mm_mul_pd(self.twiddle9im, x3m26); let t_b3_4 = _mm_mul_pd(self.twiddle12im, x4m25); let t_b3_5 = _mm_mul_pd(self.twiddle14im, x5m24); let t_b3_6 = _mm_mul_pd(self.twiddle11im, x6m23); let t_b3_7 = _mm_mul_pd(self.twiddle8im, x7m22); let t_b3_8 = _mm_mul_pd(self.twiddle5im, x8m21); let t_b3_9 = _mm_mul_pd(self.twiddle2im, x9m20); let t_b3_10 = 
_mm_mul_pd(self.twiddle1im, x10m19); let t_b3_11 = _mm_mul_pd(self.twiddle4im, x11m18); let t_b3_12 = _mm_mul_pd(self.twiddle7im, x12m17); let t_b3_13 = _mm_mul_pd(self.twiddle10im, x13m16); let t_b3_14 = _mm_mul_pd(self.twiddle13im, x14m15); let t_b4_1 = _mm_mul_pd(self.twiddle4im, x1m28); let t_b4_2 = _mm_mul_pd(self.twiddle8im, x2m27); let t_b4_3 = _mm_mul_pd(self.twiddle12im, x3m26); let t_b4_4 = _mm_mul_pd(self.twiddle13im, x4m25); let t_b4_5 = _mm_mul_pd(self.twiddle9im, x5m24); let t_b4_6 = _mm_mul_pd(self.twiddle5im, x6m23); let t_b4_7 = _mm_mul_pd(self.twiddle1im, x7m22); let t_b4_8 = _mm_mul_pd(self.twiddle3im, x8m21); let t_b4_9 = _mm_mul_pd(self.twiddle7im, x9m20); let t_b4_10 = _mm_mul_pd(self.twiddle11im, x10m19); let t_b4_11 = _mm_mul_pd(self.twiddle14im, x11m18); let t_b4_12 = _mm_mul_pd(self.twiddle10im, x12m17); let t_b4_13 = _mm_mul_pd(self.twiddle6im, x13m16); let t_b4_14 = _mm_mul_pd(self.twiddle2im, x14m15); let t_b5_1 = _mm_mul_pd(self.twiddle5im, x1m28); let t_b5_2 = _mm_mul_pd(self.twiddle10im, x2m27); let t_b5_3 = _mm_mul_pd(self.twiddle14im, x3m26); let t_b5_4 = _mm_mul_pd(self.twiddle9im, x4m25); let t_b5_5 = _mm_mul_pd(self.twiddle4im, x5m24); let t_b5_6 = _mm_mul_pd(self.twiddle1im, x6m23); let t_b5_7 = _mm_mul_pd(self.twiddle6im, x7m22); let t_b5_8 = _mm_mul_pd(self.twiddle11im, x8m21); let t_b5_9 = _mm_mul_pd(self.twiddle13im, x9m20); let t_b5_10 = _mm_mul_pd(self.twiddle8im, x10m19); let t_b5_11 = _mm_mul_pd(self.twiddle3im, x11m18); let t_b5_12 = _mm_mul_pd(self.twiddle2im, x12m17); let t_b5_13 = _mm_mul_pd(self.twiddle7im, x13m16); let t_b5_14 = _mm_mul_pd(self.twiddle12im, x14m15); let t_b6_1 = _mm_mul_pd(self.twiddle6im, x1m28); let t_b6_2 = _mm_mul_pd(self.twiddle12im, x2m27); let t_b6_3 = _mm_mul_pd(self.twiddle11im, x3m26); let t_b6_4 = _mm_mul_pd(self.twiddle5im, x4m25); let t_b6_5 = _mm_mul_pd(self.twiddle1im, x5m24); let t_b6_6 = _mm_mul_pd(self.twiddle7im, x6m23); let t_b6_7 = _mm_mul_pd(self.twiddle13im, x7m22); let t_b6_8 = _mm_mul_pd(self.twiddle10im, x8m21); let t_b6_9 = _mm_mul_pd(self.twiddle4im, x9m20); let t_b6_10 = _mm_mul_pd(self.twiddle2im, x10m19); let t_b6_11 = _mm_mul_pd(self.twiddle8im, x11m18); let t_b6_12 = _mm_mul_pd(self.twiddle14im, x12m17); let t_b6_13 = _mm_mul_pd(self.twiddle9im, x13m16); let t_b6_14 = _mm_mul_pd(self.twiddle3im, x14m15); let t_b7_1 = _mm_mul_pd(self.twiddle7im, x1m28); let t_b7_2 = _mm_mul_pd(self.twiddle14im, x2m27); let t_b7_3 = _mm_mul_pd(self.twiddle8im, x3m26); let t_b7_4 = _mm_mul_pd(self.twiddle1im, x4m25); let t_b7_5 = _mm_mul_pd(self.twiddle6im, x5m24); let t_b7_6 = _mm_mul_pd(self.twiddle13im, x6m23); let t_b7_7 = _mm_mul_pd(self.twiddle9im, x7m22); let t_b7_8 = _mm_mul_pd(self.twiddle2im, x8m21); let t_b7_9 = _mm_mul_pd(self.twiddle5im, x9m20); let t_b7_10 = _mm_mul_pd(self.twiddle12im, x10m19); let t_b7_11 = _mm_mul_pd(self.twiddle10im, x11m18); let t_b7_12 = _mm_mul_pd(self.twiddle3im, x12m17); let t_b7_13 = _mm_mul_pd(self.twiddle4im, x13m16); let t_b7_14 = _mm_mul_pd(self.twiddle11im, x14m15); let t_b8_1 = _mm_mul_pd(self.twiddle8im, x1m28); let t_b8_2 = _mm_mul_pd(self.twiddle13im, x2m27); let t_b8_3 = _mm_mul_pd(self.twiddle5im, x3m26); let t_b8_4 = _mm_mul_pd(self.twiddle3im, x4m25); let t_b8_5 = _mm_mul_pd(self.twiddle11im, x5m24); let t_b8_6 = _mm_mul_pd(self.twiddle10im, x6m23); let t_b8_7 = _mm_mul_pd(self.twiddle2im, x7m22); let t_b8_8 = _mm_mul_pd(self.twiddle6im, x8m21); let t_b8_9 = _mm_mul_pd(self.twiddle14im, x9m20); let t_b8_10 = _mm_mul_pd(self.twiddle7im, x10m19); let t_b8_11 
= _mm_mul_pd(self.twiddle1im, x11m18); let t_b8_12 = _mm_mul_pd(self.twiddle9im, x12m17); let t_b8_13 = _mm_mul_pd(self.twiddle12im, x13m16); let t_b8_14 = _mm_mul_pd(self.twiddle4im, x14m15); let t_b9_1 = _mm_mul_pd(self.twiddle9im, x1m28); let t_b9_2 = _mm_mul_pd(self.twiddle11im, x2m27); let t_b9_3 = _mm_mul_pd(self.twiddle2im, x3m26); let t_b9_4 = _mm_mul_pd(self.twiddle7im, x4m25); let t_b9_5 = _mm_mul_pd(self.twiddle13im, x5m24); let t_b9_6 = _mm_mul_pd(self.twiddle4im, x6m23); let t_b9_7 = _mm_mul_pd(self.twiddle5im, x7m22); let t_b9_8 = _mm_mul_pd(self.twiddle14im, x8m21); let t_b9_9 = _mm_mul_pd(self.twiddle6im, x9m20); let t_b9_10 = _mm_mul_pd(self.twiddle3im, x10m19); let t_b9_11 = _mm_mul_pd(self.twiddle12im, x11m18); let t_b9_12 = _mm_mul_pd(self.twiddle8im, x12m17); let t_b9_13 = _mm_mul_pd(self.twiddle1im, x13m16); let t_b9_14 = _mm_mul_pd(self.twiddle10im, x14m15); let t_b10_1 = _mm_mul_pd(self.twiddle10im, x1m28); let t_b10_2 = _mm_mul_pd(self.twiddle9im, x2m27); let t_b10_3 = _mm_mul_pd(self.twiddle1im, x3m26); let t_b10_4 = _mm_mul_pd(self.twiddle11im, x4m25); let t_b10_5 = _mm_mul_pd(self.twiddle8im, x5m24); let t_b10_6 = _mm_mul_pd(self.twiddle2im, x6m23); let t_b10_7 = _mm_mul_pd(self.twiddle12im, x7m22); let t_b10_8 = _mm_mul_pd(self.twiddle7im, x8m21); let t_b10_9 = _mm_mul_pd(self.twiddle3im, x9m20); let t_b10_10 = _mm_mul_pd(self.twiddle13im, x10m19); let t_b10_11 = _mm_mul_pd(self.twiddle6im, x11m18); let t_b10_12 = _mm_mul_pd(self.twiddle4im, x12m17); let t_b10_13 = _mm_mul_pd(self.twiddle14im, x13m16); let t_b10_14 = _mm_mul_pd(self.twiddle5im, x14m15); let t_b11_1 = _mm_mul_pd(self.twiddle11im, x1m28); let t_b11_2 = _mm_mul_pd(self.twiddle7im, x2m27); let t_b11_3 = _mm_mul_pd(self.twiddle4im, x3m26); let t_b11_4 = _mm_mul_pd(self.twiddle14im, x4m25); let t_b11_5 = _mm_mul_pd(self.twiddle3im, x5m24); let t_b11_6 = _mm_mul_pd(self.twiddle8im, x6m23); let t_b11_7 = _mm_mul_pd(self.twiddle10im, x7m22); let t_b11_8 = _mm_mul_pd(self.twiddle1im, x8m21); let t_b11_9 = _mm_mul_pd(self.twiddle12im, x9m20); let t_b11_10 = _mm_mul_pd(self.twiddle6im, x10m19); let t_b11_11 = _mm_mul_pd(self.twiddle5im, x11m18); let t_b11_12 = _mm_mul_pd(self.twiddle13im, x12m17); let t_b11_13 = _mm_mul_pd(self.twiddle2im, x13m16); let t_b11_14 = _mm_mul_pd(self.twiddle9im, x14m15); let t_b12_1 = _mm_mul_pd(self.twiddle12im, x1m28); let t_b12_2 = _mm_mul_pd(self.twiddle5im, x2m27); let t_b12_3 = _mm_mul_pd(self.twiddle7im, x3m26); let t_b12_4 = _mm_mul_pd(self.twiddle10im, x4m25); let t_b12_5 = _mm_mul_pd(self.twiddle2im, x5m24); let t_b12_6 = _mm_mul_pd(self.twiddle14im, x6m23); let t_b12_7 = _mm_mul_pd(self.twiddle3im, x7m22); let t_b12_8 = _mm_mul_pd(self.twiddle9im, x8m21); let t_b12_9 = _mm_mul_pd(self.twiddle8im, x9m20); let t_b12_10 = _mm_mul_pd(self.twiddle4im, x10m19); let t_b12_11 = _mm_mul_pd(self.twiddle13im, x11m18); let t_b12_12 = _mm_mul_pd(self.twiddle1im, x12m17); let t_b12_13 = _mm_mul_pd(self.twiddle11im, x13m16); let t_b12_14 = _mm_mul_pd(self.twiddle6im, x14m15); let t_b13_1 = _mm_mul_pd(self.twiddle13im, x1m28); let t_b13_2 = _mm_mul_pd(self.twiddle3im, x2m27); let t_b13_3 = _mm_mul_pd(self.twiddle10im, x3m26); let t_b13_4 = _mm_mul_pd(self.twiddle6im, x4m25); let t_b13_5 = _mm_mul_pd(self.twiddle7im, x5m24); let t_b13_6 = _mm_mul_pd(self.twiddle9im, x6m23); let t_b13_7 = _mm_mul_pd(self.twiddle4im, x7m22); let t_b13_8 = _mm_mul_pd(self.twiddle12im, x8m21); let t_b13_9 = _mm_mul_pd(self.twiddle1im, x9m20); let t_b13_10 = _mm_mul_pd(self.twiddle14im, x10m19); let 
t_b13_11 = _mm_mul_pd(self.twiddle2im, x11m18); let t_b13_12 = _mm_mul_pd(self.twiddle11im, x12m17); let t_b13_13 = _mm_mul_pd(self.twiddle5im, x13m16); let t_b13_14 = _mm_mul_pd(self.twiddle8im, x14m15); let t_b14_1 = _mm_mul_pd(self.twiddle14im, x1m28); let t_b14_2 = _mm_mul_pd(self.twiddle1im, x2m27); let t_b14_3 = _mm_mul_pd(self.twiddle13im, x3m26); let t_b14_4 = _mm_mul_pd(self.twiddle2im, x4m25); let t_b14_5 = _mm_mul_pd(self.twiddle12im, x5m24); let t_b14_6 = _mm_mul_pd(self.twiddle3im, x6m23); let t_b14_7 = _mm_mul_pd(self.twiddle11im, x7m22); let t_b14_8 = _mm_mul_pd(self.twiddle4im, x8m21); let t_b14_9 = _mm_mul_pd(self.twiddle10im, x9m20); let t_b14_10 = _mm_mul_pd(self.twiddle5im, x10m19); let t_b14_11 = _mm_mul_pd(self.twiddle9im, x11m18); let t_b14_12 = _mm_mul_pd(self.twiddle6im, x12m17); let t_b14_13 = _mm_mul_pd(self.twiddle8im, x13m16); let t_b14_14 = _mm_mul_pd(self.twiddle7im, x14m15); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14); let t_a7 = calc_f64!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14); let t_a8 = calc_f64!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14); let t_a9 = calc_f64!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14); let t_a10 = calc_f64!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14); let t_a11 = calc_f64!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14); let t_a12 = calc_f64!(x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14); let t_a13 = calc_f64!(x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14); let t_a14 = calc_f64!(x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 
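// Note (explanatory comment, not generated code): the twiddle indices and the +/- sign
// patterns in these t_a*/t_b* sums fold w^(jk) back into the first half of the twiddle
// table. With m = (j * k) % 29, the code uses twiddle m directly when m <= 14, and
// twiddle (29 - m) when m > 14, keeping the Re contribution unchanged (cosine is even)
// and negating the Im contribution (sine is odd). A sketch of that folding rule as a
// hypothetical helper:
//
//     fn folded_twiddle(j: usize, k: usize, n: usize) -> (usize, f64) {
//         let m = (j * k) % n;
//         if m <= n / 2 { (m, 1.0) } else { (n - m, -1.0) } // (table index, sign for Im)
//     }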
+ t_b3_3 + t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 + t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14); let t_b4 = calc_f64!(t_b4_1 + t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + t_b4_9 + t_b4_10 - t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14); let t_b5 = calc_f64!(t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6 + t_b5_7 + t_b5_8 - t_b5_9 - t_b5_10 - t_b5_11 + t_b5_12 + t_b5_13 + t_b5_14); let t_b6 = calc_f64!(t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 + t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 + t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14); let t_b7 = calc_f64!(t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 - t_b7_11 - t_b7_12 + t_b7_13 + t_b7_14); let t_b8 = calc_f64!(t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 + t_b8_11 + t_b8_12 - t_b8_13 - t_b8_14); let t_b9 = calc_f64!(t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 - t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 + t_b9_10 + t_b9_11 - t_b9_12 + t_b9_13 + t_b9_14); let t_b10 = calc_f64!(t_b10_1 - t_b10_2 + t_b10_3 + t_b10_4 - t_b10_5 + t_b10_6 + t_b10_7 - t_b10_8 + t_b10_9 + t_b10_10 - t_b10_11 + t_b10_12 + t_b10_13 - t_b10_14); let t_b11 = calc_f64!(t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 - t_b11_5 + t_b11_6 - t_b11_7 + t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11 - t_b11_12 - t_b11_13 + t_b11_14); let t_b12 = calc_f64!(t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 + t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 - t_b12_9 + t_b12_10 - t_b12_11 - t_b12_12 + t_b12_13 - t_b12_14); let t_b13 = calc_f64!(t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 + t_b13_7 - t_b13_8 + t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 - t_b13_13 + t_b13_14); let t_b14 = calc_f64!(t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 + t_b14_11 - t_b14_12 + t_b14_13 - t_b14_14); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let t_b10_rot = self.rotate.rotate(t_b10); let t_b11_rot = self.rotate.rotate(t_b11); let t_b12_rot = self.rotate.rotate(t_b12); let t_b13_rot = self.rotate.rotate(t_b13); let t_b14_rot = self.rotate.rotate(t_b14); let y0 = calc_f64!(x0 + x1p28 + x2p27 + x3p26 + x4p25 + x5p24 + x6p23 + x7p22 + x8p21 + x9p20 + x10p19 + x11p18 + x12p17 + x13p16 + x14p15); let [y1, y28] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y27] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y26] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y25] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y24] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y23] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y22] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y21] = solo_fft2_f64(t_a8, t_b8_rot); let [y9, y20] = solo_fft2_f64(t_a9, t_b9_rot); let [y10, y19] = solo_fft2_f64(t_a10, t_b10_rot); let [y11, y18] = solo_fft2_f64(t_a11, t_b11_rot); let [y12, y17] = solo_fft2_f64(t_a12, t_b12_rot); let [y13, y16] = solo_fft2_f64(t_a13, t_b13_rot); let [y14, y15] = solo_fft2_f64(t_a14, t_b14_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, y23, y24, y25, y26, y27, y28] } } // _____ _ _________ _ _ _ // |___ // | |___ /___ \| |__ (_) |_ // |_ \| | _____ |_ \ __) | '_ \| | __| // ___) | | |_____| ___) / __/| |_) | | |_ // 
|____/|_|         |____/_____|_.__/|_|\__|
//

pub struct SseF32Butterfly31<T> {
    direction: FftDirection,
    _phantom: std::marker::PhantomData<T>,
    rotate: Rotate90F32,
    twiddle1re: __m128,
    twiddle1im: __m128,
    twiddle2re: __m128,
    twiddle2im: __m128,
    twiddle3re: __m128,
    twiddle3im: __m128,
    twiddle4re: __m128,
    twiddle4im: __m128,
    twiddle5re: __m128,
    twiddle5im: __m128,
    twiddle6re: __m128,
    twiddle6im: __m128,
    twiddle7re: __m128,
    twiddle7im: __m128,
    twiddle8re: __m128,
    twiddle8im: __m128,
    twiddle9re: __m128,
    twiddle9im: __m128,
    twiddle10re: __m128,
    twiddle10im: __m128,
    twiddle11re: __m128,
    twiddle11im: __m128,
    twiddle12re: __m128,
    twiddle12im: __m128,
    twiddle13re: __m128,
    twiddle13im: __m128,
    twiddle14re: __m128,
    twiddle14im: __m128,
    twiddle15re: __m128,
    twiddle15im: __m128,
}

boilerplate_fft_sse_f32_butterfly!(SseF32Butterfly31, 31, |this: &SseF32Butterfly31<_>| this
    .direction);
boilerplate_fft_sse_common_butterfly!(SseF32Butterfly31, 31, |this: &SseF32Butterfly31<_>| this
    .direction);
impl<T: FftNum> SseF32Butterfly31<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        assert_f32::<T>();
        let rotate = Rotate90F32::new(true);
        let tw1: Complex<f32> = twiddles::compute_twiddle(1, 31, direction);
        let tw2: Complex<f32> = twiddles::compute_twiddle(2, 31, direction);
        let tw3: Complex<f32> = twiddles::compute_twiddle(3, 31, direction);
        let tw4: Complex<f32> = twiddles::compute_twiddle(4, 31, direction);
        let tw5: Complex<f32> = twiddles::compute_twiddle(5, 31, direction);
        let tw6: Complex<f32> = twiddles::compute_twiddle(6, 31, direction);
        let tw7: Complex<f32> = twiddles::compute_twiddle(7, 31, direction);
        let tw8: Complex<f32> = twiddles::compute_twiddle(8, 31, direction);
        let tw9: Complex<f32> = twiddles::compute_twiddle(9, 31, direction);
        let tw10: Complex<f32> = twiddles::compute_twiddle(10, 31, direction);
        let tw11: Complex<f32> = twiddles::compute_twiddle(11, 31, direction);
        let tw12: Complex<f32> = twiddles::compute_twiddle(12, 31, direction);
        let tw13: Complex<f32> = twiddles::compute_twiddle(13, 31, direction);
        let tw14: Complex<f32> = twiddles::compute_twiddle(14, 31, direction);
        let tw15: Complex<f32> = twiddles::compute_twiddle(15, 31, direction);
        let twiddle1re = unsafe { _mm_load1_ps(&tw1.re) };
        let twiddle1im = unsafe { _mm_load1_ps(&tw1.im) };
        let twiddle2re = unsafe { _mm_load1_ps(&tw2.re) };
        let twiddle2im = unsafe { _mm_load1_ps(&tw2.im) };
        let twiddle3re = unsafe { _mm_load1_ps(&tw3.re) };
        let twiddle3im = unsafe { _mm_load1_ps(&tw3.im) };
        let twiddle4re = unsafe { _mm_load1_ps(&tw4.re) };
        let twiddle4im = unsafe { _mm_load1_ps(&tw4.im) };
        let twiddle5re = unsafe { _mm_load1_ps(&tw5.re) };
        let twiddle5im = unsafe { _mm_load1_ps(&tw5.im) };
        let twiddle6re = unsafe { _mm_load1_ps(&tw6.re) };
        let twiddle6im = unsafe { _mm_load1_ps(&tw6.im) };
        let twiddle7re = unsafe { _mm_load1_ps(&tw7.re) };
        let twiddle7im = unsafe { _mm_load1_ps(&tw7.im) };
        let twiddle8re = unsafe { _mm_load1_ps(&tw8.re) };
        let twiddle8im = unsafe { _mm_load1_ps(&tw8.im) };
        let twiddle9re = unsafe { _mm_load1_ps(&tw9.re) };
        let twiddle9im = unsafe { _mm_load1_ps(&tw9.im) };
        let twiddle10re = unsafe { _mm_load1_ps(&tw10.re) };
        let twiddle10im = unsafe { _mm_load1_ps(&tw10.im) };
        let twiddle11re = unsafe { _mm_load1_ps(&tw11.re) };
        let twiddle11im = unsafe { _mm_load1_ps(&tw11.im) };
        let twiddle12re = unsafe { _mm_load1_ps(&tw12.re) };
        let twiddle12im = unsafe { _mm_load1_ps(&tw12.im) };
        let twiddle13re = unsafe { _mm_load1_ps(&tw13.re) };
        let twiddle13im = unsafe { _mm_load1_ps(&tw13.im) };
        let twiddle14re = unsafe { _mm_load1_ps(&tw14.re) };
        let twiddle14im = unsafe { _mm_load1_ps(&tw14.im) };
        let twiddle15re = unsafe { _mm_load1_ps(&tw15.re) };
        let twiddle15im = unsafe { _mm_load1_ps(&tw15.im) };
        Self {
            direction,
            _phantom: std::marker::PhantomData,
            rotate,
            twiddle1re,
            twiddle1im,
            twiddle2re,
            twiddle2im,
            twiddle3re,
            twiddle3im,
            twiddle4re,
            twiddle4im,
            twiddle5re,
            twiddle5im,
            twiddle6re,
            twiddle6im,
            twiddle7re,
            twiddle7im,
            twiddle8re,
            twiddle8im,
            twiddle9re,
            twiddle9im,
            twiddle10re,
            twiddle10im,
            twiddle11re,
            twiddle11im,
            twiddle12re,
            twiddle12im,
            twiddle13re,
            twiddle13im,
            twiddle14re,
            twiddle14im,
            twiddle15re,
            twiddle15im,
        }
    }

    #[inline(always)]
    pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut<f32>) {
        let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30});
        let out = self.perform_parallel_fft_direct(values);
        write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30});
    }

    #[inline(always)]
    pub(crate) unsafe fn perform_parallel_fft_contiguous(&self, mut buffer: impl SseArrayMut<f32>) {
        let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60});
        let values = [
            extract_lo_hi_f32(input_packed[0], input_packed[15]),
            extract_hi_lo_f32(input_packed[0], input_packed[16]),
            extract_lo_hi_f32(input_packed[1], input_packed[16]),
            extract_hi_lo_f32(input_packed[1], input_packed[17]),
            extract_lo_hi_f32(input_packed[2], input_packed[17]),
            extract_hi_lo_f32(input_packed[2], input_packed[18]),
            extract_lo_hi_f32(input_packed[3], input_packed[18]),
            extract_hi_lo_f32(input_packed[3], input_packed[19]),
            extract_lo_hi_f32(input_packed[4], input_packed[19]),
            extract_hi_lo_f32(input_packed[4], input_packed[20]),
            extract_lo_hi_f32(input_packed[5], input_packed[20]),
            extract_hi_lo_f32(input_packed[5], input_packed[21]),
            extract_lo_hi_f32(input_packed[6], input_packed[21]),
            extract_hi_lo_f32(input_packed[6], input_packed[22]),
            extract_lo_hi_f32(input_packed[7], input_packed[22]),
            extract_hi_lo_f32(input_packed[7], input_packed[23]),
            extract_lo_hi_f32(input_packed[8], input_packed[23]),
            extract_hi_lo_f32(input_packed[8], input_packed[24]),
            extract_lo_hi_f32(input_packed[9], input_packed[24]),
            extract_hi_lo_f32(input_packed[9], input_packed[25]),
            extract_lo_hi_f32(input_packed[10], input_packed[25]),
            extract_hi_lo_f32(input_packed[10], input_packed[26]),
            extract_lo_hi_f32(input_packed[11], input_packed[26]),
            extract_hi_lo_f32(input_packed[11], input_packed[27]),
            extract_lo_hi_f32(input_packed[12], input_packed[27]),
            extract_hi_lo_f32(input_packed[12], input_packed[28]),
            extract_lo_hi_f32(input_packed[13], input_packed[28]),
            extract_hi_lo_f32(input_packed[13], input_packed[29]),
            extract_lo_hi_f32(input_packed[14], input_packed[29]),
            extract_hi_lo_f32(input_packed[14], input_packed[30]),
            extract_lo_hi_f32(input_packed[15], input_packed[30]),
        ];
        let out = self.perform_parallel_fft_direct(values);
        let out_packed = [
            extract_lo_lo_f32(out[0], out[1]),
            extract_lo_lo_f32(out[2], out[3]),
            extract_lo_lo_f32(out[4], out[5]),
            extract_lo_lo_f32(out[6], out[7]),
            extract_lo_lo_f32(out[8], out[9]),
            extract_lo_lo_f32(out[10], out[11]),
            extract_lo_lo_f32(out[12], out[13]),
            extract_lo_lo_f32(out[14], out[15]),
            extract_lo_lo_f32(out[16], out[17]),
            extract_lo_lo_f32(out[18], out[19]),
            extract_lo_lo_f32(out[20], out[21]),
            extract_lo_lo_f32(out[22], out[23]),
            extract_lo_lo_f32(out[24],
out[25]), extract_lo_lo_f32(out[26], out[27]), extract_lo_lo_f32(out[28], out[29]), extract_lo_hi_f32(out[30], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), extract_hi_hi_f32(out[19], out[20]), extract_hi_hi_f32(out[21], out[22]), extract_hi_hi_f32(out[23], out[24]), extract_hi_hi_f32(out[25], out[26]), extract_hi_hi_f32(out[27], out[28]), extract_hi_hi_f32(out[29], out[30]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [__m128; 31]) -> [__m128; 31] { let [x1p30, x1m30] = parallel_fft2_interleaved_f32(values[1], values[30]); let [x2p29, x2m29] = parallel_fft2_interleaved_f32(values[2], values[29]); let [x3p28, x3m28] = parallel_fft2_interleaved_f32(values[3], values[28]); let [x4p27, x4m27] = parallel_fft2_interleaved_f32(values[4], values[27]); let [x5p26, x5m26] = parallel_fft2_interleaved_f32(values[5], values[26]); let [x6p25, x6m25] = parallel_fft2_interleaved_f32(values[6], values[25]); let [x7p24, x7m24] = parallel_fft2_interleaved_f32(values[7], values[24]); let [x8p23, x8m23] = parallel_fft2_interleaved_f32(values[8], values[23]); let [x9p22, x9m22] = parallel_fft2_interleaved_f32(values[9], values[22]); let [x10p21, x10m21] = parallel_fft2_interleaved_f32(values[10], values[21]); let [x11p20, x11m20] = parallel_fft2_interleaved_f32(values[11], values[20]); let [x12p19, x12m19] = parallel_fft2_interleaved_f32(values[12], values[19]); let [x13p18, x13m18] = parallel_fft2_interleaved_f32(values[13], values[18]); let [x14p17, x14m17] = parallel_fft2_interleaved_f32(values[14], values[17]); let [x15p16, x15m16] = parallel_fft2_interleaved_f32(values[15], values[16]); let t_a1_1 = _mm_mul_ps(self.twiddle1re, x1p30); let t_a1_2 = _mm_mul_ps(self.twiddle2re, x2p29); let t_a1_3 = _mm_mul_ps(self.twiddle3re, x3p28); let t_a1_4 = _mm_mul_ps(self.twiddle4re, x4p27); let t_a1_5 = _mm_mul_ps(self.twiddle5re, x5p26); let t_a1_6 = _mm_mul_ps(self.twiddle6re, x6p25); let t_a1_7 = _mm_mul_ps(self.twiddle7re, x7p24); let t_a1_8 = _mm_mul_ps(self.twiddle8re, x8p23); let t_a1_9 = _mm_mul_ps(self.twiddle9re, x9p22); let t_a1_10 = _mm_mul_ps(self.twiddle10re, x10p21); let t_a1_11 = _mm_mul_ps(self.twiddle11re, x11p20); let t_a1_12 = _mm_mul_ps(self.twiddle12re, x12p19); let t_a1_13 = _mm_mul_ps(self.twiddle13re, x13p18); let t_a1_14 = _mm_mul_ps(self.twiddle14re, x14p17); let t_a1_15 = _mm_mul_ps(self.twiddle15re, x15p16); let t_a2_1 = _mm_mul_ps(self.twiddle2re, x1p30); let t_a2_2 = _mm_mul_ps(self.twiddle4re, x2p29); let t_a2_3 = _mm_mul_ps(self.twiddle6re, x3p28); let t_a2_4 = _mm_mul_ps(self.twiddle8re, x4p27); let t_a2_5 = _mm_mul_ps(self.twiddle10re, x5p26); let t_a2_6 = _mm_mul_ps(self.twiddle12re, x6p25); let t_a2_7 = _mm_mul_ps(self.twiddle14re, x7p24); let t_a2_8 = _mm_mul_ps(self.twiddle15re, x8p23); let t_a2_9 = _mm_mul_ps(self.twiddle13re, x9p22); let t_a2_10 = _mm_mul_ps(self.twiddle11re, x10p21); let t_a2_11 = _mm_mul_ps(self.twiddle9re, x11p20); let t_a2_12 = _mm_mul_ps(self.twiddle7re, x12p19); let t_a2_13 = _mm_mul_ps(self.twiddle5re, x13p18); let t_a2_14 = _mm_mul_ps(self.twiddle3re, x14p17); 
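// This size-31 butterfly follows the same structure as the other odd prime
// sizes in this file: the inputs were first combined into symmetric pairs
// x[k] + x[31-k] (the `x*p*` values) and antisymmetric pairs x[k] - x[31-k]
// (the `x*m*` values). Each t_a<n> sums the symmetric pairs weighted by the
// real parts of the twiddle factors, each t_b<n> sums the antisymmetric
// pairs weighted by the imaginary parts, and term k of group n uses the
// twiddle of index n*k mod 31, folded back into the range 1..=15. Because
// cosine is even while sine is odd, that folding is free for the t_a terms
// but shows up as the mixed +/- signs in the t_b sums further down. Finally
// each t_b<n> is rotated by 90 degrees and combined with t_a<n> in a radix-2
// butterfly to produce the output pair y[n] and y[31-n].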
let t_a2_15 = _mm_mul_ps(self.twiddle1re, x15p16); let t_a3_1 = _mm_mul_ps(self.twiddle3re, x1p30); let t_a3_2 = _mm_mul_ps(self.twiddle6re, x2p29); let t_a3_3 = _mm_mul_ps(self.twiddle9re, x3p28); let t_a3_4 = _mm_mul_ps(self.twiddle12re, x4p27); let t_a3_5 = _mm_mul_ps(self.twiddle15re, x5p26); let t_a3_6 = _mm_mul_ps(self.twiddle13re, x6p25); let t_a3_7 = _mm_mul_ps(self.twiddle10re, x7p24); let t_a3_8 = _mm_mul_ps(self.twiddle7re, x8p23); let t_a3_9 = _mm_mul_ps(self.twiddle4re, x9p22); let t_a3_10 = _mm_mul_ps(self.twiddle1re, x10p21); let t_a3_11 = _mm_mul_ps(self.twiddle2re, x11p20); let t_a3_12 = _mm_mul_ps(self.twiddle5re, x12p19); let t_a3_13 = _mm_mul_ps(self.twiddle8re, x13p18); let t_a3_14 = _mm_mul_ps(self.twiddle11re, x14p17); let t_a3_15 = _mm_mul_ps(self.twiddle14re, x15p16); let t_a4_1 = _mm_mul_ps(self.twiddle4re, x1p30); let t_a4_2 = _mm_mul_ps(self.twiddle8re, x2p29); let t_a4_3 = _mm_mul_ps(self.twiddle12re, x3p28); let t_a4_4 = _mm_mul_ps(self.twiddle15re, x4p27); let t_a4_5 = _mm_mul_ps(self.twiddle11re, x5p26); let t_a4_6 = _mm_mul_ps(self.twiddle7re, x6p25); let t_a4_7 = _mm_mul_ps(self.twiddle3re, x7p24); let t_a4_8 = _mm_mul_ps(self.twiddle1re, x8p23); let t_a4_9 = _mm_mul_ps(self.twiddle5re, x9p22); let t_a4_10 = _mm_mul_ps(self.twiddle9re, x10p21); let t_a4_11 = _mm_mul_ps(self.twiddle13re, x11p20); let t_a4_12 = _mm_mul_ps(self.twiddle14re, x12p19); let t_a4_13 = _mm_mul_ps(self.twiddle10re, x13p18); let t_a4_14 = _mm_mul_ps(self.twiddle6re, x14p17); let t_a4_15 = _mm_mul_ps(self.twiddle2re, x15p16); let t_a5_1 = _mm_mul_ps(self.twiddle5re, x1p30); let t_a5_2 = _mm_mul_ps(self.twiddle10re, x2p29); let t_a5_3 = _mm_mul_ps(self.twiddle15re, x3p28); let t_a5_4 = _mm_mul_ps(self.twiddle11re, x4p27); let t_a5_5 = _mm_mul_ps(self.twiddle6re, x5p26); let t_a5_6 = _mm_mul_ps(self.twiddle1re, x6p25); let t_a5_7 = _mm_mul_ps(self.twiddle4re, x7p24); let t_a5_8 = _mm_mul_ps(self.twiddle9re, x8p23); let t_a5_9 = _mm_mul_ps(self.twiddle14re, x9p22); let t_a5_10 = _mm_mul_ps(self.twiddle12re, x10p21); let t_a5_11 = _mm_mul_ps(self.twiddle7re, x11p20); let t_a5_12 = _mm_mul_ps(self.twiddle2re, x12p19); let t_a5_13 = _mm_mul_ps(self.twiddle3re, x13p18); let t_a5_14 = _mm_mul_ps(self.twiddle8re, x14p17); let t_a5_15 = _mm_mul_ps(self.twiddle13re, x15p16); let t_a6_1 = _mm_mul_ps(self.twiddle6re, x1p30); let t_a6_2 = _mm_mul_ps(self.twiddle12re, x2p29); let t_a6_3 = _mm_mul_ps(self.twiddle13re, x3p28); let t_a6_4 = _mm_mul_ps(self.twiddle7re, x4p27); let t_a6_5 = _mm_mul_ps(self.twiddle1re, x5p26); let t_a6_6 = _mm_mul_ps(self.twiddle5re, x6p25); let t_a6_7 = _mm_mul_ps(self.twiddle11re, x7p24); let t_a6_8 = _mm_mul_ps(self.twiddle14re, x8p23); let t_a6_9 = _mm_mul_ps(self.twiddle8re, x9p22); let t_a6_10 = _mm_mul_ps(self.twiddle2re, x10p21); let t_a6_11 = _mm_mul_ps(self.twiddle4re, x11p20); let t_a6_12 = _mm_mul_ps(self.twiddle10re, x12p19); let t_a6_13 = _mm_mul_ps(self.twiddle15re, x13p18); let t_a6_14 = _mm_mul_ps(self.twiddle9re, x14p17); let t_a6_15 = _mm_mul_ps(self.twiddle3re, x15p16); let t_a7_1 = _mm_mul_ps(self.twiddle7re, x1p30); let t_a7_2 = _mm_mul_ps(self.twiddle14re, x2p29); let t_a7_3 = _mm_mul_ps(self.twiddle10re, x3p28); let t_a7_4 = _mm_mul_ps(self.twiddle3re, x4p27); let t_a7_5 = _mm_mul_ps(self.twiddle4re, x5p26); let t_a7_6 = _mm_mul_ps(self.twiddle11re, x6p25); let t_a7_7 = _mm_mul_ps(self.twiddle13re, x7p24); let t_a7_8 = _mm_mul_ps(self.twiddle6re, x8p23); let t_a7_9 = _mm_mul_ps(self.twiddle1re, x9p22); let t_a7_10 = _mm_mul_ps(self.twiddle8re, 
x10p21); let t_a7_11 = _mm_mul_ps(self.twiddle15re, x11p20); let t_a7_12 = _mm_mul_ps(self.twiddle9re, x12p19); let t_a7_13 = _mm_mul_ps(self.twiddle2re, x13p18); let t_a7_14 = _mm_mul_ps(self.twiddle5re, x14p17); let t_a7_15 = _mm_mul_ps(self.twiddle12re, x15p16); let t_a8_1 = _mm_mul_ps(self.twiddle8re, x1p30); let t_a8_2 = _mm_mul_ps(self.twiddle15re, x2p29); let t_a8_3 = _mm_mul_ps(self.twiddle7re, x3p28); let t_a8_4 = _mm_mul_ps(self.twiddle1re, x4p27); let t_a8_5 = _mm_mul_ps(self.twiddle9re, x5p26); let t_a8_6 = _mm_mul_ps(self.twiddle14re, x6p25); let t_a8_7 = _mm_mul_ps(self.twiddle6re, x7p24); let t_a8_8 = _mm_mul_ps(self.twiddle2re, x8p23); let t_a8_9 = _mm_mul_ps(self.twiddle10re, x9p22); let t_a8_10 = _mm_mul_ps(self.twiddle13re, x10p21); let t_a8_11 = _mm_mul_ps(self.twiddle5re, x11p20); let t_a8_12 = _mm_mul_ps(self.twiddle3re, x12p19); let t_a8_13 = _mm_mul_ps(self.twiddle11re, x13p18); let t_a8_14 = _mm_mul_ps(self.twiddle12re, x14p17); let t_a8_15 = _mm_mul_ps(self.twiddle4re, x15p16); let t_a9_1 = _mm_mul_ps(self.twiddle9re, x1p30); let t_a9_2 = _mm_mul_ps(self.twiddle13re, x2p29); let t_a9_3 = _mm_mul_ps(self.twiddle4re, x3p28); let t_a9_4 = _mm_mul_ps(self.twiddle5re, x4p27); let t_a9_5 = _mm_mul_ps(self.twiddle14re, x5p26); let t_a9_6 = _mm_mul_ps(self.twiddle8re, x6p25); let t_a9_7 = _mm_mul_ps(self.twiddle1re, x7p24); let t_a9_8 = _mm_mul_ps(self.twiddle10re, x8p23); let t_a9_9 = _mm_mul_ps(self.twiddle12re, x9p22); let t_a9_10 = _mm_mul_ps(self.twiddle3re, x10p21); let t_a9_11 = _mm_mul_ps(self.twiddle6re, x11p20); let t_a9_12 = _mm_mul_ps(self.twiddle15re, x12p19); let t_a9_13 = _mm_mul_ps(self.twiddle7re, x13p18); let t_a9_14 = _mm_mul_ps(self.twiddle2re, x14p17); let t_a9_15 = _mm_mul_ps(self.twiddle11re, x15p16); let t_a10_1 = _mm_mul_ps(self.twiddle10re, x1p30); let t_a10_2 = _mm_mul_ps(self.twiddle11re, x2p29); let t_a10_3 = _mm_mul_ps(self.twiddle1re, x3p28); let t_a10_4 = _mm_mul_ps(self.twiddle9re, x4p27); let t_a10_5 = _mm_mul_ps(self.twiddle12re, x5p26); let t_a10_6 = _mm_mul_ps(self.twiddle2re, x6p25); let t_a10_7 = _mm_mul_ps(self.twiddle8re, x7p24); let t_a10_8 = _mm_mul_ps(self.twiddle13re, x8p23); let t_a10_9 = _mm_mul_ps(self.twiddle3re, x9p22); let t_a10_10 = _mm_mul_ps(self.twiddle7re, x10p21); let t_a10_11 = _mm_mul_ps(self.twiddle14re, x11p20); let t_a10_12 = _mm_mul_ps(self.twiddle4re, x12p19); let t_a10_13 = _mm_mul_ps(self.twiddle6re, x13p18); let t_a10_14 = _mm_mul_ps(self.twiddle15re, x14p17); let t_a10_15 = _mm_mul_ps(self.twiddle5re, x15p16); let t_a11_1 = _mm_mul_ps(self.twiddle11re, x1p30); let t_a11_2 = _mm_mul_ps(self.twiddle9re, x2p29); let t_a11_3 = _mm_mul_ps(self.twiddle2re, x3p28); let t_a11_4 = _mm_mul_ps(self.twiddle13re, x4p27); let t_a11_5 = _mm_mul_ps(self.twiddle7re, x5p26); let t_a11_6 = _mm_mul_ps(self.twiddle4re, x6p25); let t_a11_7 = _mm_mul_ps(self.twiddle15re, x7p24); let t_a11_8 = _mm_mul_ps(self.twiddle5re, x8p23); let t_a11_9 = _mm_mul_ps(self.twiddle6re, x9p22); let t_a11_10 = _mm_mul_ps(self.twiddle14re, x10p21); let t_a11_11 = _mm_mul_ps(self.twiddle3re, x11p20); let t_a11_12 = _mm_mul_ps(self.twiddle8re, x12p19); let t_a11_13 = _mm_mul_ps(self.twiddle12re, x13p18); let t_a11_14 = _mm_mul_ps(self.twiddle1re, x14p17); let t_a11_15 = _mm_mul_ps(self.twiddle10re, x15p16); let t_a12_1 = _mm_mul_ps(self.twiddle12re, x1p30); let t_a12_2 = _mm_mul_ps(self.twiddle7re, x2p29); let t_a12_3 = _mm_mul_ps(self.twiddle5re, x3p28); let t_a12_4 = _mm_mul_ps(self.twiddle14re, x4p27); let t_a12_5 = _mm_mul_ps(self.twiddle2re, 
x5p26); let t_a12_6 = _mm_mul_ps(self.twiddle10re, x6p25); let t_a12_7 = _mm_mul_ps(self.twiddle9re, x7p24); let t_a12_8 = _mm_mul_ps(self.twiddle3re, x8p23); let t_a12_9 = _mm_mul_ps(self.twiddle15re, x9p22); let t_a12_10 = _mm_mul_ps(self.twiddle4re, x10p21); let t_a12_11 = _mm_mul_ps(self.twiddle8re, x11p20); let t_a12_12 = _mm_mul_ps(self.twiddle11re, x12p19); let t_a12_13 = _mm_mul_ps(self.twiddle1re, x13p18); let t_a12_14 = _mm_mul_ps(self.twiddle13re, x14p17); let t_a12_15 = _mm_mul_ps(self.twiddle6re, x15p16); let t_a13_1 = _mm_mul_ps(self.twiddle13re, x1p30); let t_a13_2 = _mm_mul_ps(self.twiddle5re, x2p29); let t_a13_3 = _mm_mul_ps(self.twiddle8re, x3p28); let t_a13_4 = _mm_mul_ps(self.twiddle10re, x4p27); let t_a13_5 = _mm_mul_ps(self.twiddle3re, x5p26); let t_a13_6 = _mm_mul_ps(self.twiddle15re, x6p25); let t_a13_7 = _mm_mul_ps(self.twiddle2re, x7p24); let t_a13_8 = _mm_mul_ps(self.twiddle11re, x8p23); let t_a13_9 = _mm_mul_ps(self.twiddle7re, x9p22); let t_a13_10 = _mm_mul_ps(self.twiddle6re, x10p21); let t_a13_11 = _mm_mul_ps(self.twiddle12re, x11p20); let t_a13_12 = _mm_mul_ps(self.twiddle1re, x12p19); let t_a13_13 = _mm_mul_ps(self.twiddle14re, x13p18); let t_a13_14 = _mm_mul_ps(self.twiddle4re, x14p17); let t_a13_15 = _mm_mul_ps(self.twiddle9re, x15p16); let t_a14_1 = _mm_mul_ps(self.twiddle14re, x1p30); let t_a14_2 = _mm_mul_ps(self.twiddle3re, x2p29); let t_a14_3 = _mm_mul_ps(self.twiddle11re, x3p28); let t_a14_4 = _mm_mul_ps(self.twiddle6re, x4p27); let t_a14_5 = _mm_mul_ps(self.twiddle8re, x5p26); let t_a14_6 = _mm_mul_ps(self.twiddle9re, x6p25); let t_a14_7 = _mm_mul_ps(self.twiddle5re, x7p24); let t_a14_8 = _mm_mul_ps(self.twiddle12re, x8p23); let t_a14_9 = _mm_mul_ps(self.twiddle2re, x9p22); let t_a14_10 = _mm_mul_ps(self.twiddle15re, x10p21); let t_a14_11 = _mm_mul_ps(self.twiddle1re, x11p20); let t_a14_12 = _mm_mul_ps(self.twiddle13re, x12p19); let t_a14_13 = _mm_mul_ps(self.twiddle4re, x13p18); let t_a14_14 = _mm_mul_ps(self.twiddle10re, x14p17); let t_a14_15 = _mm_mul_ps(self.twiddle7re, x15p16); let t_a15_1 = _mm_mul_ps(self.twiddle15re, x1p30); let t_a15_2 = _mm_mul_ps(self.twiddle1re, x2p29); let t_a15_3 = _mm_mul_ps(self.twiddle14re, x3p28); let t_a15_4 = _mm_mul_ps(self.twiddle2re, x4p27); let t_a15_5 = _mm_mul_ps(self.twiddle13re, x5p26); let t_a15_6 = _mm_mul_ps(self.twiddle3re, x6p25); let t_a15_7 = _mm_mul_ps(self.twiddle12re, x7p24); let t_a15_8 = _mm_mul_ps(self.twiddle4re, x8p23); let t_a15_9 = _mm_mul_ps(self.twiddle11re, x9p22); let t_a15_10 = _mm_mul_ps(self.twiddle5re, x10p21); let t_a15_11 = _mm_mul_ps(self.twiddle10re, x11p20); let t_a15_12 = _mm_mul_ps(self.twiddle6re, x12p19); let t_a15_13 = _mm_mul_ps(self.twiddle9re, x13p18); let t_a15_14 = _mm_mul_ps(self.twiddle7re, x14p17); let t_a15_15 = _mm_mul_ps(self.twiddle8re, x15p16); let t_b1_1 = _mm_mul_ps(self.twiddle1im, x1m30); let t_b1_2 = _mm_mul_ps(self.twiddle2im, x2m29); let t_b1_3 = _mm_mul_ps(self.twiddle3im, x3m28); let t_b1_4 = _mm_mul_ps(self.twiddle4im, x4m27); let t_b1_5 = _mm_mul_ps(self.twiddle5im, x5m26); let t_b1_6 = _mm_mul_ps(self.twiddle6im, x6m25); let t_b1_7 = _mm_mul_ps(self.twiddle7im, x7m24); let t_b1_8 = _mm_mul_ps(self.twiddle8im, x8m23); let t_b1_9 = _mm_mul_ps(self.twiddle9im, x9m22); let t_b1_10 = _mm_mul_ps(self.twiddle10im, x10m21); let t_b1_11 = _mm_mul_ps(self.twiddle11im, x11m20); let t_b1_12 = _mm_mul_ps(self.twiddle12im, x12m19); let t_b1_13 = _mm_mul_ps(self.twiddle13im, x13m18); let t_b1_14 = _mm_mul_ps(self.twiddle14im, x14m17); let t_b1_15 = 
_mm_mul_ps(self.twiddle15im, x15m16); let t_b2_1 = _mm_mul_ps(self.twiddle2im, x1m30); let t_b2_2 = _mm_mul_ps(self.twiddle4im, x2m29); let t_b2_3 = _mm_mul_ps(self.twiddle6im, x3m28); let t_b2_4 = _mm_mul_ps(self.twiddle8im, x4m27); let t_b2_5 = _mm_mul_ps(self.twiddle10im, x5m26); let t_b2_6 = _mm_mul_ps(self.twiddle12im, x6m25); let t_b2_7 = _mm_mul_ps(self.twiddle14im, x7m24); let t_b2_8 = _mm_mul_ps(self.twiddle15im, x8m23); let t_b2_9 = _mm_mul_ps(self.twiddle13im, x9m22); let t_b2_10 = _mm_mul_ps(self.twiddle11im, x10m21); let t_b2_11 = _mm_mul_ps(self.twiddle9im, x11m20); let t_b2_12 = _mm_mul_ps(self.twiddle7im, x12m19); let t_b2_13 = _mm_mul_ps(self.twiddle5im, x13m18); let t_b2_14 = _mm_mul_ps(self.twiddle3im, x14m17); let t_b2_15 = _mm_mul_ps(self.twiddle1im, x15m16); let t_b3_1 = _mm_mul_ps(self.twiddle3im, x1m30); let t_b3_2 = _mm_mul_ps(self.twiddle6im, x2m29); let t_b3_3 = _mm_mul_ps(self.twiddle9im, x3m28); let t_b3_4 = _mm_mul_ps(self.twiddle12im, x4m27); let t_b3_5 = _mm_mul_ps(self.twiddle15im, x5m26); let t_b3_6 = _mm_mul_ps(self.twiddle13im, x6m25); let t_b3_7 = _mm_mul_ps(self.twiddle10im, x7m24); let t_b3_8 = _mm_mul_ps(self.twiddle7im, x8m23); let t_b3_9 = _mm_mul_ps(self.twiddle4im, x9m22); let t_b3_10 = _mm_mul_ps(self.twiddle1im, x10m21); let t_b3_11 = _mm_mul_ps(self.twiddle2im, x11m20); let t_b3_12 = _mm_mul_ps(self.twiddle5im, x12m19); let t_b3_13 = _mm_mul_ps(self.twiddle8im, x13m18); let t_b3_14 = _mm_mul_ps(self.twiddle11im, x14m17); let t_b3_15 = _mm_mul_ps(self.twiddle14im, x15m16); let t_b4_1 = _mm_mul_ps(self.twiddle4im, x1m30); let t_b4_2 = _mm_mul_ps(self.twiddle8im, x2m29); let t_b4_3 = _mm_mul_ps(self.twiddle12im, x3m28); let t_b4_4 = _mm_mul_ps(self.twiddle15im, x4m27); let t_b4_5 = _mm_mul_ps(self.twiddle11im, x5m26); let t_b4_6 = _mm_mul_ps(self.twiddle7im, x6m25); let t_b4_7 = _mm_mul_ps(self.twiddle3im, x7m24); let t_b4_8 = _mm_mul_ps(self.twiddle1im, x8m23); let t_b4_9 = _mm_mul_ps(self.twiddle5im, x9m22); let t_b4_10 = _mm_mul_ps(self.twiddle9im, x10m21); let t_b4_11 = _mm_mul_ps(self.twiddle13im, x11m20); let t_b4_12 = _mm_mul_ps(self.twiddle14im, x12m19); let t_b4_13 = _mm_mul_ps(self.twiddle10im, x13m18); let t_b4_14 = _mm_mul_ps(self.twiddle6im, x14m17); let t_b4_15 = _mm_mul_ps(self.twiddle2im, x15m16); let t_b5_1 = _mm_mul_ps(self.twiddle5im, x1m30); let t_b5_2 = _mm_mul_ps(self.twiddle10im, x2m29); let t_b5_3 = _mm_mul_ps(self.twiddle15im, x3m28); let t_b5_4 = _mm_mul_ps(self.twiddle11im, x4m27); let t_b5_5 = _mm_mul_ps(self.twiddle6im, x5m26); let t_b5_6 = _mm_mul_ps(self.twiddle1im, x6m25); let t_b5_7 = _mm_mul_ps(self.twiddle4im, x7m24); let t_b5_8 = _mm_mul_ps(self.twiddle9im, x8m23); let t_b5_9 = _mm_mul_ps(self.twiddle14im, x9m22); let t_b5_10 = _mm_mul_ps(self.twiddle12im, x10m21); let t_b5_11 = _mm_mul_ps(self.twiddle7im, x11m20); let t_b5_12 = _mm_mul_ps(self.twiddle2im, x12m19); let t_b5_13 = _mm_mul_ps(self.twiddle3im, x13m18); let t_b5_14 = _mm_mul_ps(self.twiddle8im, x14m17); let t_b5_15 = _mm_mul_ps(self.twiddle13im, x15m16); let t_b6_1 = _mm_mul_ps(self.twiddle6im, x1m30); let t_b6_2 = _mm_mul_ps(self.twiddle12im, x2m29); let t_b6_3 = _mm_mul_ps(self.twiddle13im, x3m28); let t_b6_4 = _mm_mul_ps(self.twiddle7im, x4m27); let t_b6_5 = _mm_mul_ps(self.twiddle1im, x5m26); let t_b6_6 = _mm_mul_ps(self.twiddle5im, x6m25); let t_b6_7 = _mm_mul_ps(self.twiddle11im, x7m24); let t_b6_8 = _mm_mul_ps(self.twiddle14im, x8m23); let t_b6_9 = _mm_mul_ps(self.twiddle8im, x9m22); let t_b6_10 = _mm_mul_ps(self.twiddle2im, x10m21); let 
t_b6_11 = _mm_mul_ps(self.twiddle4im, x11m20); let t_b6_12 = _mm_mul_ps(self.twiddle10im, x12m19); let t_b6_13 = _mm_mul_ps(self.twiddle15im, x13m18); let t_b6_14 = _mm_mul_ps(self.twiddle9im, x14m17); let t_b6_15 = _mm_mul_ps(self.twiddle3im, x15m16); let t_b7_1 = _mm_mul_ps(self.twiddle7im, x1m30); let t_b7_2 = _mm_mul_ps(self.twiddle14im, x2m29); let t_b7_3 = _mm_mul_ps(self.twiddle10im, x3m28); let t_b7_4 = _mm_mul_ps(self.twiddle3im, x4m27); let t_b7_5 = _mm_mul_ps(self.twiddle4im, x5m26); let t_b7_6 = _mm_mul_ps(self.twiddle11im, x6m25); let t_b7_7 = _mm_mul_ps(self.twiddle13im, x7m24); let t_b7_8 = _mm_mul_ps(self.twiddle6im, x8m23); let t_b7_9 = _mm_mul_ps(self.twiddle1im, x9m22); let t_b7_10 = _mm_mul_ps(self.twiddle8im, x10m21); let t_b7_11 = _mm_mul_ps(self.twiddle15im, x11m20); let t_b7_12 = _mm_mul_ps(self.twiddle9im, x12m19); let t_b7_13 = _mm_mul_ps(self.twiddle2im, x13m18); let t_b7_14 = _mm_mul_ps(self.twiddle5im, x14m17); let t_b7_15 = _mm_mul_ps(self.twiddle12im, x15m16); let t_b8_1 = _mm_mul_ps(self.twiddle8im, x1m30); let t_b8_2 = _mm_mul_ps(self.twiddle15im, x2m29); let t_b8_3 = _mm_mul_ps(self.twiddle7im, x3m28); let t_b8_4 = _mm_mul_ps(self.twiddle1im, x4m27); let t_b8_5 = _mm_mul_ps(self.twiddle9im, x5m26); let t_b8_6 = _mm_mul_ps(self.twiddle14im, x6m25); let t_b8_7 = _mm_mul_ps(self.twiddle6im, x7m24); let t_b8_8 = _mm_mul_ps(self.twiddle2im, x8m23); let t_b8_9 = _mm_mul_ps(self.twiddle10im, x9m22); let t_b8_10 = _mm_mul_ps(self.twiddle13im, x10m21); let t_b8_11 = _mm_mul_ps(self.twiddle5im, x11m20); let t_b8_12 = _mm_mul_ps(self.twiddle3im, x12m19); let t_b8_13 = _mm_mul_ps(self.twiddle11im, x13m18); let t_b8_14 = _mm_mul_ps(self.twiddle12im, x14m17); let t_b8_15 = _mm_mul_ps(self.twiddle4im, x15m16); let t_b9_1 = _mm_mul_ps(self.twiddle9im, x1m30); let t_b9_2 = _mm_mul_ps(self.twiddle13im, x2m29); let t_b9_3 = _mm_mul_ps(self.twiddle4im, x3m28); let t_b9_4 = _mm_mul_ps(self.twiddle5im, x4m27); let t_b9_5 = _mm_mul_ps(self.twiddle14im, x5m26); let t_b9_6 = _mm_mul_ps(self.twiddle8im, x6m25); let t_b9_7 = _mm_mul_ps(self.twiddle1im, x7m24); let t_b9_8 = _mm_mul_ps(self.twiddle10im, x8m23); let t_b9_9 = _mm_mul_ps(self.twiddle12im, x9m22); let t_b9_10 = _mm_mul_ps(self.twiddle3im, x10m21); let t_b9_11 = _mm_mul_ps(self.twiddle6im, x11m20); let t_b9_12 = _mm_mul_ps(self.twiddle15im, x12m19); let t_b9_13 = _mm_mul_ps(self.twiddle7im, x13m18); let t_b9_14 = _mm_mul_ps(self.twiddle2im, x14m17); let t_b9_15 = _mm_mul_ps(self.twiddle11im, x15m16); let t_b10_1 = _mm_mul_ps(self.twiddle10im, x1m30); let t_b10_2 = _mm_mul_ps(self.twiddle11im, x2m29); let t_b10_3 = _mm_mul_ps(self.twiddle1im, x3m28); let t_b10_4 = _mm_mul_ps(self.twiddle9im, x4m27); let t_b10_5 = _mm_mul_ps(self.twiddle12im, x5m26); let t_b10_6 = _mm_mul_ps(self.twiddle2im, x6m25); let t_b10_7 = _mm_mul_ps(self.twiddle8im, x7m24); let t_b10_8 = _mm_mul_ps(self.twiddle13im, x8m23); let t_b10_9 = _mm_mul_ps(self.twiddle3im, x9m22); let t_b10_10 = _mm_mul_ps(self.twiddle7im, x10m21); let t_b10_11 = _mm_mul_ps(self.twiddle14im, x11m20); let t_b10_12 = _mm_mul_ps(self.twiddle4im, x12m19); let t_b10_13 = _mm_mul_ps(self.twiddle6im, x13m18); let t_b10_14 = _mm_mul_ps(self.twiddle15im, x14m17); let t_b10_15 = _mm_mul_ps(self.twiddle5im, x15m16); let t_b11_1 = _mm_mul_ps(self.twiddle11im, x1m30); let t_b11_2 = _mm_mul_ps(self.twiddle9im, x2m29); let t_b11_3 = _mm_mul_ps(self.twiddle2im, x3m28); let t_b11_4 = _mm_mul_ps(self.twiddle13im, x4m27); let t_b11_5 = _mm_mul_ps(self.twiddle7im, x5m26); let t_b11_6 = 
_mm_mul_ps(self.twiddle4im, x6m25); let t_b11_7 = _mm_mul_ps(self.twiddle15im, x7m24); let t_b11_8 = _mm_mul_ps(self.twiddle5im, x8m23); let t_b11_9 = _mm_mul_ps(self.twiddle6im, x9m22); let t_b11_10 = _mm_mul_ps(self.twiddle14im, x10m21); let t_b11_11 = _mm_mul_ps(self.twiddle3im, x11m20); let t_b11_12 = _mm_mul_ps(self.twiddle8im, x12m19); let t_b11_13 = _mm_mul_ps(self.twiddle12im, x13m18); let t_b11_14 = _mm_mul_ps(self.twiddle1im, x14m17); let t_b11_15 = _mm_mul_ps(self.twiddle10im, x15m16); let t_b12_1 = _mm_mul_ps(self.twiddle12im, x1m30); let t_b12_2 = _mm_mul_ps(self.twiddle7im, x2m29); let t_b12_3 = _mm_mul_ps(self.twiddle5im, x3m28); let t_b12_4 = _mm_mul_ps(self.twiddle14im, x4m27); let t_b12_5 = _mm_mul_ps(self.twiddle2im, x5m26); let t_b12_6 = _mm_mul_ps(self.twiddle10im, x6m25); let t_b12_7 = _mm_mul_ps(self.twiddle9im, x7m24); let t_b12_8 = _mm_mul_ps(self.twiddle3im, x8m23); let t_b12_9 = _mm_mul_ps(self.twiddle15im, x9m22); let t_b12_10 = _mm_mul_ps(self.twiddle4im, x10m21); let t_b12_11 = _mm_mul_ps(self.twiddle8im, x11m20); let t_b12_12 = _mm_mul_ps(self.twiddle11im, x12m19); let t_b12_13 = _mm_mul_ps(self.twiddle1im, x13m18); let t_b12_14 = _mm_mul_ps(self.twiddle13im, x14m17); let t_b12_15 = _mm_mul_ps(self.twiddle6im, x15m16); let t_b13_1 = _mm_mul_ps(self.twiddle13im, x1m30); let t_b13_2 = _mm_mul_ps(self.twiddle5im, x2m29); let t_b13_3 = _mm_mul_ps(self.twiddle8im, x3m28); let t_b13_4 = _mm_mul_ps(self.twiddle10im, x4m27); let t_b13_5 = _mm_mul_ps(self.twiddle3im, x5m26); let t_b13_6 = _mm_mul_ps(self.twiddle15im, x6m25); let t_b13_7 = _mm_mul_ps(self.twiddle2im, x7m24); let t_b13_8 = _mm_mul_ps(self.twiddle11im, x8m23); let t_b13_9 = _mm_mul_ps(self.twiddle7im, x9m22); let t_b13_10 = _mm_mul_ps(self.twiddle6im, x10m21); let t_b13_11 = _mm_mul_ps(self.twiddle12im, x11m20); let t_b13_12 = _mm_mul_ps(self.twiddle1im, x12m19); let t_b13_13 = _mm_mul_ps(self.twiddle14im, x13m18); let t_b13_14 = _mm_mul_ps(self.twiddle4im, x14m17); let t_b13_15 = _mm_mul_ps(self.twiddle9im, x15m16); let t_b14_1 = _mm_mul_ps(self.twiddle14im, x1m30); let t_b14_2 = _mm_mul_ps(self.twiddle3im, x2m29); let t_b14_3 = _mm_mul_ps(self.twiddle11im, x3m28); let t_b14_4 = _mm_mul_ps(self.twiddle6im, x4m27); let t_b14_5 = _mm_mul_ps(self.twiddle8im, x5m26); let t_b14_6 = _mm_mul_ps(self.twiddle9im, x6m25); let t_b14_7 = _mm_mul_ps(self.twiddle5im, x7m24); let t_b14_8 = _mm_mul_ps(self.twiddle12im, x8m23); let t_b14_9 = _mm_mul_ps(self.twiddle2im, x9m22); let t_b14_10 = _mm_mul_ps(self.twiddle15im, x10m21); let t_b14_11 = _mm_mul_ps(self.twiddle1im, x11m20); let t_b14_12 = _mm_mul_ps(self.twiddle13im, x12m19); let t_b14_13 = _mm_mul_ps(self.twiddle4im, x13m18); let t_b14_14 = _mm_mul_ps(self.twiddle10im, x14m17); let t_b14_15 = _mm_mul_ps(self.twiddle7im, x15m16); let t_b15_1 = _mm_mul_ps(self.twiddle15im, x1m30); let t_b15_2 = _mm_mul_ps(self.twiddle1im, x2m29); let t_b15_3 = _mm_mul_ps(self.twiddle14im, x3m28); let t_b15_4 = _mm_mul_ps(self.twiddle2im, x4m27); let t_b15_5 = _mm_mul_ps(self.twiddle13im, x5m26); let t_b15_6 = _mm_mul_ps(self.twiddle3im, x6m25); let t_b15_7 = _mm_mul_ps(self.twiddle12im, x7m24); let t_b15_8 = _mm_mul_ps(self.twiddle4im, x8m23); let t_b15_9 = _mm_mul_ps(self.twiddle11im, x9m22); let t_b15_10 = _mm_mul_ps(self.twiddle5im, x10m21); let t_b15_11 = _mm_mul_ps(self.twiddle10im, x11m20); let t_b15_12 = _mm_mul_ps(self.twiddle6im, x12m19); let t_b15_13 = _mm_mul_ps(self.twiddle9im, x13m18); let t_b15_14 = _mm_mul_ps(self.twiddle7im, x14m17); let t_b15_15 = 
_mm_mul_ps(self.twiddle8im, x15m16); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14 + t_a1_15); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14 + t_a2_15); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14 + t_a3_15); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14 + t_a4_15); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14 + t_a5_15); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14 + t_a6_15); let t_a7 = calc_f32!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14 + t_a7_15); let t_a8 = calc_f32!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14 + t_a8_15); let t_a9 = calc_f32!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14 + t_a9_15); let t_a10 = calc_f32!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14 + t_a10_15); let t_a11 = calc_f32!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14 + t_a11_15); let t_a12 = calc_f32!(x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14 + t_a12_15); let t_a13 = calc_f32!(x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14 + t_a13_15); let t_a14 = calc_f32!(x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14 + t_a14_15); let t_a15 = calc_f32!(x0 + t_a15_1 + t_a15_2 + t_a15_3 + t_a15_4 + t_a15_5 + t_a15_6 + t_a15_7 + t_a15_8 + t_a15_9 + t_a15_10 + t_a15_11 + t_a15_12 + t_a15_13 + t_a15_14 + t_a15_15); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14 + t_b1_15); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14 - t_b2_15); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 + t_b3_3 + t_b3_4 + t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 - t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14 + t_b3_15); let t_b4 = calc_f32!(t_b4_1 + t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + t_b4_9 + t_b4_10 + t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14 - t_b4_15); let t_b5 = calc_f32!(t_b5_1 + t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8 + t_b5_9 - t_b5_10 - t_b5_11 - t_b5_12 + t_b5_13 + t_b5_14 + t_b5_15); let t_b6 = calc_f32!(t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - 
t_b6_8 - t_b6_9 - t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14 - t_b6_15); let t_b7 = calc_f32!(t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 + t_b7_11 - t_b7_12 - t_b7_13 + t_b7_14 + t_b7_15); let t_b8 = calc_f32!(t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 - t_b8_11 + t_b8_12 + t_b8_13 - t_b8_14 - t_b8_15); let t_b9 = calc_f32!(t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11 + t_b9_12 - t_b9_13 + t_b9_14 + t_b9_15); let t_b10 = calc_f32!(t_b10_1 - t_b10_2 - t_b10_3 + t_b10_4 - t_b10_5 - t_b10_6 + t_b10_7 - t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11 - t_b10_12 + t_b10_13 - t_b10_14 - t_b10_15); let t_b11 = calc_f32!(t_b11_1 - t_b11_2 + t_b11_3 + t_b11_4 - t_b11_5 + t_b11_6 + t_b11_7 - t_b11_8 + t_b11_9 - t_b11_10 - t_b11_11 + t_b11_12 - t_b11_13 - t_b11_14 + t_b11_15); let t_b12 = calc_f32!(t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 - t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 + t_b12_9 - t_b12_10 + t_b12_11 - t_b12_12 + t_b12_13 + t_b12_14 - t_b12_15); let t_b13 = calc_f32!(t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 - t_b13_7 + t_b13_8 - t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 + t_b13_13 - t_b13_14 + t_b13_15); let t_b14 = calc_f32!(t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 - t_b14_11 + t_b14_12 - t_b14_13 + t_b14_14 - t_b14_15); let t_b15 = calc_f32!(t_b15_1 - t_b15_2 + t_b15_3 - t_b15_4 + t_b15_5 - t_b15_6 + t_b15_7 - t_b15_8 + t_b15_9 - t_b15_10 + t_b15_11 - t_b15_12 + t_b15_13 - t_b15_14 + t_b15_15); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let t_b9_rot = self.rotate.rotate_both(t_b9); let t_b10_rot = self.rotate.rotate_both(t_b10); let t_b11_rot = self.rotate.rotate_both(t_b11); let t_b12_rot = self.rotate.rotate_both(t_b12); let t_b13_rot = self.rotate.rotate_both(t_b13); let t_b14_rot = self.rotate.rotate_both(t_b14); let t_b15_rot = self.rotate.rotate_both(t_b15); let y0 = calc_f32!(x0 + x1p30 + x2p29 + x3p28 + x4p27 + x5p26 + x6p25 + x7p24 + x8p23 + x9p22 + x10p21 + x11p20 + x12p19 + x13p18 + x14p17 + x15p16); let [y1, y30] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y29] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y28] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y27] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y26] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y25] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); let [y7, y24] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot); let [y8, y23] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot); let [y9, y22] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot); let [y10, y21] = parallel_fft2_interleaved_f32(t_a10, t_b10_rot); let [y11, y20] = parallel_fft2_interleaved_f32(t_a11, t_b11_rot); let [y12, y19] = parallel_fft2_interleaved_f32(t_a12, t_b12_rot); let [y13, y18] = parallel_fft2_interleaved_f32(t_a13, t_b13_rot); let [y14, y17] = parallel_fft2_interleaved_f32(t_a14, t_b14_rot); let [y15, y16] = parallel_fft2_interleaved_f32(t_a15, t_b15_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, 
y23, y24, y25, y26, y27, y28, y29, y30]
    }
}

//   _____ _            __   _  _   _     _ _
//  |___ // |          / /_ | || | | |__ (_) |_
//    |_ \| |  _____  | '_ \| || |_| '_ \| | __|
//   ___) | | |_____| | (_) |__  _| |_) | | |_
//  |____/|_|          \___/   |_| |_.__/|_|\__|
//

pub struct SseF64Butterfly31<T> {
    direction: FftDirection,
    _phantom: std::marker::PhantomData<T>,
    rotate: Rotate90F64,
    twiddle1re: __m128d,
    twiddle1im: __m128d,
    twiddle2re: __m128d,
    twiddle2im: __m128d,
    twiddle3re: __m128d,
    twiddle3im: __m128d,
    twiddle4re: __m128d,
    twiddle4im: __m128d,
    twiddle5re: __m128d,
    twiddle5im: __m128d,
    twiddle6re: __m128d,
    twiddle6im: __m128d,
    twiddle7re: __m128d,
    twiddle7im: __m128d,
    twiddle8re: __m128d,
    twiddle8im: __m128d,
    twiddle9re: __m128d,
    twiddle9im: __m128d,
    twiddle10re: __m128d,
    twiddle10im: __m128d,
    twiddle11re: __m128d,
    twiddle11im: __m128d,
    twiddle12re: __m128d,
    twiddle12im: __m128d,
    twiddle13re: __m128d,
    twiddle13im: __m128d,
    twiddle14re: __m128d,
    twiddle14im: __m128d,
    twiddle15re: __m128d,
    twiddle15im: __m128d,
}

boilerplate_fft_sse_f64_butterfly!(SseF64Butterfly31, 31, |this: &SseF64Butterfly31<_>| this
    .direction);
boilerplate_fft_sse_common_butterfly!(SseF64Butterfly31, 31, |this: &SseF64Butterfly31<_>| this
    .direction);
impl<T: FftNum> SseF64Butterfly31<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        assert_f64::<T>();
        let rotate = Rotate90F64::new(true);
        let tw1: Complex<f64> = twiddles::compute_twiddle(1, 31, direction);
        let tw2: Complex<f64> = twiddles::compute_twiddle(2, 31, direction);
        let tw3: Complex<f64> = twiddles::compute_twiddle(3, 31, direction);
        let tw4: Complex<f64> = twiddles::compute_twiddle(4, 31, direction);
        let tw5: Complex<f64> = twiddles::compute_twiddle(5, 31, direction);
        let tw6: Complex<f64> = twiddles::compute_twiddle(6, 31, direction);
        let tw7: Complex<f64> = twiddles::compute_twiddle(7, 31, direction);
        let tw8: Complex<f64> = twiddles::compute_twiddle(8, 31, direction);
        let tw9: Complex<f64> = twiddles::compute_twiddle(9, 31, direction);
        let tw10: Complex<f64> = twiddles::compute_twiddle(10, 31, direction);
        let tw11: Complex<f64> = twiddles::compute_twiddle(11, 31, direction);
        let tw12: Complex<f64> = twiddles::compute_twiddle(12, 31, direction);
        let tw13: Complex<f64> = twiddles::compute_twiddle(13, 31, direction);
        let tw14: Complex<f64> = twiddles::compute_twiddle(14, 31, direction);
        let tw15: Complex<f64> = twiddles::compute_twiddle(15, 31, direction);
        let twiddle1re = unsafe { _mm_set_pd(tw1.re, tw1.re) };
        let twiddle1im = unsafe { _mm_set_pd(tw1.im, tw1.im) };
        let twiddle2re = unsafe { _mm_set_pd(tw2.re, tw2.re) };
        let twiddle2im = unsafe { _mm_set_pd(tw2.im, tw2.im) };
        let twiddle3re = unsafe { _mm_set_pd(tw3.re, tw3.re) };
        let twiddle3im = unsafe { _mm_set_pd(tw3.im, tw3.im) };
        let twiddle4re = unsafe { _mm_set_pd(tw4.re, tw4.re) };
        let twiddle4im = unsafe { _mm_set_pd(tw4.im, tw4.im) };
        let twiddle5re = unsafe { _mm_set_pd(tw5.re, tw5.re) };
        let twiddle5im = unsafe { _mm_set_pd(tw5.im, tw5.im) };
        let twiddle6re = unsafe { _mm_set_pd(tw6.re, tw6.re) };
        let twiddle6im = unsafe { _mm_set_pd(tw6.im, tw6.im) };
        let twiddle7re = unsafe { _mm_set_pd(tw7.re, tw7.re) };
        let twiddle7im = unsafe { _mm_set_pd(tw7.im, tw7.im) };
        let twiddle8re = unsafe { _mm_set_pd(tw8.re, tw8.re) };
        let twiddle8im = unsafe { _mm_set_pd(tw8.im, tw8.im) };
        let twiddle9re = unsafe { _mm_set_pd(tw9.re, tw9.re) };
        let twiddle9im = unsafe { _mm_set_pd(tw9.im, tw9.im) };
        let twiddle10re = unsafe { _mm_set_pd(tw10.re, tw10.re) };
        let twiddle10im = unsafe { _mm_set_pd(tw10.im, tw10.im) };
        let twiddle11re = unsafe { _mm_set_pd(tw11.re, tw11.re) };
        let twiddle11im = unsafe { _mm_set_pd(tw11.im, tw11.im) };
        let twiddle12re = unsafe { _mm_set_pd(tw12.re, tw12.re) };
        let twiddle12im = unsafe { _mm_set_pd(tw12.im, tw12.im) };
        let twiddle13re = unsafe { _mm_set_pd(tw13.re, tw13.re) };
        let twiddle13im = unsafe { _mm_set_pd(tw13.im, tw13.im) };
        let twiddle14re = unsafe { _mm_set_pd(tw14.re, tw14.re) };
        let twiddle14im = unsafe { _mm_set_pd(tw14.im, tw14.im) };
        let twiddle15re = unsafe { _mm_set_pd(tw15.re, tw15.re) };
        let twiddle15im = unsafe { _mm_set_pd(tw15.im, tw15.im) };
        Self {
            direction,
            _phantom: std::marker::PhantomData,
            rotate,
            twiddle1re,
            twiddle1im,
            twiddle2re,
            twiddle2im,
            twiddle3re,
            twiddle3im,
            twiddle4re,
            twiddle4im,
            twiddle5re,
            twiddle5im,
            twiddle6re,
            twiddle6im,
            twiddle7re,
            twiddle7im,
            twiddle8re,
            twiddle8im,
            twiddle9re,
            twiddle9im,
            twiddle10re,
            twiddle10im,
            twiddle11re,
            twiddle11im,
            twiddle12re,
            twiddle12im,
            twiddle13re,
            twiddle13im,
            twiddle14re,
            twiddle14im,
            twiddle15re,
            twiddle15im,
        }
    }

    #[inline(always)]
    pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl SseArrayMut<f64>) {
        let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30});
        let out = self.perform_fft_direct(values);
        write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30});
    }

    #[inline(always)]
    pub(crate) unsafe fn perform_fft_direct(&self, values: [__m128d; 31]) -> [__m128d; 31] {
        let [x1p30, x1m30] = solo_fft2_f64(values[1], values[30]);
        let [x2p29, x2m29] = solo_fft2_f64(values[2], values[29]);
        let [x3p28, x3m28] = solo_fft2_f64(values[3], values[28]);
        let [x4p27, x4m27] = solo_fft2_f64(values[4], values[27]);
        let [x5p26, x5m26] = solo_fft2_f64(values[5], values[26]);
        let [x6p25, x6m25] = solo_fft2_f64(values[6], values[25]);
        let [x7p24, x7m24] = solo_fft2_f64(values[7], values[24]);
        let [x8p23, x8m23] = solo_fft2_f64(values[8], values[23]);
        let [x9p22, x9m22] = solo_fft2_f64(values[9], values[22]);
        let [x10p21, x10m21] = solo_fft2_f64(values[10], values[21]);
        let [x11p20, x11m20] = solo_fft2_f64(values[11], values[20]);
        let [x12p19, x12m19] = solo_fft2_f64(values[12], values[19]);
        let [x13p18, x13m18] = solo_fft2_f64(values[13], values[18]);
        let [x14p17, x14m17] = solo_fft2_f64(values[14], values[17]);
        let [x15p16, x15m16] = solo_fft2_f64(values[15], values[16]);
        let t_a1_1 = _mm_mul_pd(self.twiddle1re, x1p30);
        let t_a1_2 = _mm_mul_pd(self.twiddle2re, x2p29);
        let t_a1_3 = _mm_mul_pd(self.twiddle3re, x3p28);
        let t_a1_4 = _mm_mul_pd(self.twiddle4re, x4p27);
        let t_a1_5 = _mm_mul_pd(self.twiddle5re, x5p26);
        let t_a1_6 = _mm_mul_pd(self.twiddle6re, x6p25);
        let t_a1_7 = _mm_mul_pd(self.twiddle7re, x7p24);
        let t_a1_8 = _mm_mul_pd(self.twiddle8re, x8p23);
        let t_a1_9 = _mm_mul_pd(self.twiddle9re, x9p22);
        let t_a1_10 = _mm_mul_pd(self.twiddle10re, x10p21);
        let t_a1_11 = _mm_mul_pd(self.twiddle11re, x11p20);
        let t_a1_12 = _mm_mul_pd(self.twiddle12re, x12p19);
        let t_a1_13 = _mm_mul_pd(self.twiddle13re, x13p18);
        let t_a1_14 = _mm_mul_pd(self.twiddle14re, x14p17);
        let t_a1_15 = _mm_mul_pd(self.twiddle15re, x15p16);
        let t_a2_1 = _mm_mul_pd(self.twiddle2re, x1p30);
        let t_a2_2 = _mm_mul_pd(self.twiddle4re, x2p29);
        let t_a2_3 = _mm_mul_pd(self.twiddle6re, x3p28);
        let t_a2_4 = _mm_mul_pd(self.twiddle8re, x4p27);
        let t_a2_5 = _mm_mul_pd(self.twiddle10re, x5p26);
        let t_a2_6 = _mm_mul_pd(self.twiddle12re, x6p25);
        let t_a2_7 = _mm_mul_pd(self.twiddle14re, x7p24);
        let t_a2_8 = _mm_mul_pd(self.twiddle15re, x8p23);
        let t_a2_9 =
_mm_mul_pd(self.twiddle13re, x9p22); let t_a2_10 = _mm_mul_pd(self.twiddle11re, x10p21); let t_a2_11 = _mm_mul_pd(self.twiddle9re, x11p20); let t_a2_12 = _mm_mul_pd(self.twiddle7re, x12p19); let t_a2_13 = _mm_mul_pd(self.twiddle5re, x13p18); let t_a2_14 = _mm_mul_pd(self.twiddle3re, x14p17); let t_a2_15 = _mm_mul_pd(self.twiddle1re, x15p16); let t_a3_1 = _mm_mul_pd(self.twiddle3re, x1p30); let t_a3_2 = _mm_mul_pd(self.twiddle6re, x2p29); let t_a3_3 = _mm_mul_pd(self.twiddle9re, x3p28); let t_a3_4 = _mm_mul_pd(self.twiddle12re, x4p27); let t_a3_5 = _mm_mul_pd(self.twiddle15re, x5p26); let t_a3_6 = _mm_mul_pd(self.twiddle13re, x6p25); let t_a3_7 = _mm_mul_pd(self.twiddle10re, x7p24); let t_a3_8 = _mm_mul_pd(self.twiddle7re, x8p23); let t_a3_9 = _mm_mul_pd(self.twiddle4re, x9p22); let t_a3_10 = _mm_mul_pd(self.twiddle1re, x10p21); let t_a3_11 = _mm_mul_pd(self.twiddle2re, x11p20); let t_a3_12 = _mm_mul_pd(self.twiddle5re, x12p19); let t_a3_13 = _mm_mul_pd(self.twiddle8re, x13p18); let t_a3_14 = _mm_mul_pd(self.twiddle11re, x14p17); let t_a3_15 = _mm_mul_pd(self.twiddle14re, x15p16); let t_a4_1 = _mm_mul_pd(self.twiddle4re, x1p30); let t_a4_2 = _mm_mul_pd(self.twiddle8re, x2p29); let t_a4_3 = _mm_mul_pd(self.twiddle12re, x3p28); let t_a4_4 = _mm_mul_pd(self.twiddle15re, x4p27); let t_a4_5 = _mm_mul_pd(self.twiddle11re, x5p26); let t_a4_6 = _mm_mul_pd(self.twiddle7re, x6p25); let t_a4_7 = _mm_mul_pd(self.twiddle3re, x7p24); let t_a4_8 = _mm_mul_pd(self.twiddle1re, x8p23); let t_a4_9 = _mm_mul_pd(self.twiddle5re, x9p22); let t_a4_10 = _mm_mul_pd(self.twiddle9re, x10p21); let t_a4_11 = _mm_mul_pd(self.twiddle13re, x11p20); let t_a4_12 = _mm_mul_pd(self.twiddle14re, x12p19); let t_a4_13 = _mm_mul_pd(self.twiddle10re, x13p18); let t_a4_14 = _mm_mul_pd(self.twiddle6re, x14p17); let t_a4_15 = _mm_mul_pd(self.twiddle2re, x15p16); let t_a5_1 = _mm_mul_pd(self.twiddle5re, x1p30); let t_a5_2 = _mm_mul_pd(self.twiddle10re, x2p29); let t_a5_3 = _mm_mul_pd(self.twiddle15re, x3p28); let t_a5_4 = _mm_mul_pd(self.twiddle11re, x4p27); let t_a5_5 = _mm_mul_pd(self.twiddle6re, x5p26); let t_a5_6 = _mm_mul_pd(self.twiddle1re, x6p25); let t_a5_7 = _mm_mul_pd(self.twiddle4re, x7p24); let t_a5_8 = _mm_mul_pd(self.twiddle9re, x8p23); let t_a5_9 = _mm_mul_pd(self.twiddle14re, x9p22); let t_a5_10 = _mm_mul_pd(self.twiddle12re, x10p21); let t_a5_11 = _mm_mul_pd(self.twiddle7re, x11p20); let t_a5_12 = _mm_mul_pd(self.twiddle2re, x12p19); let t_a5_13 = _mm_mul_pd(self.twiddle3re, x13p18); let t_a5_14 = _mm_mul_pd(self.twiddle8re, x14p17); let t_a5_15 = _mm_mul_pd(self.twiddle13re, x15p16); let t_a6_1 = _mm_mul_pd(self.twiddle6re, x1p30); let t_a6_2 = _mm_mul_pd(self.twiddle12re, x2p29); let t_a6_3 = _mm_mul_pd(self.twiddle13re, x3p28); let t_a6_4 = _mm_mul_pd(self.twiddle7re, x4p27); let t_a6_5 = _mm_mul_pd(self.twiddle1re, x5p26); let t_a6_6 = _mm_mul_pd(self.twiddle5re, x6p25); let t_a6_7 = _mm_mul_pd(self.twiddle11re, x7p24); let t_a6_8 = _mm_mul_pd(self.twiddle14re, x8p23); let t_a6_9 = _mm_mul_pd(self.twiddle8re, x9p22); let t_a6_10 = _mm_mul_pd(self.twiddle2re, x10p21); let t_a6_11 = _mm_mul_pd(self.twiddle4re, x11p20); let t_a6_12 = _mm_mul_pd(self.twiddle10re, x12p19); let t_a6_13 = _mm_mul_pd(self.twiddle15re, x13p18); let t_a6_14 = _mm_mul_pd(self.twiddle9re, x14p17); let t_a6_15 = _mm_mul_pd(self.twiddle3re, x15p16); let t_a7_1 = _mm_mul_pd(self.twiddle7re, x1p30); let t_a7_2 = _mm_mul_pd(self.twiddle14re, x2p29); let t_a7_3 = _mm_mul_pd(self.twiddle10re, x3p28); let t_a7_4 = _mm_mul_pd(self.twiddle3re, x4p27); 
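// Compared to the f32 butterfly above, which packs two interleaved complex
// values into each __m128 and computes two transforms at once, this f64
// butterfly holds a single Complex<f64> per __m128d. The pair combination
// therefore uses solo_fft2_f64 instead of parallel_fft2_interleaved_f32, and
// each t_b<n> is rotated with `rotate` rather than `rotate_both` before the
// final radix-2 step that produces the output pair, mirroring the
// `let [y1, y28] = solo_fft2_f64(t_a1, t_b1_rot);` step in the size-29
// butterfly earlier in this file.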
let t_a7_5 = _mm_mul_pd(self.twiddle4re, x5p26); let t_a7_6 = _mm_mul_pd(self.twiddle11re, x6p25); let t_a7_7 = _mm_mul_pd(self.twiddle13re, x7p24); let t_a7_8 = _mm_mul_pd(self.twiddle6re, x8p23); let t_a7_9 = _mm_mul_pd(self.twiddle1re, x9p22); let t_a7_10 = _mm_mul_pd(self.twiddle8re, x10p21); let t_a7_11 = _mm_mul_pd(self.twiddle15re, x11p20); let t_a7_12 = _mm_mul_pd(self.twiddle9re, x12p19); let t_a7_13 = _mm_mul_pd(self.twiddle2re, x13p18); let t_a7_14 = _mm_mul_pd(self.twiddle5re, x14p17); let t_a7_15 = _mm_mul_pd(self.twiddle12re, x15p16); let t_a8_1 = _mm_mul_pd(self.twiddle8re, x1p30); let t_a8_2 = _mm_mul_pd(self.twiddle15re, x2p29); let t_a8_3 = _mm_mul_pd(self.twiddle7re, x3p28); let t_a8_4 = _mm_mul_pd(self.twiddle1re, x4p27); let t_a8_5 = _mm_mul_pd(self.twiddle9re, x5p26); let t_a8_6 = _mm_mul_pd(self.twiddle14re, x6p25); let t_a8_7 = _mm_mul_pd(self.twiddle6re, x7p24); let t_a8_8 = _mm_mul_pd(self.twiddle2re, x8p23); let t_a8_9 = _mm_mul_pd(self.twiddle10re, x9p22); let t_a8_10 = _mm_mul_pd(self.twiddle13re, x10p21); let t_a8_11 = _mm_mul_pd(self.twiddle5re, x11p20); let t_a8_12 = _mm_mul_pd(self.twiddle3re, x12p19); let t_a8_13 = _mm_mul_pd(self.twiddle11re, x13p18); let t_a8_14 = _mm_mul_pd(self.twiddle12re, x14p17); let t_a8_15 = _mm_mul_pd(self.twiddle4re, x15p16); let t_a9_1 = _mm_mul_pd(self.twiddle9re, x1p30); let t_a9_2 = _mm_mul_pd(self.twiddle13re, x2p29); let t_a9_3 = _mm_mul_pd(self.twiddle4re, x3p28); let t_a9_4 = _mm_mul_pd(self.twiddle5re, x4p27); let t_a9_5 = _mm_mul_pd(self.twiddle14re, x5p26); let t_a9_6 = _mm_mul_pd(self.twiddle8re, x6p25); let t_a9_7 = _mm_mul_pd(self.twiddle1re, x7p24); let t_a9_8 = _mm_mul_pd(self.twiddle10re, x8p23); let t_a9_9 = _mm_mul_pd(self.twiddle12re, x9p22); let t_a9_10 = _mm_mul_pd(self.twiddle3re, x10p21); let t_a9_11 = _mm_mul_pd(self.twiddle6re, x11p20); let t_a9_12 = _mm_mul_pd(self.twiddle15re, x12p19); let t_a9_13 = _mm_mul_pd(self.twiddle7re, x13p18); let t_a9_14 = _mm_mul_pd(self.twiddle2re, x14p17); let t_a9_15 = _mm_mul_pd(self.twiddle11re, x15p16); let t_a10_1 = _mm_mul_pd(self.twiddle10re, x1p30); let t_a10_2 = _mm_mul_pd(self.twiddle11re, x2p29); let t_a10_3 = _mm_mul_pd(self.twiddle1re, x3p28); let t_a10_4 = _mm_mul_pd(self.twiddle9re, x4p27); let t_a10_5 = _mm_mul_pd(self.twiddle12re, x5p26); let t_a10_6 = _mm_mul_pd(self.twiddle2re, x6p25); let t_a10_7 = _mm_mul_pd(self.twiddle8re, x7p24); let t_a10_8 = _mm_mul_pd(self.twiddle13re, x8p23); let t_a10_9 = _mm_mul_pd(self.twiddle3re, x9p22); let t_a10_10 = _mm_mul_pd(self.twiddle7re, x10p21); let t_a10_11 = _mm_mul_pd(self.twiddle14re, x11p20); let t_a10_12 = _mm_mul_pd(self.twiddle4re, x12p19); let t_a10_13 = _mm_mul_pd(self.twiddle6re, x13p18); let t_a10_14 = _mm_mul_pd(self.twiddle15re, x14p17); let t_a10_15 = _mm_mul_pd(self.twiddle5re, x15p16); let t_a11_1 = _mm_mul_pd(self.twiddle11re, x1p30); let t_a11_2 = _mm_mul_pd(self.twiddle9re, x2p29); let t_a11_3 = _mm_mul_pd(self.twiddle2re, x3p28); let t_a11_4 = _mm_mul_pd(self.twiddle13re, x4p27); let t_a11_5 = _mm_mul_pd(self.twiddle7re, x5p26); let t_a11_6 = _mm_mul_pd(self.twiddle4re, x6p25); let t_a11_7 = _mm_mul_pd(self.twiddle15re, x7p24); let t_a11_8 = _mm_mul_pd(self.twiddle5re, x8p23); let t_a11_9 = _mm_mul_pd(self.twiddle6re, x9p22); let t_a11_10 = _mm_mul_pd(self.twiddle14re, x10p21); let t_a11_11 = _mm_mul_pd(self.twiddle3re, x11p20); let t_a11_12 = _mm_mul_pd(self.twiddle8re, x12p19); let t_a11_13 = _mm_mul_pd(self.twiddle12re, x13p18); let t_a11_14 = _mm_mul_pd(self.twiddle1re, x14p17); let 
t_a11_15 = _mm_mul_pd(self.twiddle10re, x15p16); let t_a12_1 = _mm_mul_pd(self.twiddle12re, x1p30); let t_a12_2 = _mm_mul_pd(self.twiddle7re, x2p29); let t_a12_3 = _mm_mul_pd(self.twiddle5re, x3p28); let t_a12_4 = _mm_mul_pd(self.twiddle14re, x4p27); let t_a12_5 = _mm_mul_pd(self.twiddle2re, x5p26); let t_a12_6 = _mm_mul_pd(self.twiddle10re, x6p25); let t_a12_7 = _mm_mul_pd(self.twiddle9re, x7p24); let t_a12_8 = _mm_mul_pd(self.twiddle3re, x8p23); let t_a12_9 = _mm_mul_pd(self.twiddle15re, x9p22); let t_a12_10 = _mm_mul_pd(self.twiddle4re, x10p21); let t_a12_11 = _mm_mul_pd(self.twiddle8re, x11p20); let t_a12_12 = _mm_mul_pd(self.twiddle11re, x12p19); let t_a12_13 = _mm_mul_pd(self.twiddle1re, x13p18); let t_a12_14 = _mm_mul_pd(self.twiddle13re, x14p17); let t_a12_15 = _mm_mul_pd(self.twiddle6re, x15p16); let t_a13_1 = _mm_mul_pd(self.twiddle13re, x1p30); let t_a13_2 = _mm_mul_pd(self.twiddle5re, x2p29); let t_a13_3 = _mm_mul_pd(self.twiddle8re, x3p28); let t_a13_4 = _mm_mul_pd(self.twiddle10re, x4p27); let t_a13_5 = _mm_mul_pd(self.twiddle3re, x5p26); let t_a13_6 = _mm_mul_pd(self.twiddle15re, x6p25); let t_a13_7 = _mm_mul_pd(self.twiddle2re, x7p24); let t_a13_8 = _mm_mul_pd(self.twiddle11re, x8p23); let t_a13_9 = _mm_mul_pd(self.twiddle7re, x9p22); let t_a13_10 = _mm_mul_pd(self.twiddle6re, x10p21); let t_a13_11 = _mm_mul_pd(self.twiddle12re, x11p20); let t_a13_12 = _mm_mul_pd(self.twiddle1re, x12p19); let t_a13_13 = _mm_mul_pd(self.twiddle14re, x13p18); let t_a13_14 = _mm_mul_pd(self.twiddle4re, x14p17); let t_a13_15 = _mm_mul_pd(self.twiddle9re, x15p16); let t_a14_1 = _mm_mul_pd(self.twiddle14re, x1p30); let t_a14_2 = _mm_mul_pd(self.twiddle3re, x2p29); let t_a14_3 = _mm_mul_pd(self.twiddle11re, x3p28); let t_a14_4 = _mm_mul_pd(self.twiddle6re, x4p27); let t_a14_5 = _mm_mul_pd(self.twiddle8re, x5p26); let t_a14_6 = _mm_mul_pd(self.twiddle9re, x6p25); let t_a14_7 = _mm_mul_pd(self.twiddle5re, x7p24); let t_a14_8 = _mm_mul_pd(self.twiddle12re, x8p23); let t_a14_9 = _mm_mul_pd(self.twiddle2re, x9p22); let t_a14_10 = _mm_mul_pd(self.twiddle15re, x10p21); let t_a14_11 = _mm_mul_pd(self.twiddle1re, x11p20); let t_a14_12 = _mm_mul_pd(self.twiddle13re, x12p19); let t_a14_13 = _mm_mul_pd(self.twiddle4re, x13p18); let t_a14_14 = _mm_mul_pd(self.twiddle10re, x14p17); let t_a14_15 = _mm_mul_pd(self.twiddle7re, x15p16); let t_a15_1 = _mm_mul_pd(self.twiddle15re, x1p30); let t_a15_2 = _mm_mul_pd(self.twiddle1re, x2p29); let t_a15_3 = _mm_mul_pd(self.twiddle14re, x3p28); let t_a15_4 = _mm_mul_pd(self.twiddle2re, x4p27); let t_a15_5 = _mm_mul_pd(self.twiddle13re, x5p26); let t_a15_6 = _mm_mul_pd(self.twiddle3re, x6p25); let t_a15_7 = _mm_mul_pd(self.twiddle12re, x7p24); let t_a15_8 = _mm_mul_pd(self.twiddle4re, x8p23); let t_a15_9 = _mm_mul_pd(self.twiddle11re, x9p22); let t_a15_10 = _mm_mul_pd(self.twiddle5re, x10p21); let t_a15_11 = _mm_mul_pd(self.twiddle10re, x11p20); let t_a15_12 = _mm_mul_pd(self.twiddle6re, x12p19); let t_a15_13 = _mm_mul_pd(self.twiddle9re, x13p18); let t_a15_14 = _mm_mul_pd(self.twiddle7re, x14p17); let t_a15_15 = _mm_mul_pd(self.twiddle8re, x15p16); let t_b1_1 = _mm_mul_pd(self.twiddle1im, x1m30); let t_b1_2 = _mm_mul_pd(self.twiddle2im, x2m29); let t_b1_3 = _mm_mul_pd(self.twiddle3im, x3m28); let t_b1_4 = _mm_mul_pd(self.twiddle4im, x4m27); let t_b1_5 = _mm_mul_pd(self.twiddle5im, x5m26); let t_b1_6 = _mm_mul_pd(self.twiddle6im, x6m25); let t_b1_7 = _mm_mul_pd(self.twiddle7im, x7m24); let t_b1_8 = _mm_mul_pd(self.twiddle8im, x8m23); let t_b1_9 = _mm_mul_pd(self.twiddle9im, 
x9m22); let t_b1_10 = _mm_mul_pd(self.twiddle10im, x10m21); let t_b1_11 = _mm_mul_pd(self.twiddle11im, x11m20); let t_b1_12 = _mm_mul_pd(self.twiddle12im, x12m19); let t_b1_13 = _mm_mul_pd(self.twiddle13im, x13m18); let t_b1_14 = _mm_mul_pd(self.twiddle14im, x14m17); let t_b1_15 = _mm_mul_pd(self.twiddle15im, x15m16); let t_b2_1 = _mm_mul_pd(self.twiddle2im, x1m30); let t_b2_2 = _mm_mul_pd(self.twiddle4im, x2m29); let t_b2_3 = _mm_mul_pd(self.twiddle6im, x3m28); let t_b2_4 = _mm_mul_pd(self.twiddle8im, x4m27); let t_b2_5 = _mm_mul_pd(self.twiddle10im, x5m26); let t_b2_6 = _mm_mul_pd(self.twiddle12im, x6m25); let t_b2_7 = _mm_mul_pd(self.twiddle14im, x7m24); let t_b2_8 = _mm_mul_pd(self.twiddle15im, x8m23); let t_b2_9 = _mm_mul_pd(self.twiddle13im, x9m22); let t_b2_10 = _mm_mul_pd(self.twiddle11im, x10m21); let t_b2_11 = _mm_mul_pd(self.twiddle9im, x11m20); let t_b2_12 = _mm_mul_pd(self.twiddle7im, x12m19); let t_b2_13 = _mm_mul_pd(self.twiddle5im, x13m18); let t_b2_14 = _mm_mul_pd(self.twiddle3im, x14m17); let t_b2_15 = _mm_mul_pd(self.twiddle1im, x15m16); let t_b3_1 = _mm_mul_pd(self.twiddle3im, x1m30); let t_b3_2 = _mm_mul_pd(self.twiddle6im, x2m29); let t_b3_3 = _mm_mul_pd(self.twiddle9im, x3m28); let t_b3_4 = _mm_mul_pd(self.twiddle12im, x4m27); let t_b3_5 = _mm_mul_pd(self.twiddle15im, x5m26); let t_b3_6 = _mm_mul_pd(self.twiddle13im, x6m25); let t_b3_7 = _mm_mul_pd(self.twiddle10im, x7m24); let t_b3_8 = _mm_mul_pd(self.twiddle7im, x8m23); let t_b3_9 = _mm_mul_pd(self.twiddle4im, x9m22); let t_b3_10 = _mm_mul_pd(self.twiddle1im, x10m21); let t_b3_11 = _mm_mul_pd(self.twiddle2im, x11m20); let t_b3_12 = _mm_mul_pd(self.twiddle5im, x12m19); let t_b3_13 = _mm_mul_pd(self.twiddle8im, x13m18); let t_b3_14 = _mm_mul_pd(self.twiddle11im, x14m17); let t_b3_15 = _mm_mul_pd(self.twiddle14im, x15m16); let t_b4_1 = _mm_mul_pd(self.twiddle4im, x1m30); let t_b4_2 = _mm_mul_pd(self.twiddle8im, x2m29); let t_b4_3 = _mm_mul_pd(self.twiddle12im, x3m28); let t_b4_4 = _mm_mul_pd(self.twiddle15im, x4m27); let t_b4_5 = _mm_mul_pd(self.twiddle11im, x5m26); let t_b4_6 = _mm_mul_pd(self.twiddle7im, x6m25); let t_b4_7 = _mm_mul_pd(self.twiddle3im, x7m24); let t_b4_8 = _mm_mul_pd(self.twiddle1im, x8m23); let t_b4_9 = _mm_mul_pd(self.twiddle5im, x9m22); let t_b4_10 = _mm_mul_pd(self.twiddle9im, x10m21); let t_b4_11 = _mm_mul_pd(self.twiddle13im, x11m20); let t_b4_12 = _mm_mul_pd(self.twiddle14im, x12m19); let t_b4_13 = _mm_mul_pd(self.twiddle10im, x13m18); let t_b4_14 = _mm_mul_pd(self.twiddle6im, x14m17); let t_b4_15 = _mm_mul_pd(self.twiddle2im, x15m16); let t_b5_1 = _mm_mul_pd(self.twiddle5im, x1m30); let t_b5_2 = _mm_mul_pd(self.twiddle10im, x2m29); let t_b5_3 = _mm_mul_pd(self.twiddle15im, x3m28); let t_b5_4 = _mm_mul_pd(self.twiddle11im, x4m27); let t_b5_5 = _mm_mul_pd(self.twiddle6im, x5m26); let t_b5_6 = _mm_mul_pd(self.twiddle1im, x6m25); let t_b5_7 = _mm_mul_pd(self.twiddle4im, x7m24); let t_b5_8 = _mm_mul_pd(self.twiddle9im, x8m23); let t_b5_9 = _mm_mul_pd(self.twiddle14im, x9m22); let t_b5_10 = _mm_mul_pd(self.twiddle12im, x10m21); let t_b5_11 = _mm_mul_pd(self.twiddle7im, x11m20); let t_b5_12 = _mm_mul_pd(self.twiddle2im, x12m19); let t_b5_13 = _mm_mul_pd(self.twiddle3im, x13m18); let t_b5_14 = _mm_mul_pd(self.twiddle8im, x14m17); let t_b5_15 = _mm_mul_pd(self.twiddle13im, x15m16); let t_b6_1 = _mm_mul_pd(self.twiddle6im, x1m30); let t_b6_2 = _mm_mul_pd(self.twiddle12im, x2m29); let t_b6_3 = _mm_mul_pd(self.twiddle13im, x3m28); let t_b6_4 = _mm_mul_pd(self.twiddle7im, x4m27); let t_b6_5 = 
_mm_mul_pd(self.twiddle1im, x5m26); let t_b6_6 = _mm_mul_pd(self.twiddle5im, x6m25); let t_b6_7 = _mm_mul_pd(self.twiddle11im, x7m24); let t_b6_8 = _mm_mul_pd(self.twiddle14im, x8m23); let t_b6_9 = _mm_mul_pd(self.twiddle8im, x9m22); let t_b6_10 = _mm_mul_pd(self.twiddle2im, x10m21); let t_b6_11 = _mm_mul_pd(self.twiddle4im, x11m20); let t_b6_12 = _mm_mul_pd(self.twiddle10im, x12m19); let t_b6_13 = _mm_mul_pd(self.twiddle15im, x13m18); let t_b6_14 = _mm_mul_pd(self.twiddle9im, x14m17); let t_b6_15 = _mm_mul_pd(self.twiddle3im, x15m16); let t_b7_1 = _mm_mul_pd(self.twiddle7im, x1m30); let t_b7_2 = _mm_mul_pd(self.twiddle14im, x2m29); let t_b7_3 = _mm_mul_pd(self.twiddle10im, x3m28); let t_b7_4 = _mm_mul_pd(self.twiddle3im, x4m27); let t_b7_5 = _mm_mul_pd(self.twiddle4im, x5m26); let t_b7_6 = _mm_mul_pd(self.twiddle11im, x6m25); let t_b7_7 = _mm_mul_pd(self.twiddle13im, x7m24); let t_b7_8 = _mm_mul_pd(self.twiddle6im, x8m23); let t_b7_9 = _mm_mul_pd(self.twiddle1im, x9m22); let t_b7_10 = _mm_mul_pd(self.twiddle8im, x10m21); let t_b7_11 = _mm_mul_pd(self.twiddle15im, x11m20); let t_b7_12 = _mm_mul_pd(self.twiddle9im, x12m19); let t_b7_13 = _mm_mul_pd(self.twiddle2im, x13m18); let t_b7_14 = _mm_mul_pd(self.twiddle5im, x14m17); let t_b7_15 = _mm_mul_pd(self.twiddle12im, x15m16); let t_b8_1 = _mm_mul_pd(self.twiddle8im, x1m30); let t_b8_2 = _mm_mul_pd(self.twiddle15im, x2m29); let t_b8_3 = _mm_mul_pd(self.twiddle7im, x3m28); let t_b8_4 = _mm_mul_pd(self.twiddle1im, x4m27); let t_b8_5 = _mm_mul_pd(self.twiddle9im, x5m26); let t_b8_6 = _mm_mul_pd(self.twiddle14im, x6m25); let t_b8_7 = _mm_mul_pd(self.twiddle6im, x7m24); let t_b8_8 = _mm_mul_pd(self.twiddle2im, x8m23); let t_b8_9 = _mm_mul_pd(self.twiddle10im, x9m22); let t_b8_10 = _mm_mul_pd(self.twiddle13im, x10m21); let t_b8_11 = _mm_mul_pd(self.twiddle5im, x11m20); let t_b8_12 = _mm_mul_pd(self.twiddle3im, x12m19); let t_b8_13 = _mm_mul_pd(self.twiddle11im, x13m18); let t_b8_14 = _mm_mul_pd(self.twiddle12im, x14m17); let t_b8_15 = _mm_mul_pd(self.twiddle4im, x15m16); let t_b9_1 = _mm_mul_pd(self.twiddle9im, x1m30); let t_b9_2 = _mm_mul_pd(self.twiddle13im, x2m29); let t_b9_3 = _mm_mul_pd(self.twiddle4im, x3m28); let t_b9_4 = _mm_mul_pd(self.twiddle5im, x4m27); let t_b9_5 = _mm_mul_pd(self.twiddle14im, x5m26); let t_b9_6 = _mm_mul_pd(self.twiddle8im, x6m25); let t_b9_7 = _mm_mul_pd(self.twiddle1im, x7m24); let t_b9_8 = _mm_mul_pd(self.twiddle10im, x8m23); let t_b9_9 = _mm_mul_pd(self.twiddle12im, x9m22); let t_b9_10 = _mm_mul_pd(self.twiddle3im, x10m21); let t_b9_11 = _mm_mul_pd(self.twiddle6im, x11m20); let t_b9_12 = _mm_mul_pd(self.twiddle15im, x12m19); let t_b9_13 = _mm_mul_pd(self.twiddle7im, x13m18); let t_b9_14 = _mm_mul_pd(self.twiddle2im, x14m17); let t_b9_15 = _mm_mul_pd(self.twiddle11im, x15m16); let t_b10_1 = _mm_mul_pd(self.twiddle10im, x1m30); let t_b10_2 = _mm_mul_pd(self.twiddle11im, x2m29); let t_b10_3 = _mm_mul_pd(self.twiddle1im, x3m28); let t_b10_4 = _mm_mul_pd(self.twiddle9im, x4m27); let t_b10_5 = _mm_mul_pd(self.twiddle12im, x5m26); let t_b10_6 = _mm_mul_pd(self.twiddle2im, x6m25); let t_b10_7 = _mm_mul_pd(self.twiddle8im, x7m24); let t_b10_8 = _mm_mul_pd(self.twiddle13im, x8m23); let t_b10_9 = _mm_mul_pd(self.twiddle3im, x9m22); let t_b10_10 = _mm_mul_pd(self.twiddle7im, x10m21); let t_b10_11 = _mm_mul_pd(self.twiddle14im, x11m20); let t_b10_12 = _mm_mul_pd(self.twiddle4im, x12m19); let t_b10_13 = _mm_mul_pd(self.twiddle6im, x13m18); let t_b10_14 = _mm_mul_pd(self.twiddle15im, x14m17); let t_b10_15 = 
_mm_mul_pd(self.twiddle5im, x15m16); let t_b11_1 = _mm_mul_pd(self.twiddle11im, x1m30); let t_b11_2 = _mm_mul_pd(self.twiddle9im, x2m29); let t_b11_3 = _mm_mul_pd(self.twiddle2im, x3m28); let t_b11_4 = _mm_mul_pd(self.twiddle13im, x4m27); let t_b11_5 = _mm_mul_pd(self.twiddle7im, x5m26); let t_b11_6 = _mm_mul_pd(self.twiddle4im, x6m25); let t_b11_7 = _mm_mul_pd(self.twiddle15im, x7m24); let t_b11_8 = _mm_mul_pd(self.twiddle5im, x8m23); let t_b11_9 = _mm_mul_pd(self.twiddle6im, x9m22); let t_b11_10 = _mm_mul_pd(self.twiddle14im, x10m21); let t_b11_11 = _mm_mul_pd(self.twiddle3im, x11m20); let t_b11_12 = _mm_mul_pd(self.twiddle8im, x12m19); let t_b11_13 = _mm_mul_pd(self.twiddle12im, x13m18); let t_b11_14 = _mm_mul_pd(self.twiddle1im, x14m17); let t_b11_15 = _mm_mul_pd(self.twiddle10im, x15m16); let t_b12_1 = _mm_mul_pd(self.twiddle12im, x1m30); let t_b12_2 = _mm_mul_pd(self.twiddle7im, x2m29); let t_b12_3 = _mm_mul_pd(self.twiddle5im, x3m28); let t_b12_4 = _mm_mul_pd(self.twiddle14im, x4m27); let t_b12_5 = _mm_mul_pd(self.twiddle2im, x5m26); let t_b12_6 = _mm_mul_pd(self.twiddle10im, x6m25); let t_b12_7 = _mm_mul_pd(self.twiddle9im, x7m24); let t_b12_8 = _mm_mul_pd(self.twiddle3im, x8m23); let t_b12_9 = _mm_mul_pd(self.twiddle15im, x9m22); let t_b12_10 = _mm_mul_pd(self.twiddle4im, x10m21); let t_b12_11 = _mm_mul_pd(self.twiddle8im, x11m20); let t_b12_12 = _mm_mul_pd(self.twiddle11im, x12m19); let t_b12_13 = _mm_mul_pd(self.twiddle1im, x13m18); let t_b12_14 = _mm_mul_pd(self.twiddle13im, x14m17); let t_b12_15 = _mm_mul_pd(self.twiddle6im, x15m16); let t_b13_1 = _mm_mul_pd(self.twiddle13im, x1m30); let t_b13_2 = _mm_mul_pd(self.twiddle5im, x2m29); let t_b13_3 = _mm_mul_pd(self.twiddle8im, x3m28); let t_b13_4 = _mm_mul_pd(self.twiddle10im, x4m27); let t_b13_5 = _mm_mul_pd(self.twiddle3im, x5m26); let t_b13_6 = _mm_mul_pd(self.twiddle15im, x6m25); let t_b13_7 = _mm_mul_pd(self.twiddle2im, x7m24); let t_b13_8 = _mm_mul_pd(self.twiddle11im, x8m23); let t_b13_9 = _mm_mul_pd(self.twiddle7im, x9m22); let t_b13_10 = _mm_mul_pd(self.twiddle6im, x10m21); let t_b13_11 = _mm_mul_pd(self.twiddle12im, x11m20); let t_b13_12 = _mm_mul_pd(self.twiddle1im, x12m19); let t_b13_13 = _mm_mul_pd(self.twiddle14im, x13m18); let t_b13_14 = _mm_mul_pd(self.twiddle4im, x14m17); let t_b13_15 = _mm_mul_pd(self.twiddle9im, x15m16); let t_b14_1 = _mm_mul_pd(self.twiddle14im, x1m30); let t_b14_2 = _mm_mul_pd(self.twiddle3im, x2m29); let t_b14_3 = _mm_mul_pd(self.twiddle11im, x3m28); let t_b14_4 = _mm_mul_pd(self.twiddle6im, x4m27); let t_b14_5 = _mm_mul_pd(self.twiddle8im, x5m26); let t_b14_6 = _mm_mul_pd(self.twiddle9im, x6m25); let t_b14_7 = _mm_mul_pd(self.twiddle5im, x7m24); let t_b14_8 = _mm_mul_pd(self.twiddle12im, x8m23); let t_b14_9 = _mm_mul_pd(self.twiddle2im, x9m22); let t_b14_10 = _mm_mul_pd(self.twiddle15im, x10m21); let t_b14_11 = _mm_mul_pd(self.twiddle1im, x11m20); let t_b14_12 = _mm_mul_pd(self.twiddle13im, x12m19); let t_b14_13 = _mm_mul_pd(self.twiddle4im, x13m18); let t_b14_14 = _mm_mul_pd(self.twiddle10im, x14m17); let t_b14_15 = _mm_mul_pd(self.twiddle7im, x15m16); let t_b15_1 = _mm_mul_pd(self.twiddle15im, x1m30); let t_b15_2 = _mm_mul_pd(self.twiddle1im, x2m29); let t_b15_3 = _mm_mul_pd(self.twiddle14im, x3m28); let t_b15_4 = _mm_mul_pd(self.twiddle2im, x4m27); let t_b15_5 = _mm_mul_pd(self.twiddle13im, x5m26); let t_b15_6 = _mm_mul_pd(self.twiddle3im, x6m25); let t_b15_7 = _mm_mul_pd(self.twiddle12im, x7m24); let t_b15_8 = _mm_mul_pd(self.twiddle4im, x8m23); let t_b15_9 = 
_mm_mul_pd(self.twiddle11im, x9m22); let t_b15_10 = _mm_mul_pd(self.twiddle5im, x10m21); let t_b15_11 = _mm_mul_pd(self.twiddle10im, x11m20); let t_b15_12 = _mm_mul_pd(self.twiddle6im, x12m19); let t_b15_13 = _mm_mul_pd(self.twiddle9im, x13m18); let t_b15_14 = _mm_mul_pd(self.twiddle7im, x14m17); let t_b15_15 = _mm_mul_pd(self.twiddle8im, x15m16); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14 + t_a1_15); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14 + t_a2_15); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14 + t_a3_15); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14 + t_a4_15); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14 + t_a5_15); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14 + t_a6_15); let t_a7 = calc_f64!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14 + t_a7_15); let t_a8 = calc_f64!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14 + t_a8_15); let t_a9 = calc_f64!(x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14 + t_a9_15); let t_a10 = calc_f64!(x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14 + t_a10_15); let t_a11 = calc_f64!(x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14 + t_a11_15); let t_a12 = calc_f64!(x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14 + t_a12_15); let t_a13 = calc_f64!(x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14 + t_a13_15); let t_a14 = calc_f64!(x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14 + t_a14_15); let t_a15 = calc_f64!(x0 + t_a15_1 + t_a15_2 + t_a15_3 + t_a15_4 + t_a15_5 + t_a15_6 + t_a15_7 + t_a15_8 + t_a15_9 + t_a15_10 + t_a15_11 + t_a15_12 + t_a15_13 + t_a15_14 + t_a15_15); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14 + t_b1_15); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14 - t_b2_15); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 + t_b3_3 + t_b3_4 + t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 - t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14 + t_b3_15); let t_b4 = calc_f64!(t_b4_1 + t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + 
t_b4_9 + t_b4_10 + t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14 - t_b4_15); let t_b5 = calc_f64!(t_b5_1 + t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8 + t_b5_9 - t_b5_10 - t_b5_11 - t_b5_12 + t_b5_13 + t_b5_14 + t_b5_15); let t_b6 = calc_f64!(t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 - t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14 - t_b6_15); let t_b7 = calc_f64!(t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 + t_b7_11 - t_b7_12 - t_b7_13 + t_b7_14 + t_b7_15); let t_b8 = calc_f64!(t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 - t_b8_11 + t_b8_12 + t_b8_13 - t_b8_14 - t_b8_15); let t_b9 = calc_f64!(t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11 + t_b9_12 - t_b9_13 + t_b9_14 + t_b9_15); let t_b10 = calc_f64!(t_b10_1 - t_b10_2 - t_b10_3 + t_b10_4 - t_b10_5 - t_b10_6 + t_b10_7 - t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11 - t_b10_12 + t_b10_13 - t_b10_14 - t_b10_15); let t_b11 = calc_f64!(t_b11_1 - t_b11_2 + t_b11_3 + t_b11_4 - t_b11_5 + t_b11_6 + t_b11_7 - t_b11_8 + t_b11_9 - t_b11_10 - t_b11_11 + t_b11_12 - t_b11_13 - t_b11_14 + t_b11_15); let t_b12 = calc_f64!(t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 - t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 + t_b12_9 - t_b12_10 + t_b12_11 - t_b12_12 + t_b12_13 + t_b12_14 - t_b12_15); let t_b13 = calc_f64!(t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 - t_b13_7 + t_b13_8 - t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 + t_b13_13 - t_b13_14 + t_b13_15); let t_b14 = calc_f64!(t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 - t_b14_11 + t_b14_12 - t_b14_13 + t_b14_14 - t_b14_15); let t_b15 = calc_f64!(t_b15_1 - t_b15_2 + t_b15_3 - t_b15_4 + t_b15_5 - t_b15_6 + t_b15_7 - t_b15_8 + t_b15_9 - t_b15_10 + t_b15_11 - t_b15_12 + t_b15_13 - t_b15_14 + t_b15_15); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let t_b10_rot = self.rotate.rotate(t_b10); let t_b11_rot = self.rotate.rotate(t_b11); let t_b12_rot = self.rotate.rotate(t_b12); let t_b13_rot = self.rotate.rotate(t_b13); let t_b14_rot = self.rotate.rotate(t_b14); let t_b15_rot = self.rotate.rotate(t_b15); let y0 = calc_f64!(x0 + x1p30 + x2p29 + x3p28 + x4p27 + x5p26 + x6p25 + x7p24 + x8p23 + x9p22 + x10p21 + x11p20 + x12p19 + x13p18 + x14p17 + x15p16); let [y1, y30] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y29] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y28] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y27] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y26] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y25] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y24] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y23] = solo_fft2_f64(t_a8, t_b8_rot); let [y9, y22] = solo_fft2_f64(t_a9, t_b9_rot); let [y10, y21] = solo_fft2_f64(t_a10, t_b10_rot); let [y11, y20] = solo_fft2_f64(t_a11, t_b11_rot); let [y12, y19] = solo_fft2_f64(t_a12, t_b12_rot); let [y13, y18] = solo_fft2_f64(t_a13, t_b13_rot); let [y14, y17] = solo_fft2_f64(t_a14, t_b14_rot); let [y15, y16] = solo_fft2_f64(t_a15, t_b15_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, 
y23, y24, y25, y26, y27, y28, y29, y30] } } // _____ _____ ____ _____ ____ // |_ _| ____/ ___|_ _/ ___| // | | | _| \___ \ | | \___ \ // | | | |___ ___) || | ___) | // |_| |_____|____/ |_| |____/ // #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; //the tests for all butterflies will be identical except for the identifiers used and size //so it's ideal for a macro macro_rules! test_butterfly_32_func { ($test_name:ident, $struct_name:ident, $size:expr) => { #[test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward); check_fft_algorithm::(&butterfly, $size, FftDirection::Forward); let butterfly_direction = $struct_name::new(FftDirection::Inverse); check_fft_algorithm::(&butterfly_direction, $size, FftDirection::Inverse); } }; } test_butterfly_32_func!(test_ssef32_butterfly7, SseF32Butterfly7, 7); test_butterfly_32_func!(test_ssef32_butterfly11, SseF32Butterfly11, 11); test_butterfly_32_func!(test_ssef32_butterfly13, SseF32Butterfly13, 13); test_butterfly_32_func!(test_ssef32_butterfly17, SseF32Butterfly17, 17); test_butterfly_32_func!(test_ssef32_butterfly19, SseF32Butterfly19, 19); test_butterfly_32_func!(test_ssef32_butterfly23, SseF32Butterfly23, 23); test_butterfly_32_func!(test_ssef32_butterfly29, SseF32Butterfly29, 29); test_butterfly_32_func!(test_ssef32_butterfly31, SseF32Butterfly31, 31); //the tests for all butterflies will be identical except for the identifiers used and size //so it's ideal for a macro macro_rules! test_butterfly_64_func { ($test_name:ident, $struct_name:ident, $size:expr) => { #[test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward); check_fft_algorithm::(&butterfly, $size, FftDirection::Forward); let butterfly_direction = $struct_name::new(FftDirection::Inverse); check_fft_algorithm::(&butterfly_direction, $size, FftDirection::Inverse); } }; } test_butterfly_64_func!(test_ssef64_butterfly7, SseF64Butterfly7, 7); test_butterfly_64_func!(test_ssef64_butterfly11, SseF64Butterfly11, 11); test_butterfly_64_func!(test_ssef64_butterfly13, SseF64Butterfly13, 13); test_butterfly_64_func!(test_ssef64_butterfly17, SseF64Butterfly17, 17); test_butterfly_64_func!(test_ssef64_butterfly19, SseF64Butterfly19, 19); test_butterfly_64_func!(test_ssef64_butterfly23, SseF64Butterfly23, 23); test_butterfly_64_func!(test_ssef64_butterfly29, SseF64Butterfly29, 29); test_butterfly_64_func!(test_ssef64_butterfly31, SseF64Butterfly31, 31); } rustfft-6.2.0/src/sse/sse_radix4.rs000064400000000000000000000405010072674642500153520ustar 00000000000000use num_complex::Complex; use core::arch::x86_64::*; use crate::algorithm::bitreversed_transpose; use crate::array_utils::{self, workaround_transmute_mut}; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::sse::sse_butterflies::{ SseF32Butterfly1, SseF32Butterfly16, SseF32Butterfly2, SseF32Butterfly32, SseF32Butterfly4, SseF32Butterfly8, }; use crate::sse::sse_butterflies::{ SseF64Butterfly1, SseF64Butterfly16, SseF64Butterfly2, SseF64Butterfly32, SseF64Butterfly4, SseF64Butterfly8, }; use crate::{common::FftNum, twiddles, FftDirection}; use crate::{Direction, Fft, Length}; use super::sse_common::{assert_f32, assert_f64}; use super::sse_utils::*; use super::sse_vector::{SseArray, SseArrayMut}; /// FFT algorithm optimized for power-of-two sizes, SSE accelerated version. /// This is designed to be used via a Planner, and not created directly. 
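// For a concrete example of the plan built in this file: a 1024-point FFT (num_bits = 10, even)
// uses a 16-point base. The input is bit-reverse transposed, 16-point base butterflies run over
// the whole buffer, and then three radix-4 cross-FFT passes combine them at sizes 64, 256 and 1024.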
const USE_BUTTERFLY32_FROM: usize = 262144; // Use length 32 butterfly starting from this length enum Sse32Butterfly { Len1(SseF32Butterfly1), Len2(SseF32Butterfly2), Len4(SseF32Butterfly4), Len8(SseF32Butterfly8), Len16(SseF32Butterfly16), Len32(SseF32Butterfly32), } enum Sse64Butterfly { Len1(SseF64Butterfly1), Len2(SseF64Butterfly2), Len4(SseF64Butterfly4), Len8(SseF64Butterfly8), Len16(SseF64Butterfly16), Len32(SseF64Butterfly32), } pub struct Sse32Radix4 { _phantom: std::marker::PhantomData, twiddles: Box<[__m128]>, base_fft: Sse32Butterfly, base_len: usize, len: usize, direction: FftDirection, bf4: SseF32Butterfly4, } impl Sse32Radix4 { /// Preallocates necessary arrays and precomputes necessary data to efficiently compute the power-of-two FFT pub fn new(len: usize, direction: FftDirection) -> Self { assert!( len.is_power_of_two(), "Radix4 algorithm requires a power-of-two input size. Got {}", len ); assert_f32::(); // figure out which base length we're going to use let num_bits = len.trailing_zeros(); let (base_len, base_fft) = match num_bits { 0 => (len, Sse32Butterfly::Len1(SseF32Butterfly1::new(direction))), 1 => (len, Sse32Butterfly::Len2(SseF32Butterfly2::new(direction))), 2 => (len, Sse32Butterfly::Len4(SseF32Butterfly4::new(direction))), 3 => (len, Sse32Butterfly::Len8(SseF32Butterfly8::new(direction))), _ => { if num_bits % 2 == 1 { if len < USE_BUTTERFLY32_FROM { (8, Sse32Butterfly::Len8(SseF32Butterfly8::new(direction))) } else { (32, Sse32Butterfly::Len32(SseF32Butterfly32::new(direction))) } } else { (16, Sse32Butterfly::Len16(SseF32Butterfly16::new(direction))) } } }; // precompute the twiddle factors this algorithm will use. // we're doing the same precomputation of twiddle factors as the mixed radix algorithm where width=4 and height=len/4 // but mixed radix only does one step and then calls itself recusrively, and this algorithm does every layer all the way down // so we're going to pack all the "layers" of twiddle factors into a single array, starting with the bottom layer and going up let mut twiddle_stride = len / (base_len * 4); let mut twiddle_factors = Vec::with_capacity(len * 2); while twiddle_stride > 0 { let num_rows = len / (twiddle_stride * 4); for i in 0..num_rows / 2 { for k in 1..4 { unsafe { let twiddle_a = twiddles::compute_twiddle(2 * i * k * twiddle_stride, len, direction); let twiddle_b = twiddles::compute_twiddle( (2 * i + 1) * k * twiddle_stride, len, direction, ); let twiddles_packed = _mm_set_ps(twiddle_b.im, twiddle_b.re, twiddle_a.im, twiddle_a.re); twiddle_factors.push(twiddles_packed); } } } twiddle_stride >>= 2; } Self { twiddles: twiddle_factors.into_boxed_slice(), base_fft, base_len, len, direction, _phantom: std::marker::PhantomData, bf4: SseF32Butterfly4::::new(direction), } } #[target_feature(enable = "sse4.1")] unsafe fn perform_fft_out_of_place( &self, signal: &[Complex], spectrum: &mut [Complex], _scratch: &mut [Complex], ) { // copy the data into the spectrum vector if self.len() == self.base_len { spectrum.copy_from_slice(signal); } else { bitreversed_transpose(self.base_len, signal, spectrum); } // Base-level FFTs match &self.base_fft { Sse32Butterfly::Len1(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Sse32Butterfly::Len2(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Sse32Butterfly::Len4(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Sse32Butterfly::Len8(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Sse32Butterfly::Len16(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), 
Sse32Butterfly::Len32(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), }; // cross-FFTs let mut current_size = self.base_len * 4; let mut layer_twiddles: &[__m128] = &self.twiddles; while current_size <= signal.len() { let num_rows = signal.len() / current_size; for i in 0..num_rows { butterfly_4_32( &mut spectrum[i * current_size..], layer_twiddles, current_size / 4, &self.bf4, ) } //skip past all the twiddle factors used in this layer let twiddle_offset = (current_size * 3) / 8; layer_twiddles = &layer_twiddles[twiddle_offset..]; current_size *= 4; } } } boilerplate_fft_sse_oop!(Sse32Radix4, |this: &Sse32Radix4<_>| this.len); #[target_feature(enable = "sse4.1")] unsafe fn butterfly_4_32( data: &mut [Complex], twiddles: &[__m128], num_ffts: usize, bf4: &SseF32Butterfly4, ) { let mut idx = 0usize; let mut buffer: &mut [Complex] = workaround_transmute_mut(data); for tw in twiddles.chunks_exact(6).take(num_ffts / 4) { let scratch0 = buffer.load_complex(idx); let scratch0b = buffer.load_complex(idx + 2); let mut scratch1 = buffer.load_complex(idx + 1 * num_ffts); let mut scratch1b = buffer.load_complex(idx + 2 + 1 * num_ffts); let mut scratch2 = buffer.load_complex(idx + 2 * num_ffts); let mut scratch2b = buffer.load_complex(idx + 2 + 2 * num_ffts); let mut scratch3 = buffer.load_complex(idx + 3 * num_ffts); let mut scratch3b = buffer.load_complex(idx + 2 + 3 * num_ffts); scratch1 = mul_complex_f32(scratch1, tw[0]); scratch2 = mul_complex_f32(scratch2, tw[1]); scratch3 = mul_complex_f32(scratch3, tw[2]); scratch1b = mul_complex_f32(scratch1b, tw[3]); scratch2b = mul_complex_f32(scratch2b, tw[4]); scratch3b = mul_complex_f32(scratch3b, tw[5]); let scratch = bf4.perform_parallel_fft_direct(scratch0, scratch1, scratch2, scratch3); let scratchb = bf4.perform_parallel_fft_direct(scratch0b, scratch1b, scratch2b, scratch3b); buffer.store_complex(scratch[0], idx); buffer.store_complex(scratchb[0], idx + 2); buffer.store_complex(scratch[1], idx + 1 * num_ffts); buffer.store_complex(scratchb[1], idx + 2 + 1 * num_ffts); buffer.store_complex(scratch[2], idx + 2 * num_ffts); buffer.store_complex(scratchb[2], idx + 2 + 2 * num_ffts); buffer.store_complex(scratch[3], idx + 3 * num_ffts); buffer.store_complex(scratchb[3], idx + 2 + 3 * num_ffts); idx += 4; } } pub struct Sse64Radix4 { _phantom: std::marker::PhantomData, twiddles: Box<[__m128d]>, base_fft: Sse64Butterfly, base_len: usize, len: usize, direction: FftDirection, bf4: SseF64Butterfly4, } impl Sse64Radix4 { /// Preallocates necessary arrays and precomputes necessary data to efficiently compute the power-of-two FFT pub fn new(len: usize, direction: FftDirection) -> Self { assert!( len.is_power_of_two(), "Radix4 algorithm requires a power-of-two input size. Got {}", len ); assert_f64::(); // figure out which base length we're going to use let num_bits = len.trailing_zeros(); let (base_len, base_fft) = match num_bits { 0 => (len, Sse64Butterfly::Len1(SseF64Butterfly1::new(direction))), 1 => (len, Sse64Butterfly::Len2(SseF64Butterfly2::new(direction))), 2 => (len, Sse64Butterfly::Len4(SseF64Butterfly4::new(direction))), 3 => (len, Sse64Butterfly::Len8(SseF64Butterfly8::new(direction))), _ => { if num_bits % 2 == 1 { if len < USE_BUTTERFLY32_FROM { (8, Sse64Butterfly::Len8(SseF64Butterfly8::new(direction))) } else { (32, Sse64Butterfly::Len32(SseF64Butterfly32::new(direction))) } } else { (16, Sse64Butterfly::Len16(SseF64Butterfly16::new(direction))) } } }; // precompute the twiddle factors this algorithm will use. 
// we're doing the same precomputation of twiddle factors as the mixed radix algorithm where width=4 and height=len/4 // but mixed radix only does one step and then calls itself recusrively, and this algorithm does every layer all the way down // so we're going to pack all the "layers" of twiddle factors into a single array, starting with the bottom layer and going up let mut twiddle_stride = len / (base_len * 4); let mut twiddle_factors = Vec::with_capacity(len * 2); while twiddle_stride > 0 { let num_rows = len / (twiddle_stride * 4); for i in 0..num_rows { for k in 1..4 { unsafe { let twiddle = twiddles::compute_twiddle(i * k * twiddle_stride, len, direction); let twiddle_packed = _mm_set_pd(twiddle.im, twiddle.re); twiddle_factors.push(twiddle_packed); } } } twiddle_stride >>= 2; } Self { twiddles: twiddle_factors.into_boxed_slice(), base_fft, base_len, len, direction, _phantom: std::marker::PhantomData, bf4: SseF64Butterfly4::::new(direction), } } #[target_feature(enable = "sse4.1")] unsafe fn perform_fft_out_of_place( &self, signal: &[Complex], spectrum: &mut [Complex], _scratch: &mut [Complex], ) { // copy the data into the spectrum vector if self.len() == self.base_len { spectrum.copy_from_slice(signal); } else { bitreversed_transpose(self.base_len, signal, spectrum); } // Base-level FFTs match &self.base_fft { Sse64Butterfly::Len1(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Sse64Butterfly::Len2(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Sse64Butterfly::Len4(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Sse64Butterfly::Len8(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Sse64Butterfly::Len16(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), Sse64Butterfly::Len32(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), } // cross-FFTs let mut current_size = self.base_len * 4; let mut layer_twiddles: &[__m128d] = &self.twiddles; while current_size <= signal.len() { let num_rows = signal.len() / current_size; for i in 0..num_rows { butterfly_4_64( &mut spectrum[i * current_size..], layer_twiddles, current_size / 4, &self.bf4, ) } //skip past all the twiddle factors used in this layer let twiddle_offset = (current_size * 3) / 4; layer_twiddles = &layer_twiddles[twiddle_offset..]; current_size *= 4; } } } boilerplate_fft_sse_oop!(Sse64Radix4, |this: &Sse64Radix4<_>| this.len); #[target_feature(enable = "sse4.1")] unsafe fn butterfly_4_64( data: &mut [Complex], twiddles: &[__m128d], num_ffts: usize, bf4: &SseF64Butterfly4, ) { let mut idx = 0usize; let mut buffer: &mut [Complex] = workaround_transmute_mut(data); for tw in twiddles.chunks_exact(6).take(num_ffts / 2) { let scratch0 = buffer.load_complex(idx); let scratch0b = buffer.load_complex(idx + 1); let mut scratch1 = buffer.load_complex(idx + 1 * num_ffts); let mut scratch1b = buffer.load_complex(idx + 1 + 1 * num_ffts); let mut scratch2 = buffer.load_complex(idx + 2 * num_ffts); let mut scratch2b = buffer.load_complex(idx + 1 + 2 * num_ffts); let mut scratch3 = buffer.load_complex(idx + 3 * num_ffts); let mut scratch3b = buffer.load_complex(idx + 1 + 3 * num_ffts); scratch1 = mul_complex_f64(scratch1, tw[0]); scratch2 = mul_complex_f64(scratch2, tw[1]); scratch3 = mul_complex_f64(scratch3, tw[2]); scratch1b = mul_complex_f64(scratch1b, tw[3]); scratch2b = mul_complex_f64(scratch2b, tw[4]); scratch3b = mul_complex_f64(scratch3b, tw[5]); let scratch = bf4.perform_fft_direct(scratch0, scratch1, scratch2, scratch3); let scratchb = bf4.perform_fft_direct(scratch0b, scratch1b, 
scratch2b, scratch3b); buffer.store_complex(scratch[0], idx); buffer.store_complex(scratchb[0], idx + 1); buffer.store_complex(scratch[1], idx + 1 * num_ffts); buffer.store_complex(scratchb[1], idx + 1 + 1 * num_ffts); buffer.store_complex(scratch[2], idx + 2 * num_ffts); buffer.store_complex(scratchb[2], idx + 1 + 2 * num_ffts); buffer.store_complex(scratch[3], idx + 3 * num_ffts); buffer.store_complex(scratchb[3], idx + 1 + 3 * num_ffts); idx += 2; } } #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; #[test] fn test_sse_radix4_64() { for pow in 4..12 { let len = 1 << pow; test_sse_radix4_64_with_length(len, FftDirection::Forward); test_sse_radix4_64_with_length(len, FftDirection::Inverse); } } fn test_sse_radix4_64_with_length(len: usize, direction: FftDirection) { let fft = Sse64Radix4::new(len, direction); check_fft_algorithm::(&fft, len, direction); } #[test] fn test_sse_radix4_32() { for pow in 0..12 { let len = 1 << pow; test_sse_radix4_32_with_length(len, FftDirection::Forward); test_sse_radix4_32_with_length(len, FftDirection::Inverse); } } fn test_sse_radix4_32_with_length(len: usize, direction: FftDirection) { let fft = Sse32Radix4::new(len, direction); check_fft_algorithm::(&fft, len, direction); } } rustfft-6.2.0/src/sse/sse_utils.rs000064400000000000000000000202530072674642500153210ustar 00000000000000use core::arch::x86_64::*; // __ __ _ _ _________ _ _ _ // | \/ | __ _| |_| |__ |___ /___ \| |__ (_) |_ // | |\/| |/ _` | __| '_ \ _____ |_ \ __) | '_ \| | __| // | | | | (_| | |_| | | | |_____| ___) / __/| |_) | | |_ // |_| |_|\__,_|\__|_| |_| |____/_____|_.__/|_|\__| // pub struct Rotate90F32 { //sign_lo: __m128, sign_hi: __m128, sign_both: __m128, } impl Rotate90F32 { pub fn new(positive: bool) -> Self { // There doesn't seem to be any need for rotating just the first element, but let's keep the code just in case //let sign_lo = unsafe { // if positive { // _mm_set_ps(0.0, 0.0, 0.0, -0.0) // } // else { // _mm_set_ps(0.0, 0.0, -0.0, 0.0) // } //}; let sign_hi = unsafe { if positive { _mm_set_ps(0.0, -0.0, 0.0, 0.0) } else { _mm_set_ps(-0.0, 0.0, 0.0, 0.0) } }; let sign_both = unsafe { if positive { _mm_set_ps(0.0, -0.0, 0.0, -0.0) } else { _mm_set_ps(-0.0, 0.0, -0.0, 0.0) } }; Self { //sign_lo, sign_hi, sign_both, } } #[inline(always)] pub unsafe fn rotate_hi(&self, values: __m128) -> __m128 { let temp = _mm_shuffle_ps(values, values, 0xB4); _mm_xor_ps(temp, self.sign_hi) } // There doesn't seem to be any need for rotating just the first element, but let's keep the code just in case //#[inline(always)] //pub unsafe fn rotate_lo(&self, values: __m128) -> __m128 { // let temp = _mm_shuffle_ps(values, values, 0xE1); // _mm_xor_ps(temp, self.sign_lo) //} #[inline(always)] pub unsafe fn rotate_both(&self, values: __m128) -> __m128 { let temp = _mm_shuffle_ps(values, values, 0xB1); _mm_xor_ps(temp, self.sign_both) } } // Pack low (1st) complex // left: r1.re, r1.im, r2.re, r2.im // right: l1.re, l1.im, l2.re, l2.im // --> r1.re, r1.im, l1.re, l1.im #[inline(always)] pub unsafe fn extract_lo_lo_f32(left: __m128, right: __m128) -> __m128 { //_mm_shuffle_ps(left, right, 0x44) _mm_castpd_ps(_mm_unpacklo_pd(_mm_castps_pd(left), _mm_castps_pd(right))) } // Pack high (2nd) complex // left: r1.re, r1.im, r2.re, r2.im // right: l1.re, l1.im, l2.re, l2.im // --> r2.re, r2.im, l2.re, l2.im #[inline(always)] pub unsafe fn extract_hi_hi_f32(left: __m128, right: __m128) -> __m128 { _mm_castpd_ps(_mm_unpackhi_pd(_mm_castps_pd(left), _mm_castps_pd(right))) } 
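// Worked example for the two pack helpers above (this mirrors the `test_pack` unit test at the
// bottom of this file): with left = [1.0, 2.0, 3.0, 4.0] (complex 1+2i and 3+4i) and
// right = [5.0, 6.0, 7.0, 8.0] (complex 5+6i and 7+8i), extract_lo_lo_f32 returns
// [1.0, 2.0, 5.0, 6.0] and extract_hi_hi_f32 returns [3.0, 4.0, 7.0, 8.0].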
// Pack low (1st) and high (2nd) complex // left: r1.re, r1.im, r2.re, r2.im // right: l1.re, l1.im, l2.re, l2.im // --> r1.re, r1.im, l2.re, l2.im #[inline(always)] pub unsafe fn extract_lo_hi_f32(left: __m128, right: __m128) -> __m128 { _mm_blend_ps(left, right, 0x0C) } // Pack high (2nd) and low (1st) complex // left: r1.re, r1.im, r2.re, r2.im // right: l1.re, l1.im, l2.re, l2.im // --> r2.re, r2.im, l1.re, l1.im #[inline(always)] pub unsafe fn extract_hi_lo_f32(left: __m128, right: __m128) -> __m128 { _mm_shuffle_ps(left, right, 0x4E) } // Reverse complex // values: a.re, a.im, b.re, b.im // --> b.re, b.im, a.re, a.im #[inline(always)] pub unsafe fn reverse_complex_elements_f32(values: __m128) -> __m128 { _mm_shuffle_ps(values, values, 0x4E) } // Invert sign of high (2nd) complex // values: a.re, a.im, b.re, b.im // --> a.re, a.im, -b.re, -b.im #[inline(always)] pub unsafe fn negate_hi_f32(values: __m128) -> __m128 { _mm_xor_ps(values, _mm_set_ps(-0.0, -0.0, 0.0, 0.0)) } // Duplicate low (1st) complex // values: a.re, a.im, b.re, b.im // --> a.re, a.im, a.re, a.im #[inline(always)] pub unsafe fn duplicate_lo_f32(values: __m128) -> __m128 { _mm_shuffle_ps(values, values, 0x44) } // Duplicate high (2nd) complex // values: a.re, a.im, b.re, b.im // --> b.re, b.im, b.re, b.im #[inline(always)] pub unsafe fn duplicate_hi_f32(values: __m128) -> __m128 { _mm_shuffle_ps(values, values, 0xEE) } // transpose a 2x2 complex matrix given as [x0, x1], [x2, x3] // result is [x0, x2], [x1, x3] #[inline(always)] pub unsafe fn transpose_complex_2x2_f32(left: __m128, right: __m128) -> [__m128; 2] { let temp02 = extract_lo_lo_f32(left, right); let temp13 = extract_hi_hi_f32(left, right); [temp02, temp13] } // Complex multiplication. // Each input contains two complex values, which are multiplied in parallel. 
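// In scalar terms each lane pair computes the usual identity
// (x + iy) * (u + iv) = (x*u - y*v) + i(x*v + y*u):
// the first shuffle duplicates the real parts of `right` (u), the second duplicates the
// imaginary parts (v), the swap of the second product vector lines y*v up against x*u and
// x*v up against y*u, and _mm_addsub_ps applies the subtract/add pattern that yields the
// real and imaginary parts of each product in a single instruction.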
#[inline(always)] pub unsafe fn mul_complex_f32(left: __m128, right: __m128) -> __m128 { //SSE3, taken from Intel performance manual let mut temp1 = _mm_shuffle_ps(right, right, 0xA0); let mut temp2 = _mm_shuffle_ps(right, right, 0xF5); temp1 = _mm_mul_ps(temp1, left); temp2 = _mm_mul_ps(temp2, left); temp2 = _mm_shuffle_ps(temp2, temp2, 0xB1); _mm_addsub_ps(temp1, temp2) } // __ __ _ _ __ _ _ _ _ _ // | \/ | __ _| |_| |__ / /_ | || | | |__ (_) |_ // | |\/| |/ _` | __| '_ \ _____ | '_ \| || |_| '_ \| | __| // | | | | (_| | |_| | | | |_____| | (_) |__ _| |_) | | |_ // |_| |_|\__,_|\__|_| |_| \___/ |_| |_.__/|_|\__| // pub(crate) struct Rotate90F64 { sign: __m128d, } impl Rotate90F64 { pub fn new(positive: bool) -> Self { let sign = unsafe { if positive { _mm_set_pd(0.0, -0.0) } else { _mm_set_pd(-0.0, 0.0) } }; Self { sign } } #[inline(always)] pub unsafe fn rotate(&self, values: __m128d) -> __m128d { let temp = _mm_shuffle_pd(values, values, 0x01); _mm_xor_pd(temp, self.sign) } } #[inline(always)] pub unsafe fn mul_complex_f64(left: __m128d, right: __m128d) -> __m128d { // SSE3, taken from Intel performance manual let mut temp1 = _mm_unpacklo_pd(right, right); let mut temp2 = _mm_unpackhi_pd(right, right); temp1 = _mm_mul_pd(temp1, left); temp2 = _mm_mul_pd(temp2, left); temp2 = _mm_shuffle_pd(temp2, temp2, 0x01); _mm_addsub_pd(temp1, temp2) } #[cfg(test)] mod unit_tests { use super::*; use num_complex::Complex; #[test] fn test_mul_complex_f64() { unsafe { let right = _mm_set_pd(1.0, 2.0); let left = _mm_set_pd(5.0, 7.0); let res = mul_complex_f64(left, right); let expected = _mm_set_pd(2.0 * 5.0 + 1.0 * 7.0, 2.0 * 7.0 - 1.0 * 5.0); assert_eq!( std::mem::transmute::<__m128d, Complex>(res), std::mem::transmute::<__m128d, Complex>(expected) ); } } #[test] fn test_mul_complex_f32() { unsafe { let val1 = Complex::::new(1.0, 2.5); let val2 = Complex::::new(3.2, 4.2); let val3 = Complex::::new(5.6, 6.2); let val4 = Complex::::new(7.4, 8.3); let nbr2 = _mm_set_ps(val4.im, val4.re, val3.im, val3.re); let nbr1 = _mm_set_ps(val2.im, val2.re, val1.im, val1.re); let res = mul_complex_f32(nbr1, nbr2); let res = std::mem::transmute::<__m128, [Complex; 2]>(res); let expected = [val1 * val3, val2 * val4]; assert_eq!(res, expected); } } #[test] fn test_pack() { unsafe { let nbr2 = _mm_set_ps(8.0, 7.0, 6.0, 5.0); let nbr1 = _mm_set_ps(4.0, 3.0, 2.0, 1.0); let first = extract_lo_lo_f32(nbr1, nbr2); let second = extract_hi_hi_f32(nbr1, nbr2); let first = std::mem::transmute::<__m128, [Complex; 2]>(first); let second = std::mem::transmute::<__m128, [Complex; 2]>(second); let first_expected = [Complex::new(1.0, 2.0), Complex::new(5.0, 6.0)]; let second_expected = [Complex::new(3.0, 4.0), Complex::new(7.0, 8.0)]; assert_eq!(first, first_expected); assert_eq!(second, second_expected); } } } rustfft-6.2.0/src/sse/sse_vector.rs000064400000000000000000000264330072674642500154710ustar 00000000000000use core::arch::x86_64::*; use num_complex::Complex; use std::ops::{Deref, DerefMut}; use crate::array_utils::DoubleBuf; // Read these indexes from an SseArray and build an array of simd vectors. // Takes a name of a vector to read from, and a list of indexes to read. // This statement: // ``` // let values = read_complex_to_array!(input, {0, 1, 2, 3}); // ``` // is equivalent to: // ``` // let values = [ // input.load_complex(0), // input.load_complex(1), // input.load_complex(2), // input.load_complex(3), // ]; // ``` macro_rules! 
read_complex_to_array { ($input:ident, { $($idx:literal),* }) => { [ $( $input.load_complex($idx), )* ] } } // Read these indexes from an SseArray and build an array or partially filled simd vectors. // Takes a name of a vector to read from, and a list of indexes to read. // This statement: // ``` // let values = read_partial1_complex_to_array!(input, {0, 1, 2, 3}); // ``` // is equivalent to: // ``` // let values = [ // input.load1_complex(0), // input.load1_complex(1), // input.load1_complex(2), // input.load1_complex(3), // ]; // ``` macro_rules! read_partial1_complex_to_array { ($input:ident, { $($idx:literal),* }) => { [ $( $input.load1_complex($idx), )* ] } } // Write these indexes of an array of simd vectors to the same indexes of an SseArray. // Takes a name of a vector to read from, one to write to, and a list of indexes. // This statement: // ``` // let values = write_complex_to_array!(input, output, {0, 1, 2, 3}); // ``` // is equivalent to: // ``` // let values = [ // output.store_complex(input[0], 0), // output.store_complex(input[1], 1), // output.store_complex(input[2], 2), // output.store_complex(input[3], 3), // ]; // ``` macro_rules! write_complex_to_array { ($input:ident, $output:ident, { $($idx:literal),* }) => { $( $output.store_complex($input[$idx], $idx); )* } } // Write the low half of these indexes of an array of simd vectors to the same indexes of an SseArray. // Takes a name of a vector to read from, one to write to, and a list of indexes. // This statement: // ``` // let values = write_partial_lo_complex_to_array!(input, output, {0, 1, 2, 3}); // ``` // is equivalent to: // ``` // let values = [ // output.store_partial_lo_complex(input[0], 0), // output.store_partial_lo_complex(input[1], 1), // output.store_partial_lo_complex(input[2], 2), // output.store_partial_lo_complex(input[3], 3), // ]; // ``` macro_rules! write_partial_lo_complex_to_array { ($input:ident, $output:ident, { $($idx:literal),* }) => { $( $output.store_partial_lo_complex($input[$idx], $idx); )* } } // Write these indexes of an array of simd vectors to the same indexes, multiplied by a stride, of an SseArray. // Takes a name of a vector to read from, one to write to, an integer stride, and a list of indexes. // This statement: // ``` // let values = write_complex_to_array_separate!(input, output, {0, 1, 2, 3}); // ``` // is equivalent to: // ``` // let values = [ // output.store_complex(input[0], 0), // output.store_complex(input[1], 2), // output.store_complex(input[2], 4), // output.store_complex(input[3], 6), // ]; // ``` macro_rules! write_complex_to_array_strided { ($input:ident, $output:ident, $stride:literal, { $($idx:literal),* }) => { $( $output.store_complex($input[$idx], $idx*$stride); )* } } // A trait to hold the BVectorType and COMPLEX_PER_VECTOR associated data pub trait SseNum { type VectorType; const COMPLEX_PER_VECTOR: usize; } impl SseNum for f32 { type VectorType = __m128; const COMPLEX_PER_VECTOR: usize = 2; } impl SseNum for f64 { type VectorType = __m128d; const COMPLEX_PER_VECTOR: usize = 1; } // A trait to handle reading from an array of complex floats into SSE vectors. // SSE works with 128-bit vectors, meaning a vector can hold two complex f32, // or a single complex f64. pub trait SseArray: Deref { // Load complex numbers from the array to fill a SSE vector. unsafe fn load_complex(&self, index: usize) -> T::VectorType; // Load a single complex number from the array into a SSE vector, setting the unused elements to zero. 
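// For f32 this fills the low two lanes with the value's re/im and zeroes the high lanes;
// the f64 implementations panic via unimplemented!(), since a __m128d already holds exactly
// one complex f64 and no partial load is possible.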
unsafe fn load_partial1_complex(&self, index: usize) -> T::VectorType; // Load a single complex number from the array, and copy it to all elements of a SSE vector. unsafe fn load1_complex(&self, index: usize) -> T::VectorType; } impl SseArray for &[Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); _mm_loadu_ps(self.as_ptr().add(index) as *const f32) } #[inline(always)] unsafe fn load_partial1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); _mm_castpd_ps(_mm_load_sd(self.as_ptr().add(index) as *const f64)) } #[inline(always)] unsafe fn load1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); _mm_castpd_ps(_mm_load1_pd(self.as_ptr().add(index) as *const f64)) } } impl SseArray for &mut [Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); _mm_loadu_ps(self.as_ptr().add(index) as *const f32) } #[inline(always)] unsafe fn load_partial1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); _mm_castpd_ps(_mm_load_sd(self.as_ptr().add(index) as *const f64)) } #[inline(always)] unsafe fn load1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); _mm_castpd_ps(_mm_load1_pd(self.as_ptr().add(index) as *const f64)) } } impl SseArray for &[Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); _mm_loadu_pd(self.as_ptr().add(index) as *const f64) } #[inline(always)] unsafe fn load_partial1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } #[inline(always)] unsafe fn load1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } } impl SseArray for &mut [Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); _mm_loadu_pd(self.as_ptr().add(index) as *const f64) } #[inline(always)] unsafe fn load_partial1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } #[inline(always)] unsafe fn load1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } } impl<'a, T: SseNum> SseArray for DoubleBuf<'a, T> where &'a [Complex]: SseArray, { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> T::VectorType { self.input.load_complex(index) } #[inline(always)] unsafe fn load_partial1_complex(&self, index: usize) -> T::VectorType { self.input.load_partial1_complex(index) } #[inline(always)] unsafe fn load1_complex(&self, index: usize) -> T::VectorType { self.input.load1_complex(index) } } // A trait to handle writing to an array of complex floats from SSE vectors. // SSE works with 128-bit vectors, meaning a vector can hold two complex f32, // or a single complex f64. pub trait SseArrayMut: SseArray + DerefMut { // Store all complex numbers from a SSE vector to the array. unsafe fn store_complex(&mut self, vector: T::VectorType, index: usize); // Store the low complex number from a SSE vector to the array. 
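// For f32 this writes only the first complex value of the vector back to memory; as with
// the partial loads, the f64 implementations panic via unimplemented!(), since a __m128d
// holds exactly one complex f64 and no partial store is possible.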
unsafe fn store_partial_lo_complex(&mut self, vector: T::VectorType, index: usize); // Store the high complex number from a SSE vector to the array. unsafe fn store_partial_hi_complex(&mut self, vector: T::VectorType, index: usize); } impl SseArrayMut for &mut [Complex] { #[inline(always)] unsafe fn store_complex(&mut self, vector: ::VectorType, index: usize) { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); _mm_storeu_ps(self.as_mut_ptr().add(index) as *mut f32, vector); } #[inline(always)] unsafe fn store_partial_hi_complex( &mut self, vector: ::VectorType, index: usize, ) { debug_assert!(self.len() >= index + 1); _mm_storeh_pd( self.as_mut_ptr().add(index) as *mut f64, _mm_castps_pd(vector), ); } #[inline(always)] unsafe fn store_partial_lo_complex( &mut self, vector: ::VectorType, index: usize, ) { debug_assert!(self.len() >= index + 1); _mm_storel_pd( self.as_mut_ptr().add(index) as *mut f64, _mm_castps_pd(vector), ); } } impl SseArrayMut for &mut [Complex] { #[inline(always)] unsafe fn store_complex(&mut self, vector: ::VectorType, index: usize) { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); _mm_storeu_pd(self.as_mut_ptr().add(index) as *mut f64, vector); } #[inline(always)] unsafe fn store_partial_hi_complex( &mut self, _vector: ::VectorType, _index: usize, ) { unimplemented!("Impossible to do a partial store of complex f64's"); } #[inline(always)] unsafe fn store_partial_lo_complex( &mut self, _vector: ::VectorType, _index: usize, ) { unimplemented!("Impossible to do a partial store of complex f64's"); } } impl<'a, T: SseNum> SseArrayMut for DoubleBuf<'a, T> where Self: SseArray, &'a mut [Complex]: SseArrayMut, { #[inline(always)] unsafe fn store_complex(&mut self, vector: T::VectorType, index: usize) { self.output.store_complex(vector, index); } #[inline(always)] unsafe fn store_partial_lo_complex(&mut self, vector: T::VectorType, index: usize) { self.output.store_partial_lo_complex(vector, index); } #[inline(always)] unsafe fn store_partial_hi_complex(&mut self, vector: T::VectorType, index: usize) { self.output.store_partial_hi_complex(vector, index); } } rustfft-6.2.0/src/test_utils.rs000064400000000000000000000153540072674642500147220ustar 00000000000000use num_complex::Complex; use num_traits::{Float, One, Zero}; use rand::distributions::{uniform::SampleUniform, Distribution, Uniform}; use rand::{rngs::StdRng, SeedableRng}; use crate::{algorithm::Dft, Direction, FftNum, Length}; use crate::{Fft, FftDirection}; /// The seed for the random number generator used to generate /// random signals. 
It's defined here so that we have deterministic /// tests const RNG_SEED: [u8; 32] = [ 1, 9, 1, 0, 1, 1, 4, 3, 1, 4, 9, 8, 4, 1, 4, 8, 2, 8, 1, 2, 2, 2, 6, 1, 2, 3, 4, 5, 6, 7, 8, 9, ]; pub fn random_signal(length: usize) -> Vec> { let mut sig = Vec::with_capacity(length); let normal_dist: Uniform = Uniform::new(T::zero(), T::from_f32(10.0).unwrap()); let mut rng: StdRng = SeedableRng::from_seed(RNG_SEED); for _ in 0..length { sig.push(Complex { re: normal_dist.sample(&mut rng), im: normal_dist.sample(&mut rng), }); } return sig; } pub fn compare_vectors(vec1: &[Complex], vec2: &[Complex]) -> bool { assert_eq!(vec1.len(), vec2.len()); let mut error = T::zero(); for (&a, &b) in vec1.iter().zip(vec2.iter()) { error = error + (a - b).norm(); } return (error.to_f64().unwrap() / vec1.len() as f64) < 0.1f64; } #[allow(unused)] fn transppose_diagnostic(expected: &[Complex], actual: &[Complex]) { for (i, (&e, &a)) in expected.iter().zip(actual.iter()).enumerate() { if (e - a).norm().to_f32().unwrap() > 0.01 { if let Some(found_index) = expected .iter() .position(|&ev| (ev - a).norm().to_f32().unwrap() < 0.01) { println!("{} incorrectly contained {}", i, found_index); } else { println!("{} X", i); } } } } pub fn check_fft_algorithm( fft: &dyn Fft, len: usize, direction: FftDirection, ) { assert_eq!( fft.len(), len, "Algorithm reported incorrect size. Expected {}, got {}", len, fft.len() ); assert_eq!( fft.fft_direction(), direction, "Algorithm reported incorrect FFT direction" ); let n = 3; //test the forward direction let dft = Dft::new(len, direction); let dirty_scratch_value = Complex::one() * T::from_i32(100).unwrap(); // set up buffers let reference_input = random_signal(len * n); let mut expected_output = reference_input.clone(); let mut dft_scratch = vec![Zero::zero(); dft.get_inplace_scratch_len()]; dft.process_with_scratch(&mut expected_output, &mut dft_scratch); // test process() { let mut buffer = reference_input.clone(); fft.process(&mut buffer); if !compare_vectors(&expected_output, &buffer) { dbg!(&expected_output); dbg!(&buffer); panic!( "process() failed, length = {}, direction = {}", len, direction ); } } // test process_with_scratch() { let mut buffer = reference_input.clone(); let mut scratch = vec![Zero::zero(); fft.get_inplace_scratch_len()]; fft.process_with_scratch(&mut buffer, &mut scratch); assert!( compare_vectors(&expected_output, &buffer), "process_with_scratch() failed, length = {}, direction = {}", len, direction ); // make sure this algorithm works correctly with dirty scratch if scratch.len() > 0 { for item in scratch.iter_mut() { *item = dirty_scratch_value; } buffer.copy_from_slice(&reference_input); fft.process_with_scratch(&mut buffer, &mut scratch); assert!(compare_vectors(&expected_output, &buffer), "process_with_scratch() failed the 'dirty scratch' test, length = {}, direction = {}", len, direction); } } // test process_outofplace_with_scratch() { let mut input = reference_input.clone(); let mut scratch = vec![Zero::zero(); fft.get_outofplace_scratch_len()]; let mut output = vec![Zero::zero(); n * len]; fft.process_outofplace_with_scratch(&mut input, &mut output, &mut scratch); assert!( compare_vectors(&expected_output, &output), "process_outofplace_with_scratch() failed, length = {}, direction = {}", len, direction ); // make sure this algorithm works correctly with dirty scratch if scratch.len() > 0 { for item in scratch.iter_mut() { *item = dirty_scratch_value; } input.copy_from_slice(&reference_input); fft.process_outofplace_with_scratch(&mut input, &mut 
output, &mut scratch); assert!( compare_vectors(&expected_output, &output), "process_outofplace_with_scratch() failed the 'dirty scratch' test, length = {}, direction = {}", len, direction ); } } } // A fake FFT algorithm that requests much more scratch than it needs. You can use this as an inner FFT to other algorithms to test their scratch-supplying logic #[derive(Debug)] pub struct BigScratchAlgorithm { pub len: usize, pub inplace_scratch: usize, pub outofplace_scratch: usize, pub direction: FftDirection, } impl Fft for BigScratchAlgorithm { fn process_with_scratch(&self, _buffer: &mut [Complex], scratch: &mut [Complex]) { assert!( scratch.len() >= self.inplace_scratch, "Not enough inplace scratch provided, self={:?}, provided scratch={}", &self, scratch.len() ); } fn process_outofplace_with_scratch( &self, _input: &mut [Complex], _output: &mut [Complex], scratch: &mut [Complex], ) { assert!( scratch.len() >= self.outofplace_scratch, "Not enough OOP scratch provided, self={:?}, provided scratch={}", &self, scratch.len() ); } fn get_inplace_scratch_len(&self) -> usize { self.inplace_scratch } fn get_outofplace_scratch_len(&self) -> usize { self.outofplace_scratch } } impl Length for BigScratchAlgorithm { fn len(&self) -> usize { self.len } } impl Direction for BigScratchAlgorithm { fn fft_direction(&self) -> FftDirection { self.direction } } rustfft-6.2.0/src/twiddles.rs000064400000000000000000000076150072674642500143430ustar 00000000000000use crate::{common::FftNum, FftDirection}; use num_complex::Complex; use strength_reduce::{StrengthReducedU128, StrengthReducedU64}; pub fn compute_twiddle( index: usize, fft_len: usize, direction: FftDirection, ) -> Complex { let constant = -2f64 * std::f64::consts::PI / fft_len as f64; let angle = constant * index as f64; let result = Complex { re: T::from_f64(angle.cos()).unwrap(), im: T::from_f64(angle.sin()).unwrap(), }; match direction { FftDirection::Forward => result, FftDirection::Inverse => result.conj(), } } pub fn fill_bluesteins_twiddles( destination: &mut [Complex], direction: FftDirection, ) { let twice_len = destination.len() * 2; // Standard bluestein's twiddle computation requires us to square the index before usingit to compute a twiddle factor // And since twiddle factors are cyclic, we can improve precision once the squared index gets converted to floating point by taking a modulo // Modulo is expensive, so we're going to use strength-reduction to keep it manageable // Strength-reduced u128s are very heavy, so we only want to use them if we need them - and we only need them if // len * len doesn't fit in a u64, AKA if len doesn't fit in a u32 if destination.len() < std::u32::MAX as usize { let twice_len_reduced = StrengthReducedU64::new(twice_len as u64); for (i, e) in destination.iter_mut().enumerate() { let i_squared = i as u64 * i as u64; let i_mod = i_squared % twice_len_reduced; *e = compute_twiddle(i_mod as usize, twice_len, direction); } } else { // Sadly, the len doesn't fit in a u64, so we have to crank it up to u128 arithmetic let twice_len_reduced = StrengthReducedU128::new(twice_len as u128); for (i, e) in destination.iter_mut().enumerate() { // Standard bluestein's twiddle computation requires us to square the index before usingit to compute a twiddle factor // And since twiddle factors are cyclic, we can improve precision once the squared index gets converted to floating point by taking a modulo let i_squared = i as u128 * i as u128; let i_mod = i_squared % twice_len_reduced; *e = compute_twiddle(i_mod as usize, 
twice_len, direction); } } } pub fn rotate_90(value: Complex, direction: FftDirection) -> Complex { match direction { FftDirection::Forward => Complex { re: value.im, im: -value.re, }, FftDirection::Inverse => Complex { re: -value.im, im: value.re, }, } } #[cfg(test)] mod unit_tests { use super::*; #[test] fn test_rotate() { // Verify that the rotate90 function does the same thing as multiplying by twiddle(1,4), in the forward direction let value = Complex { re: 9.1, im: 2.2 }; let rotated_forward = rotate_90(value, FftDirection::Forward); let twiddled_forward = value * compute_twiddle(1, 4, FftDirection::Forward); assert_eq!(value.re, -rotated_forward.im); assert_eq!(value.im, rotated_forward.re); assert!(value.re + twiddled_forward.im < 0.0001); assert!(value.im - rotated_forward.re < 0.0001); // Verify that the rotate90 function does the same thing as multiplying by twiddle(1,4), in the inverse direction let rotated_forward = rotate_90(value, FftDirection::Inverse); let twiddled_forward = value * compute_twiddle(1, 4, FftDirection::Inverse); assert_eq!(value.re, rotated_forward.im); assert_eq!(value.im, -rotated_forward.re); assert!(value.re - twiddled_forward.im < 0.0001); assert!(value.im + rotated_forward.re < 0.0001); } } rustfft-6.2.0/src/wasm_simd/mod.rs000064400000000000000000000005570072674642500152640ustar 00000000000000#[macro_use] mod wasm_simd_common; #[macro_use] mod wasm_simd_vector; #[macro_use] pub mod wasm_simd_butterflies; pub mod wasm_simd_prime_butterflies; pub mod wasm_simd_radix4; mod wasm_simd_utils; pub mod wasm_simd_planner; pub use self::wasm_simd_butterflies::*; pub use self::wasm_simd_prime_butterflies::*; pub use self::wasm_simd_radix4::*; rustfft-6.2.0/src/wasm_simd/wasm_simd_butterflies.rs000064400000000000000000003756420072674642500211120ustar 00000000000000use core::arch::wasm32::*; use num_complex::Complex; use crate::{common::FftNum, FftDirection}; use crate::array_utils; use crate::array_utils::workaround_transmute_mut; use crate::array_utils::DoubleBuf; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::twiddles; use crate::{Direction, Fft, Length}; use super::wasm_simd_common::{assert_f32, assert_f64}; use super::wasm_simd_utils::*; use super::wasm_simd_vector::WasmSimdArrayMut; #[allow(unused)] macro_rules! 
boilerplate_fft_wasm_simd_f32_butterfly { ($struct_name:ident, $len:expr, $direction_fn:expr) => { impl $struct_name { #[target_feature(enable = "simd128")] pub(crate) unsafe fn perform_fft_butterfly(&self, buffer: &mut [Complex]) { self.perform_fft_contiguous(workaround_transmute_mut::<_, Complex>(buffer)); } #[target_feature(enable = "simd128")] pub(crate) unsafe fn perform_parallel_fft_butterfly(&self, buffer: &mut [Complex]) { self.perform_parallel_fft_contiguous(workaround_transmute_mut::<_, Complex>( buffer, )); } // Do multiple ffts over a longer vector inplace, called from "process_with_scratch" of Fft trait #[target_feature(enable = "simd128")] pub(crate) unsafe fn perform_fft_butterfly_multi( &self, buffer: &mut [Complex], ) -> Result<(), ()> { let len = buffer.len(); let alldone = array_utils::iter_chunks(buffer, 2 * self.len(), |chunk| { self.perform_parallel_fft_butterfly(chunk) }); if alldone.is_err() && buffer.len() >= self.len() { self.perform_fft_butterfly(&mut buffer[len - self.len()..]); } Ok(()) } // Do multiple ffts over a longer vector outofplace, called from "process_outofplace_with_scratch" of Fft trait #[target_feature(enable = "simd128")] pub(crate) unsafe fn perform_oop_fft_butterfly_multi( &self, input: &mut [Complex], output: &mut [Complex], ) -> Result<(), ()> { let len = input.len(); let alldone = array_utils::iter_chunks_zipped( input, output, 2 * self.len(), |in_chunk, out_chunk| { let input_slice = workaround_transmute_mut(in_chunk); let output_slice = workaround_transmute_mut(out_chunk); self.perform_parallel_fft_contiguous(DoubleBuf { input: input_slice, output: output_slice, }) }, ); if alldone.is_err() && input.len() >= self.len() { let input_slice = workaround_transmute_mut(input); let output_slice = workaround_transmute_mut(output); self.perform_fft_contiguous(DoubleBuf { input: &mut input_slice[len - self.len()..], output: &mut output_slice[len - self.len()..], }) } Ok(()) } } }; } macro_rules! boilerplate_fft_wasm_simd_f64_butterfly { ($struct_name:ident, $len:expr, $direction_fn:expr) => { impl $struct_name { // Do a single fft #[target_feature(enable = "simd128")] pub(crate) unsafe fn perform_fft_butterfly(&self, buffer: &mut [Complex]) { self.perform_fft_contiguous(workaround_transmute_mut::<_, Complex>(buffer)); } // Do multiple ffts over a longer vector inplace, called from "process_with_scratch" of Fft trait #[target_feature(enable = "simd128")] pub(crate) unsafe fn perform_fft_butterfly_multi( &self, buffer: &mut [Complex], ) -> Result<(), ()> { array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_butterfly(chunk) }) } // Do multiple ffts over a longer vector outofplace, called from "process_outofplace_with_scratch" of Fft trait #[target_feature(enable = "simd128")] pub(crate) unsafe fn perform_oop_fft_butterfly_multi( &self, input: &mut [Complex], output: &mut [Complex], ) -> Result<(), ()> { array_utils::iter_chunks_zipped(input, output, self.len(), |in_chunk, out_chunk| { let input_slice = workaround_transmute_mut(in_chunk); let output_slice = workaround_transmute_mut(out_chunk); self.perform_fft_contiguous(DoubleBuf { input: input_slice, output: output_slice, }) }) } } }; } #[allow(unused)] macro_rules! 
boilerplate_fft_wasm_simd_common_butterfly { ($struct_name:ident, $len:expr, $direction_fn:expr) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let result = unsafe { self.perform_oop_fft_butterfly_multi(input, output) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], _scratch: &mut [Complex]) { if buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let result = unsafe { self.perform_fft_butterfly_multi(buffer) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace(self.len(), buffer.len(), 0, 0); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { 0 } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len } } impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { $direction_fn(self) } } }; } // _ _________ _ _ _ // / | |___ /___ \| |__ (_) |_ // | | _____ |_ \ __) | '_ \| | __| // | | |_____| ___) / __/| |_) | | |_ // |_| |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly1 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly1, 1, |this: &WasmSimdF32Butterfly1<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly1, 1, |this: &WasmSimdF32Butterfly1<_>| this.direction ); impl WasmSimdF32Butterfly1 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let value = buffer.load_partial1_complex(0); buffer.store_partial_lo_complex(value, 0); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let value = buffer.load_complex(0); buffer.store_complex(value, 0); } } // _ __ _ _ _ _ _ // / | / /_ | || | | |__ (_) |_ // | | _____ | '_ \| || |_| '_ \| | __| // | | |_____| | (_) |__ _| |_) | | |_ // |_| \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly1 { direction: FftDirection, _phantom: std::marker::PhantomData, } 
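// --- Added exposition, not part of the upstream crate ---
// The boilerplate macros above wire every butterfly in this file into the public `Fft`
// trait, so user code never calls `perform_fft_*` directly. The sketch below is a
// hypothetical sanity check showing that usage path; it assumes the generic parameter
// layout used throughout this file (`WasmSimdF32Butterfly2::<f32>`, defined further down)
// and the provided `Fft::process` method from the crate root.
#[cfg(test)]
mod usage_sketch {
    use super::WasmSimdF32Butterfly2;
    use crate::{Fft, FftDirection};
    use num_complex::Complex;

    #[test]
    fn size2_butterfly_matches_the_dft_definition() {
        // For a length-2 FFT: X0 = x0 + x1 and X1 = x0 - x1.
        let fft = WasmSimdF32Butterfly2::<f32>::new(FftDirection::Forward);
        let mut buffer = vec![
            Complex { re: 1.0f32, im: 0.0 },
            Complex { re: 2.0f32, im: 0.0 },
        ];
        fft.process(&mut buffer);
        assert!((buffer[0].re - 3.0).abs() < 1e-6);
        assert!((buffer[1].re + 1.0).abs() < 1e-6);
    }
}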
boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly1, 1, |this: &WasmSimdF64Butterfly1<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly1, 1, |this: &WasmSimdF64Butterfly1<_>| this.direction ); impl WasmSimdF64Butterfly1 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let value = buffer.load_complex(0); buffer.store_complex(value, 0); } } // ____ _________ _ _ _ // |___ \ |___ /___ \| |__ (_) |_ // __) | _____ |_ \ __) | '_ \| | __| // / __/ |_____| ___) / __/| |_) | | |_ // |_____| |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly2 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly2, 2, |this: &WasmSimdF32Butterfly2<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly2, 2, |this: &WasmSimdF32Butterfly2<_>| this.direction ); impl WasmSimdF32Butterfly2 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = buffer.load_complex(0); let temp = self.perform_fft_direct(values); buffer.store_complex(temp, 0); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let values_a = buffer.load_complex(0); let values_b = buffer.load_complex(2); let out = self.perform_parallel_fft_direct(values_a, values_b); let [out02, out13] = transpose_complex_2x2_f32(out[0], out[1]); buffer.store_complex(out02, 0); buffer.store_complex(out13, 2); } // length 2 fft of x, given as [x0, x1] // result is [X0, X1] #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: v128) -> v128 { solo_fft2_f32(values) } // dual length 2 fft of x and y, given as [x0, x1], [y0, y1] // result is [X0, Y0], [X1, Y1] #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, values_x: v128, values_y: v128, ) -> [v128; 2] { parallel_fft2_contiguous_f32(values_x, values_y) } } // double lenth 2 fft of a and b, given as [x0, y0], [x1, y1] // result is [X0, Y0], [X1, Y1] #[inline(always)] pub(crate) unsafe fn parallel_fft2_interleaved_f32(val02: v128, val13: v128) -> [v128; 2] { let temp0 = f32x4_add(val02, val13); let temp1 = f32x4_sub(val02, val13); [temp0, temp1] } // double lenth 2 fft of a and b, given as [x0, x1], [y0, y1] // result is [X0, Y0], [X1, Y1] #[inline(always)] unsafe fn parallel_fft2_contiguous_f32(left: v128, right: v128) -> [v128; 2] { let [temp02, temp13] = transpose_complex_2x2_f32(left, right); parallel_fft2_interleaved_f32(temp02, temp13) } // length 2 fft of x, given as [x0, x1] // result is [X0, X1] #[inline(always)] unsafe fn solo_fft2_f32(values: v128) -> v128 { let high = u64x2_shuffle::<0, 0>(values, values); let low = u64x2_shuffle::<1, 1>(values, values); let low = f32x4_mul(low, f32x4(1.0, 1.0, -1.0, -1.0)); f32x4_add(high, low) } // ____ __ _ _ _ _ _ // |___ \ / /_ | || | | |__ (_) |_ // __) | _____ | '_ \| || |_| '_ \| | __| // / __/ |_____| | (_) |__ _| |_) | | |_ // |_____| \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly2 { direction: FftDirection, _phantom: std::marker::PhantomData, } boilerplate_fft_wasm_simd_f64_butterfly!( 
WasmSimdF64Butterfly2, 2, |this: &WasmSimdF64Butterfly2<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly2, 2, |this: &WasmSimdF64Butterfly2<_>| this.direction ); impl WasmSimdF64Butterfly2 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); Self { direction, _phantom: std::marker::PhantomData, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let out = self.perform_fft_direct(value0, value1); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, value0: v128, value1: v128) -> [v128; 2] { solo_fft2_f64(value0, value1) } } #[inline(always)] pub(crate) unsafe fn solo_fft2_f64(left: v128, right: v128) -> [v128; 2] { let temp0 = f64x2_add(left, right); let temp1 = f64x2_sub(left, right); [temp0, temp1] } // _____ _________ _ _ _ // |___ / |___ /___ \| |__ (_) |_ // |_ \ _____ |_ \ __) | '_ \| | __| // ___) | |_____| ___) / __/| |_) | | |_ // |____/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly3 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle: v128, twiddle1re: v128, twiddle1im: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly3, 3, |this: &WasmSimdF32Butterfly3<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly3, 3, |this: &WasmSimdF32Butterfly3<_>| this.direction ); impl WasmSimdF32Butterfly3 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 3, direction); let twiddle = f32x4(tw1.re, tw1.re, -tw1.im, -tw1.im); let twiddle1re = f32x4_splat(tw1.re); let twiddle1im = f32x4_splat(tw1.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle, twiddle1re, twiddle1im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let value0x = buffer.load_partial1_complex(0); let value12 = buffer.load_complex(1); let out = self.perform_fft_direct(value0x, value12); buffer.store_partial_lo_complex(out[0], 0); buffer.store_complex(out[1], 1); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let valuea0a1 = buffer.load_complex(0); let valuea2b0 = buffer.load_complex(2); let valueb1b2 = buffer.load_complex(4); let value0 = extract_lo_hi_f32(valuea0a1, valuea2b0); let value1 = extract_hi_lo_f32(valuea0a1, valueb1b2); let value2 = extract_lo_hi_f32(valuea2b0, valueb1b2); let out = self.perform_parallel_fft_direct(value0, value1, value2); let out0 = extract_lo_lo_f32(out[0], out[1]); let out1 = extract_lo_hi_f32(out[2], out[0]); let out2 = extract_hi_hi_f32(out[1], out[2]); buffer.store_complex(out0, 0); buffer.store_complex(out1, 2); buffer.store_complex(out2, 4); } // length 3 fft of a, given as [x0, 0.0], [x1, x2] // result is [X0, Z], [X1, X2] // The value Z should be discarded. 
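    // --- Added exposition, not part of the upstream crate ---
    // A scalar sketch of the identity the SIMD code below implements. With
    // w = twiddle(1, 3, direction), the second twiddle w^2 equals conj(w), so the
    // length-3 DFT collapses to a single twiddle constant:
    //     X0 = x0 + (x1 + x2)
    //     X1 = x0 + w.re * (x1 + x2) + i * w.im * (x1 - x2)
    //     X2 = x0 + w.re * (x1 + x2) - i * w.im * (x1 - x2)
    // The "i * w.im * (x1 - x2)" term is what the Rotate90 helper and the twiddle
    // vectors compute; the packed f32 path evaluates the same identity on SIMD lanes.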
#[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, value0x: v128, value12: v128) -> [v128; 2] { // This is a WasmSimd translation of the scalar 3-point butterfly let rev12 = reverse_complex_and_negate_hi_f32(value12); let temp12pn = self.rotate.rotate_hi(f32x4_add(value12, rev12)); let twiddled = f32x4_mul(temp12pn, self.twiddle); let temp = f32x4_add(value0x, twiddled); let out12 = solo_fft2_f32(temp); let out0x = f32x4_add(value0x, temp12pn); [out0x, out12] } // length 3 dual fft of a, given as (x0, y0), (x1, y1), (x2, y2). // result is [(X0, Y0), (X1, Y1), (X2, Y2)] #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, value0: v128, value1: v128, value2: v128, ) -> [v128; 3] { // This is a WasmSimd translation of the scalar 3-point butterfly let x12p = f32x4_add(value1, value2); let x12n = f32x4_sub(value1, value2); let sum = f32x4_add(value0, x12p); let temp_a = f32x4_mul(self.twiddle1re, x12p); let temp_a = f32x4_add(temp_a, value0); let n_rot = self.rotate.rotate_both(x12n); let temp_b = f32x4_mul(self.twiddle1im, n_rot); let x1 = f32x4_add(temp_a, temp_b); let x2 = f32x4_sub(temp_a, temp_b); [sum, x1, x2] } } // _____ __ _ _ _ _ _ // |___ / / /_ | || | | |__ (_) |_ // |_ \ _____ | '_ \| || |_| '_ \| | __| // ___) | |_____| | (_) |__ _| |_) | | |_ // |____/ \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly3 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: v128, twiddle1im: v128, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly3, 3, |this: &WasmSimdF64Butterfly3<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly3, 3, |this: &WasmSimdF64Butterfly3<_>| this.direction ); impl WasmSimdF64Butterfly3 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 3, direction); let twiddle1re = f64x2_splat(tw1.re); let twiddle1im = f64x2_splat(tw1.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let out = self.perform_fft_direct(value0, value1, value2); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); } // length 3 fft of x, given as x0, x1, x2. 
// result is [X0, X1, X2] #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: v128, value1: v128, value2: v128, ) -> [v128; 3] { // This is a WasmSimd translation of the scalar 3-point butterfly let x12p = f64x2_add(value1, value2); let x12n = f64x2_sub(value1, value2); let sum = f64x2_add(value0, x12p); // let temp_a = vfmaq_f64(value0, self.twiddle1re, x12p); let temp_a = f64x2_add(value0, f64x2_mul(self.twiddle1re, x12p)); let n_rot = self.rotate.rotate(x12n); let temp_b = f64x2_mul(self.twiddle1im, n_rot); let x1 = f64x2_add(temp_a, temp_b); let x2 = f64x2_sub(temp_a, temp_b); [sum, x1, x2] } } // _ _ _________ _ _ _ // | || | |___ /___ \| |__ (_) |_ // | || |_ _____ |_ \ __) | '_ \| | __| // |__ _| |_____| ___) / __/| |_) | | |_ // |_| |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly4 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly4, 4, |this: &WasmSimdF32Butterfly4<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly4, 4, |this: &WasmSimdF32Butterfly4<_>| this.direction ); impl WasmSimdF32Butterfly4 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; Self { direction, _phantom: std::marker::PhantomData, rotate, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let value01 = buffer.load_complex(0); let value23 = buffer.load_complex(2); let out = self.perform_fft_direct(value01, value23); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 2); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let value01a = buffer.load_complex(0); let value23a = buffer.load_complex(2); let value01b = buffer.load_complex(4); let value23b = buffer.load_complex(6); let [value0ab, value1ab] = transpose_complex_2x2_f32(value01a, value01b); let [value2ab, value3ab] = transpose_complex_2x2_f32(value23a, value23b); let out = self.perform_parallel_fft_direct(value0ab, value1ab, value2ab, value3ab); let [out0, out1] = transpose_complex_2x2_f32(out[0], out[1]); let [out2, out3] = transpose_complex_2x2_f32(out[2], out[3]); buffer.store_complex(out0, 0); buffer.store_complex(out1, 4); buffer.store_complex(out2, 2); buffer.store_complex(out3, 6); } // length 4 fft of a, given as [x0, x1], [x2, x3] // result is [[X0, X1], [X2, X3]] #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, value01: v128, value23: v128) -> [v128; 2] { //we're going to hardcode a step of mixed radix //aka we're going to do the six step algorithm // step 1: transpose // and // step 2: column FFTs let mut temp = parallel_fft2_interleaved_f32(value01, value23); // step 3: apply twiddle factors (only one in this case, and it's either 0 + i or 0 - i) temp[1] = self.rotate.rotate_hi(temp[1]); // step 4: transpose, which we're skipping because we're the previous FFTs were non-contiguous // step 5: row FFTs // and // step 6: transpose by swapping index 1 and 2 parallel_fft2_contiguous_f32(temp[0], temp[1]) } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, values0: v128, values1: v128, values2: v128, values3: v128, ) -> [v128; 4] { //we're going to hardcode a step of mixed radix //aka we're going to do the six step algorithm // step 1: transpose // and // step 2: column 
FFTs let temp0 = parallel_fft2_interleaved_f32(values0, values2); let mut temp1 = parallel_fft2_interleaved_f32(values1, values3); // step 3: apply twiddle factors (only one in this case, and it's either 0 + i or 0 - i) temp1[1] = self.rotate.rotate_both(temp1[1]); // step 4: transpose, which we're skipping because we're the previous FFTs were non-contiguous // step 5: row FFTs let out0 = parallel_fft2_interleaved_f32(temp0[0], temp1[0]); let out2 = parallel_fft2_interleaved_f32(temp0[1], temp1[1]); // step 6: transpose by swapping index 1 and 2 [out0[0], out2[0], out0[1], out2[1]] } } // _ _ __ _ _ _ _ _ // | || | / /_ | || | | |__ (_) |_ // | || |_ _____ | '_ \| || |_| '_ \| | __| // |__ _| |_____| | (_) |__ _| |_) | | |_ // |_| \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly4 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly4, 4, |this: &WasmSimdF64Butterfly4<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly4, 4, |this: &WasmSimdF64Butterfly4<_>| this.direction ); impl WasmSimdF64Butterfly4 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; Self { direction, _phantom: std::marker::PhantomData, rotate, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let value3 = buffer.load_complex(3); let out = self.perform_fft_direct(value0, value1, value2, value3); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); buffer.store_complex(out[3], 3); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: v128, value1: v128, value2: v128, value3: v128, ) -> [v128; 4] { //we're going to hardcode a step of mixed radix //aka we're going to do the six step algorithm // step 1: transpose // and // step 2: column FFTs let temp0 = solo_fft2_f64(value0, value2); let mut temp1 = solo_fft2_f64(value1, value3); // step 3: apply twiddle factors (only one in this case, and it's either 0 + i or 0 - i) temp1[1] = self.rotate.rotate(temp1[1]); // step 4: transpose, which we're skipping because we're the previous FFTs were non-contiguous // step 5: row FFTs let out0 = solo_fft2_f64(temp0[0], temp1[0]); let out2 = solo_fft2_f64(temp0[1], temp1[1]); // step 6: transpose by swapping index 1 and 2 [out0[0], out2[0], out0[1], out2[1]] } } // ____ _________ _ _ _ // | ___| |___ /___ \| |__ (_) |_ // |___ \ _____ |_ \ __) | '_ \| | __| // ___) | |_____| ___) / __/| |_) | | |_ // |____/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly5 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle12re: v128, twiddle21re: v128, twiddle12im: v128, twiddle21im: v128, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly5, 5, |this: &WasmSimdF32Butterfly5<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly5, 5, |this: &WasmSimdF32Butterfly5<_>| this.direction ); impl WasmSimdF32Butterfly5 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 5, 
direction); let tw2: Complex = twiddles::compute_twiddle(2, 5, direction); let twiddle12re = f32x4(tw1.re, tw1.re, tw2.re, tw2.re); let twiddle21re = f32x4(tw2.re, tw2.re, tw1.re, tw1.re); let twiddle12im = f32x4(tw1.im, tw1.im, tw2.im, tw2.im); let twiddle21im = f32x4(tw2.im, tw2.im, -tw1.im, -tw1.im); let twiddle1re = f32x4_splat(tw1.re); let twiddle1im = f32x4_splat(tw1.im); let twiddle2re = f32x4_splat(tw2.re); let twiddle2im = f32x4_splat(tw2.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle12re, twiddle21re, twiddle12im, twiddle21im, twiddle1re, twiddle1im, twiddle2re, twiddle2im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let value00 = buffer.load1_complex(0); let value12 = buffer.load_complex(1); let value34 = buffer.load_complex(3); let out = self.perform_fft_direct(value00, value12, value34); buffer.store_partial_lo_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 3); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4 ,6, 8}); let value0 = extract_lo_hi_f32(input_packed[0], input_packed[2]); let value1 = extract_hi_lo_f32(input_packed[0], input_packed[3]); let value2 = extract_lo_hi_f32(input_packed[1], input_packed[3]); let value3 = extract_hi_lo_f32(input_packed[1], input_packed[4]); let value4 = extract_lo_hi_f32(input_packed[2], input_packed[4]); let out = self.perform_parallel_fft_direct(value0, value1, value2, value3, value4); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_hi_f32(out[4], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4}); } // length 5 fft of a, given as [x0, x0], [x1, x2], [x3, x4]. // result is [[X0, Z], [X1, X2], [X3, X4]] // Note that Z should not be used. #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value00: v128, value12: v128, value34: v128, ) -> [v128; 3] { // This is a WasmSimd translation of the scalar 5-point butterfly let temp43 = reverse_complex_elements_f32(value34); let x1423p = f32x4_add(value12, temp43); let x1423n = f32x4_sub(value12, temp43); let x1414p = duplicate_lo_f32(x1423p); let x2323p = duplicate_hi_f32(x1423p); let x1414n = duplicate_lo_f32(x1423n); let x2323n = duplicate_hi_f32(x1423n); let temp_a1 = f32x4_mul(self.twiddle12re, x1414p); let temp_b1 = f32x4_mul(self.twiddle12im, x1414n); let temp_a = f32x4_add(temp_a1, f32x4_mul(self.twiddle21re, x2323p)); let temp_a = f32x4_add(value00, temp_a); let temp_b = f32x4_add(temp_b1, f32x4_mul(self.twiddle21im, x2323n)); let b_rot = self.rotate.rotate_both(temp_b); let x00 = f32x4_add(value00, f32x4_add(x1414p, x2323p)); let x12 = f32x4_add(temp_a, b_rot); let x34 = reverse_complex_elements_f32(f32x4_sub(temp_a, b_rot)); [x00, x12, x34] } // length 5 dual fft of x and y, given as (x0, y0), (x1, y1) ... (x4, y4). // result is [(X0, Y0), (X1, Y1) ... 
(X2, Y2)] #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, value0: v128, value1: v128, value2: v128, value3: v128, value4: v128, ) -> [v128; 5] { // This is a WasmSimd translation of the scalar 3-point butterfly let x14p = f32x4_add(value1, value4); let x14n = f32x4_sub(value1, value4); let x23p = f32x4_add(value2, value3); let x23n = f32x4_sub(value2, value3); let temp_a1_1 = f32x4_mul(self.twiddle1re, x14p); let temp_a1_2 = f32x4_mul(self.twiddle2re, x23p); let temp_b1_1 = f32x4_mul(self.twiddle1im, x14n); let temp_b1_2 = f32x4_mul(self.twiddle2im, x23n); let temp_a2_1 = f32x4_mul(self.twiddle1re, x23p); let temp_a2_2 = f32x4_mul(self.twiddle2re, x14p); let temp_b2_1 = f32x4_mul(self.twiddle2im, x14n); let temp_b2_2 = f32x4_mul(self.twiddle1im, x23n); let temp_a1 = f32x4_add(value0, f32x4_add(temp_a1_1, temp_a1_2)); let temp_b1 = f32x4_add(temp_b1_1, temp_b1_2); let temp_a2 = f32x4_add(value0, f32x4_add(temp_a2_1, temp_a2_2)); let temp_b2 = f32x4_sub(temp_b2_1, temp_b2_2); [ f32x4_add(value0, f32x4_add(x14p, x23p)), f32x4_add(temp_a1, self.rotate.rotate_both(temp_b1)), f32x4_add(temp_a2, self.rotate.rotate_both(temp_b2)), f32x4_sub(temp_a2, self.rotate.rotate_both(temp_b2)), f32x4_sub(temp_a1, self.rotate.rotate_both(temp_b1)), ] } } // ____ __ _ _ _ _ _ // | ___| / /_ | || | | |__ (_) |_ // |___ \ _____ | '_ \| || |_| '_ \| | __| // ___) | |_____| | (_) |__ _| |_) | | |_ // |____/ \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly5 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly5, 5, |this: &WasmSimdF64Butterfly5<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly5, 5, |this: &WasmSimdF64Butterfly5<_>| this.direction ); impl WasmSimdF64Butterfly5 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 5, direction); let tw2: Complex = twiddles::compute_twiddle(2, 5, direction); let twiddle1re = f64x2_splat(tw1.re); let twiddle1im = f64x2_splat(tw1.im); let twiddle2re = f64x2_splat(tw2.re); let twiddle2im = f64x2_splat(tw2.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let value3 = buffer.load_complex(3); let value4 = buffer.load_complex(4); let out = self.perform_fft_direct(value0, value1, value2, value3, value4); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); buffer.store_complex(out[3], 3); buffer.store_complex(out[4], 4); } // length 5 fft of x, given as x0, x1, x2, x3, x4. 
// result is [X0, X1, X2, X3, X4] #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: v128, value1: v128, value2: v128, value3: v128, value4: v128, ) -> [v128; 5] { // This is a WasmSimd translation of the scalar 5-point butterfly let x14p = f64x2_add(value1, value4); let x14n = f64x2_sub(value1, value4); let x23p = f64x2_add(value2, value3); let x23n = f64x2_sub(value2, value3); let temp_a1_1 = f64x2_mul(self.twiddle1re, x14p); let temp_a1_2 = f64x2_mul(self.twiddle2re, x23p); let temp_a2_1 = f64x2_mul(self.twiddle2re, x14p); let temp_a2_2 = f64x2_mul(self.twiddle1re, x23p); let temp_b1_1 = f64x2_mul(self.twiddle1im, x14n); let temp_b1_2 = f64x2_mul(self.twiddle2im, x23n); let temp_b2_1 = f64x2_mul(self.twiddle2im, x14n); let temp_b2_2 = f64x2_mul(self.twiddle1im, x23n); let temp_a1 = f64x2_add(value0, f64x2_add(temp_a1_1, temp_a1_2)); let temp_a2 = f64x2_add(value0, f64x2_add(temp_a2_1, temp_a2_2)); let temp_b1 = f64x2_add(temp_b1_1, temp_b1_2); let temp_b2 = f64x2_sub(temp_b2_1, temp_b2_2); let temp_b1_rot = self.rotate.rotate(temp_b1); let temp_b2_rot = self.rotate.rotate(temp_b2); [ f64x2_add(value0, f64x2_add(x14p, x23p)), f64x2_add(temp_a1, temp_b1_rot), f64x2_add(temp_a2, temp_b2_rot), f64x2_sub(temp_a2, temp_b2_rot), f64x2_sub(temp_a1, temp_b1_rot), ] } } // __ _________ _ _ _ // / /_ |___ /___ \| |__ (_) |_ // | '_ \ _____ |_ \ __) | '_ \| | __| // | (_) | |_____| ___) / __/| |_) | | |_ // \___/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly6 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: WasmSimdF32Butterfly3, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly6, 6, |this: &WasmSimdF32Butterfly6<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly6, 6, |this: &WasmSimdF32Butterfly6<_>| this.direction ); impl WasmSimdF32Butterfly6 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = WasmSimdF32Butterfly3::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let value01 = buffer.load_complex(0); let value23 = buffer.load_complex(2); let value45 = buffer.load_complex(4); let out = self.perform_fft_direct(value01, value23, value45); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 2); buffer.store_complex(out[2], 4); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10}); let values = interleave_complex_f32!(input_packed, 3, {0, 1, 2}); let out = self.perform_parallel_fft_direct( values[0], values[1], values[2], values[3], values[4], values[5], ); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0, 1, 2, 3, 4, 5}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value01: v128, value23: v128, value45: v128, ) -> [v128; 3] { // Algorithm: 3x2 good-thomas // Size-3 FFTs down the columns of our reordered array let reord0 = extract_lo_hi_f32(value01, value23); let reord1 = extract_lo_hi_f32(value23, value45); let reord2 = extract_lo_hi_f32(value45, value01); let mid = self.bf3.perform_parallel_fft_direct(reord0, reord1, reord2); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do 
size-2 FFTs down the columns let [output0, output1] = parallel_fft2_contiguous_f32(mid[0], mid[1]); let output2 = solo_fft2_f32(mid[2]); // Reorder into output [ extract_lo_hi_f32(output0, output1), extract_lo_lo_f32(output2, output1), extract_hi_hi_f32(output0, output2), ] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct( &self, value0: v128, value1: v128, value2: v128, value3: v128, value4: v128, value5: v128, ) -> [v128; 6] { // Algorithm: 3x2 good-thomas // Size-3 FFTs down the columns of our reordered array let mid0 = self.bf3.perform_parallel_fft_direct(value0, value2, value4); let mid1 = self.bf3.perform_parallel_fft_direct(value3, value5, value1); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = parallel_fft2_interleaved_f32(mid0[0], mid1[0]); let [output2, output3] = parallel_fft2_interleaved_f32(mid0[1], mid1[1]); let [output4, output5] = parallel_fft2_interleaved_f32(mid0[2], mid1[2]); // Reorder into output [output0, output3, output4, output1, output2, output5] } } // __ __ _ _ _ _ _ // / /_ / /_ | || | | |__ (_) |_ // | '_ \ _____ | '_ \| || |_| '_ \| | __| // | (_) | |_____| | (_) |__ _| |_) | | |_ // \___/ \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly6 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: WasmSimdF64Butterfly3, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly6, 6, |this: &WasmSimdF64Butterfly6<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly6, 6, |this: &WasmSimdF64Butterfly6<_>| this.direction ); impl WasmSimdF64Butterfly6 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = WasmSimdF64Butterfly3::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let value0 = buffer.load_complex(0); let value1 = buffer.load_complex(1); let value2 = buffer.load_complex(2); let value3 = buffer.load_complex(3); let value4 = buffer.load_complex(4); let value5 = buffer.load_complex(5); let out = self.perform_fft_direct(value0, value1, value2, value3, value4, value5); buffer.store_complex(out[0], 0); buffer.store_complex(out[1], 1); buffer.store_complex(out[2], 2); buffer.store_complex(out[3], 3); buffer.store_complex(out[4], 4); buffer.store_complex(out[5], 5); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct( &self, value0: v128, value1: v128, value2: v128, value3: v128, value4: v128, value5: v128, ) -> [v128; 6] { // Algorithm: 3x2 good-thomas // Size-3 FFTs down the columns of our reordered array let mid0 = self.bf3.perform_fft_direct(value0, value2, value4); let mid1 = self.bf3.perform_fft_direct(value3, value5, value1); // We normally would put twiddle factors right here, but since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = solo_fft2_f64(mid0[0], mid1[0]); let [output2, output3] = solo_fft2_f64(mid0[1], mid1[1]); let [output4, output5] = solo_fft2_f64(mid0[2], mid1[2]); // Reorder into output [output0, output3, output4, output1, output2, output5] } } // ___ _________ _ _ _ // ( _ ) |___ /___ \| |__ (_) |_ // / _ \ _____ |_ \ __) | '_ \| | __| // | (_) | |_____| ___) / __/| |_) | | |_ // \___/ |____/_____|_.__/|_|\__| // pub struct 
WasmSimdF32Butterfly8 { root2: v128, root2_dual: v128, direction: FftDirection, bf4: WasmSimdF32Butterfly4, rotate90: Rotate90F32, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly8, 8, |this: &WasmSimdF32Butterfly8<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly8, 8, |this: &WasmSimdF32Butterfly8<_>| this.direction ); impl WasmSimdF32Butterfly8 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf4 = WasmSimdF32Butterfly4::new(direction); let root2 = f32x4(1.0, 1.0, 0.5f32.sqrt(), 0.5f32.sqrt()); let root2_dual = f32x4_splat(0.5f32.sqrt()); let rotate90 = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; Self { root2, root2_dual, direction, bf4, rotate90, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6}); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14}); let values = interleave_complex_f32!(input_packed, 4, {0, 1, 2, 3}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7}); } #[inline(always)] unsafe fn perform_fft_direct(&self, values: [v128; 4]) -> [v128; 4] { // we're going to hardcode a step of mixed radix // step 1: copy and reorder the input into the scratch let [in02, in13] = transpose_complex_2x2_f32(values[0], values[1]); let [in46, in57] = transpose_complex_2x2_f32(values[2], values[3]); // step 2: column FFTs let val0 = self.bf4.perform_fft_direct(in02, in46); let mut val2 = self.bf4.perform_fft_direct(in13, in57); // step 3: apply twiddle factors let val2b = self.rotate90.rotate_hi(val2[0]); let val2c = f32x4_add(val2b, val2[0]); let val2d = f32x4_mul(val2c, self.root2); val2[0] = extract_lo_hi_f32(val2[0], val2d); let val3b = self.rotate90.rotate_both(val2[1]); let val3c = f32x4_sub(val3b, val2[1]); let val3d = f32x4_mul(val3c, self.root2); val2[1] = extract_lo_hi_f32(val3b, val3d); // step 4: transpose -- skipped because we're going to do the next FFTs non-contiguously // step 5: row FFTs let out0 = parallel_fft2_interleaved_f32(val0[0], val2[0]); let out1 = parallel_fft2_interleaved_f32(val0[1], val2[1]); // step 6: rearrange and copy to buffer [out0[0], out1[0], out0[1], out1[1]] } #[inline(always)] unsafe fn perform_parallel_fft_direct(&self, values: [v128; 8]) -> [v128; 8] { // we're going to hardcode a step of mixed radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let val03 = self .bf4 .perform_parallel_fft_direct(values[0], values[2], values[4], values[6]); let mut val47 = self .bf4 .perform_parallel_fft_direct(values[1], values[3], values[5], values[7]); // step 3: apply twiddle factors let val5b = self.rotate90.rotate_both(val47[1]); let val5c = f32x4_add(val5b, val47[1]); val47[1] = f32x4_mul(val5c, self.root2_dual); val47[2] = self.rotate90.rotate_both(val47[2]); let val7b = self.rotate90.rotate_both(val47[3]); let val7c = f32x4_sub(val7b, val47[3]); val47[3] = f32x4_mul(val7c, self.root2_dual); // step 4: transpose -- skipped because we're going to do the next FFTs 
non-contiguously // step 5: row FFTs let out0 = parallel_fft2_interleaved_f32(val03[0], val47[0]); let out1 = parallel_fft2_interleaved_f32(val03[1], val47[1]); let out2 = parallel_fft2_interleaved_f32(val03[2], val47[2]); let out3 = parallel_fft2_interleaved_f32(val03[3], val47[3]); // step 6: rearrange and copy to buffer [ out0[0], out1[0], out2[0], out3[0], out0[1], out1[1], out2[1], out3[1], ] } } // ___ __ _ _ _ _ _ // ( _ ) / /_ | || | | |__ (_) |_ // / _ \ _____ | '_ \| || |_| '_ \| | __| // | (_) | |_____| | (_) |__ _| |_) | | |_ // \___/ \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly8 { root2: v128, direction: FftDirection, bf4: WasmSimdF64Butterfly4, rotate90: Rotate90F64, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly8, 8, |this: &WasmSimdF64Butterfly8<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly8, 8, |this: &WasmSimdF64Butterfly8<_>| this.direction ); impl WasmSimdF64Butterfly8 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf4 = WasmSimdF64Butterfly4::new(direction); let root2 = f64x2_splat(0.5f64.sqrt()); let rotate90 = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; Self { root2, direction, bf4, rotate90, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7}); } #[inline(always)] unsafe fn perform_fft_direct(&self, values: [v128; 8]) -> [v128; 8] { // we're going to hardcode a step of mixed radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let val03 = self .bf4 .perform_fft_direct(values[0], values[2], values[4], values[6]); let mut val47 = self .bf4 .perform_fft_direct(values[1], values[3], values[5], values[7]); // step 3: apply twiddle factors let val5b = self.rotate90.rotate(val47[1]); let val5c = f64x2_add(val5b, val47[1]); val47[1] = f64x2_mul(val5c, self.root2); val47[2] = self.rotate90.rotate(val47[2]); let val7b = self.rotate90.rotate(val47[3]); let val7c = f64x2_sub(val7b, val47[3]); val47[3] = f64x2_mul(val7c, self.root2); // step 4: transpose -- skipped because we're going to do the next FFTs non-contiguously // step 5: row FFTs let out0 = solo_fft2_f64(val03[0], val47[0]); let out1 = solo_fft2_f64(val03[1], val47[1]); let out2 = solo_fft2_f64(val03[2], val47[2]); let out3 = solo_fft2_f64(val03[3], val47[3]); // step 6: rearrange and copy to buffer [ out0[0], out1[0], out2[0], out3[0], out0[1], out1[1], out2[1], out3[1], ] } } // ___ _________ _ _ _ // / _ \ |___ /___ \| |__ (_) |_ // | (_) | _____ |_ \ __) | '_ \| | __| // \__, | |_____| ___) / __/| |_) | | |_ // /_/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly9 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: WasmSimdF32Butterfly3, twiddle1: v128, twiddle2: v128, twiddle4: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly9, 9, |this: &WasmSimdF32Butterfly9<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly9, 9, |this: &WasmSimdF32Butterfly9<_>| this.direction ); impl WasmSimdF32Butterfly9 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = WasmSimdF32Butterfly3::new(direction); let tw1: Complex = twiddles::compute_twiddle(1, 9, direction); let tw2: Complex = 
twiddles::compute_twiddle(2, 9, direction); let tw4: Complex = twiddles::compute_twiddle(4, 9, direction); let twiddle1 = f32x4(tw1.re, tw1.im, tw1.re, tw1.im); let twiddle2 = f32x4(tw2.re, tw2.im, tw2.re, tw2.im); let twiddle4 = f32x4(tw4.re, tw4.im, tw4.re, tw4.im); Self { direction, _phantom: std::marker::PhantomData, bf3, twiddle1, twiddle2, twiddle4, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { // A single WasmSimd 9-point will need a lot of shuffling, let's just reuse the dual one let values = read_partial1_complex_to_array!(buffer, {0,1,2,3,4,5,6,7,8}); let out = self.perform_parallel_fft_direct(values); for n in 0..9 { buffer.store_partial_lo_complex(out[n], n); } } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[4]), extract_hi_lo_f32(input_packed[0], input_packed[5]), extract_lo_hi_f32(input_packed[1], input_packed[5]), extract_hi_lo_f32(input_packed[1], input_packed[6]), extract_lo_hi_f32(input_packed[2], input_packed[6]), extract_hi_lo_f32(input_packed[2], input_packed[7]), extract_lo_hi_f32(input_packed[3], input_packed[7]), extract_hi_lo_f32(input_packed[3], input_packed[8]), extract_lo_hi_f32(input_packed[4], input_packed[8]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_hi_f32(out[8], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0,1,2,3,4,5,6,7,8}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 9]) -> [v128; 9] { // Algorithm: 3x3 mixed radix // Size-3 FFTs down the columns let mid0 = self .bf3 .perform_parallel_fft_direct(values[0], values[3], values[6]); let mut mid1 = self .bf3 .perform_parallel_fft_direct(values[1], values[4], values[7]); let mut mid2 = self .bf3 .perform_parallel_fft_direct(values[2], values[5], values[8]); // Apply twiddle factors. 
Note that we're re-using twiddle2 mid1[1] = mul_complex_f32(self.twiddle1, mid1[1]); mid1[2] = mul_complex_f32(self.twiddle2, mid1[2]); mid2[1] = mul_complex_f32(self.twiddle2, mid2[1]); mid2[2] = mul_complex_f32(self.twiddle4, mid2[2]); let [output0, output1, output2] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self .bf3 .perform_parallel_fft_direct(mid0[2], mid1[2], mid2[2]); [ output0, output3, output6, output1, output4, output7, output2, output5, output8, ] } } // ___ __ _ _ _ _ _ // / _ \ / /_ | || | | |__ (_) |_ // | (_) | _____ | '_ \| || |_| '_ \| | __| // \__, | |_____| | (_) |__ _| |_) | | |_ // /_/ \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly9 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: WasmSimdF64Butterfly3, twiddle1: v128, twiddle2: v128, twiddle4: v128, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly9, 9, |this: &WasmSimdF64Butterfly9<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly9, 9, |this: &WasmSimdF64Butterfly9<_>| this.direction ); impl WasmSimdF64Butterfly9 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = WasmSimdF64Butterfly3::new(direction); let tw1: Complex = twiddles::compute_twiddle(1, 9, direction); let tw2: Complex = twiddles::compute_twiddle(2, 9, direction); let tw4: Complex = twiddles::compute_twiddle(4, 9, direction); let twiddle1 = f64x2(tw1.re, tw1.im); let twiddle2 = f64x2(tw2.re, tw2.im); let twiddle4 = f64x2(tw4.re, tw4.im); Self { direction, _phantom: std::marker::PhantomData, bf3, twiddle1, twiddle2, twiddle4, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 9]) -> [v128; 9] { // Algorithm: 3x3 mixed radix // Size-3 FFTs down the columns let mid0 = self.bf3.perform_fft_direct(values[0], values[3], values[6]); let mut mid1 = self.bf3.perform_fft_direct(values[1], values[4], values[7]); let mut mid2 = self.bf3.perform_fft_direct(values[2], values[5], values[8]); // Apply twiddle factors. 
Note that we're re-using twiddle2 mid1[1] = mul_complex_f64(self.twiddle1, mid1[1]); mid1[2] = mul_complex_f64(self.twiddle2, mid1[2]); mid2[1] = mul_complex_f64(self.twiddle2, mid2[1]); mid2[2] = mul_complex_f64(self.twiddle4, mid2[2]); let [output0, output1, output2] = self.bf3.perform_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self.bf3.perform_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self.bf3.perform_fft_direct(mid0[2], mid1[2], mid2[2]); [ output0, output3, output6, output1, output4, output7, output2, output5, output8, ] } } // _ ___ _________ _ _ _ // / |/ _ \ |___ /___ \| |__ (_) |_ // | | | | | _____ |_ \ __) | '_ \| | __| // | | |_| | |_____| ___) / __/| |_) | | |_ // |_|\___/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly10 { direction: FftDirection, _phantom: std::marker::PhantomData, bf5: WasmSimdF32Butterfly5, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly10, 10, |this: &WasmSimdF32Butterfly10<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly10, 10, |this: &WasmSimdF32Butterfly10<_>| this.direction ); impl WasmSimdF32Butterfly10 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf5 = WasmSimdF32Butterfly5::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf5, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8}); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18}); let values = interleave_complex_f32!(input_packed, 5, {0, 1, 2, 3, 4}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 5]) -> [v128; 5] { // Algorithm: 5x2 good-thomas // Reorder and pack let reord0 = extract_lo_hi_f32(values[0], values[2]); let reord1 = extract_lo_hi_f32(values[1], values[3]); let reord2 = extract_lo_hi_f32(values[2], values[4]); let reord3 = extract_lo_hi_f32(values[3], values[0]); let reord4 = extract_lo_hi_f32(values[4], values[1]); // Size-5 FFTs down the columns of our reordered array let mids = self .bf5 .perform_parallel_fft_direct(reord0, reord1, reord2, reord3, reord4); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [temp01, temp23] = parallel_fft2_contiguous_f32(mids[0], mids[1]); let [temp45, temp67] = parallel_fft2_contiguous_f32(mids[2], mids[3]); let temp89 = solo_fft2_f32(mids[4]); // Reorder let out01 = extract_lo_hi_f32(temp01, temp23); let out23 = extract_lo_hi_f32(temp45, temp67); let out45 = extract_lo_lo_f32(temp89, temp23); let out67 = extract_hi_lo_f32(temp01, temp67); let out89 = extract_hi_hi_f32(temp45, temp89); [out01, out23, out45, out67, out89] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 10]) -> [v128; 10] { // Algorithm: 5x2 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_parallel_fft_direct(values[0], 
values[2], values[4], values[6], values[8]); let mid1 = self .bf5 .perform_parallel_fft_direct(values[5], values[7], values[9], values[1], values[3]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = parallel_fft2_interleaved_f32(mid0[0], mid1[0]); let [output2, output3] = parallel_fft2_interleaved_f32(mid0[1], mid1[1]); let [output4, output5] = parallel_fft2_interleaved_f32(mid0[2], mid1[2]); let [output6, output7] = parallel_fft2_interleaved_f32(mid0[3], mid1[3]); let [output8, output9] = parallel_fft2_interleaved_f32(mid0[4], mid1[4]); // Reorder and return [ output0, output3, output4, output7, output8, output1, output2, output5, output6, output9, ] } } // _ ___ __ _ _ _ _ _ // / |/ _ \ / /_ | || | | |__ (_) |_ // | | | | | _____ | '_ \| || |_| '_ \| | __| // | | |_| | |_____| | (_) |__ _| |_) | | |_ // |_|\___/ \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly10 { direction: FftDirection, _phantom: std::marker::PhantomData, bf2: WasmSimdF64Butterfly2, bf5: WasmSimdF64Butterfly5, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly10, 10, |this: &WasmSimdF64Butterfly10<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly10, 10, |this: &WasmSimdF64Butterfly10<_>| this.direction ); impl WasmSimdF64Butterfly10 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf2 = WasmSimdF64Butterfly2::new(direction); let bf5 = WasmSimdF64Butterfly5::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf2, bf5, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 10]) -> [v128; 10] { // Algorithm: 5x2 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_fft_direct(values[0], values[2], values[4], values[6], values[8]); let mid1 = self .bf5 .perform_fft_direct(values[5], values[7], values[9], values[1], values[3]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-2 FFTs down the columns let [output0, output1] = self.bf2.perform_fft_direct(mid0[0], mid1[0]); let [output2, output3] = self.bf2.perform_fft_direct(mid0[1], mid1[1]); let [output4, output5] = self.bf2.perform_fft_direct(mid0[2], mid1[2]); let [output6, output7] = self.bf2.perform_fft_direct(mid0[3], mid1[3]); let [output8, output9] = self.bf2.perform_fft_direct(mid0[4], mid1[4]); // Reorder and return [ output0, output3, output4, output7, output8, output1, output2, output5, output6, output9, ] } } // _ ____ _________ _ _ _ // / |___ \ |___ /___ \| |__ (_) |_ // | | __) | _____ |_ \ __) | '_ \| | __| // | |/ __/ |_____| ___) / __/| |_) | | |_ // |_|_____| |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly12 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: WasmSimdF32Butterfly3, bf4: WasmSimdF32Butterfly4, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly12, 12, |this: &WasmSimdF32Butterfly12<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly12, 12, |this: &WasmSimdF32Butterfly12<_>| this.direction ); impl WasmSimdF32Butterfly12 { 
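    // --- Added exposition, not part of the upstream crate ---
    // A sketch of the 4x3 Good-Thomas (prime-factor) indexing used below. The input is
    // gathered as n = (4 * n2 + 3 * n1) mod 12, with column n2 in 0..3 and row n1 in 0..4
    // (e.g. column 1 reads indices 4, 7, 10, 1). After the size-4 column FFTs and the
    // size-3 row FFTs, output bin k is assembled from the pair (k1, k2) = (k mod 4, k mod 3).
    // Because both maps come from the Chinese Remainder Theorem, no twiddle factors are
    // needed between the two stages, which is why this struct stores no twiddle constants,
    // unlike the 3x3 mixed-radix size-9 butterfly above.
    //
    //     // illustrative index helpers only (hypothetical, not used by the crate):
    //     // let input_index = |n1: usize, n2: usize| (4 * n2 + 3 * n1) % 12;
    //     // let output_source = |k: usize| (k % 4, k % 3);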
#[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = WasmSimdF32Butterfly3::new(direction); let bf4 = WasmSimdF32Butterfly4::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, bf4, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10 }); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4,5}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22}); let values = interleave_complex_f32!(input_packed, 6, {0, 1, 2, 3, 4, 5}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8, 10}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 6]) -> [v128; 6] { // Algorithm: 4x3 good-thomas // Reorder and pack let packed03 = extract_lo_hi_f32(values[0], values[1]); let packed47 = extract_lo_hi_f32(values[2], values[3]); let packed69 = extract_lo_hi_f32(values[3], values[4]); let packed101 = extract_lo_hi_f32(values[5], values[0]); let packed811 = extract_lo_hi_f32(values[4], values[5]); let packed25 = extract_lo_hi_f32(values[1], values[2]); // Size-4 FFTs down the columns of our reordered array let mid0 = self.bf4.perform_fft_direct(packed03, packed69); let mid1 = self.bf4.perform_fft_direct(packed47, packed101); let mid2 = self.bf4.perform_fft_direct(packed811, packed25); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [temp03, temp14, temp25] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [temp69, temp710, temp811] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); // Reorder and return [ extract_lo_hi_f32(temp03, temp14), extract_lo_hi_f32(temp811, temp69), extract_lo_hi_f32(temp14, temp25), extract_lo_hi_f32(temp69, temp710), extract_lo_hi_f32(temp25, temp03), extract_lo_hi_f32(temp710, temp811), ] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 12]) -> [v128; 12] { // Algorithm: 4x3 good-thomas // Size-4 FFTs down the columns of our reordered array let mid0 = self .bf4 .perform_parallel_fft_direct(values[0], values[3], values[6], values[9]); let mid1 = self .bf4 .perform_parallel_fft_direct(values[4], values[7], values[10], values[1]); let mid2 = self .bf4 .perform_parallel_fft_direct(values[8], values[11], values[2], values[5]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self .bf3 .perform_parallel_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self .bf3 .perform_parallel_fft_direct(mid0[3], mid1[3], mid2[3]); // Reorder and return [ output0, output4, output8, output9, output1, output5, output6, output10, output2, output3, output7, output11, ] } } // _ ____ __ _ _ _ _ _ // / |___ \ / /_ | || | | |__ (_) |_ // | | 
__) | _____ | '_ \| || |_| '_ \| | __| // | |/ __/ |_____| | (_) |__ _| |_) | | |_ // |_|_____| \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly12 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: WasmSimdF64Butterfly3, bf4: WasmSimdF64Butterfly4, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly12, 12, |this: &WasmSimdF64Butterfly12<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly12, 12, |this: &WasmSimdF64Butterfly12<_>| this.direction ); impl WasmSimdF64Butterfly12 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = WasmSimdF64Butterfly3::new(direction); let bf4 = WasmSimdF64Butterfly4::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, bf4, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 12]) -> [v128; 12] { // Algorithm: 4x3 good-thomas // Size-4 FFTs down the columns of our reordered array let mid0 = self .bf4 .perform_fft_direct(values[0], values[3], values[6], values[9]); let mid1 = self .bf4 .perform_fft_direct(values[4], values[7], values[10], values[1]); let mid2 = self .bf4 .perform_fft_direct(values[8], values[11], values[2], values[5]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self.bf3.perform_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self.bf3.perform_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self.bf3.perform_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self.bf3.perform_fft_direct(mid0[3], mid1[3], mid2[3]); [ output0, output4, output8, output9, output1, output5, output6, output10, output2, output3, output7, output11, ] } } // _ ____ _________ _ _ _ // / | ___| |___ /___ \| |__ (_) |_ // | |___ \ _____ |_ \ __) | '_ \| | __| // | |___) | |_____| ___) / __/| |_) | | |_ // |_|____/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly15 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: WasmSimdF32Butterfly3, bf5: WasmSimdF32Butterfly5, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly15, 15, |this: &WasmSimdF32Butterfly15<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly15, 15, |this: &WasmSimdF32Butterfly15<_>| this.direction ); impl WasmSimdF32Butterfly15 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf3 = WasmSimdF32Butterfly3::new(direction); let bf5 = WasmSimdF32Butterfly5::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, bf5, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { // A single WasmSimd 15-point will need a lot of shuffling, let's just reuse the dual one let values = read_partial1_complex_to_array!(buffer, {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14}); let out = self.perform_parallel_fft_direct(values); for n in 0..15 { buffer.store_partial_lo_complex(out[n], n); } } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl 
WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[7]), extract_hi_lo_f32(input_packed[0], input_packed[8]), extract_lo_hi_f32(input_packed[1], input_packed[8]), extract_hi_lo_f32(input_packed[1], input_packed[9]), extract_lo_hi_f32(input_packed[2], input_packed[9]), extract_hi_lo_f32(input_packed[2], input_packed[10]), extract_lo_hi_f32(input_packed[3], input_packed[10]), extract_hi_lo_f32(input_packed[3], input_packed[11]), extract_lo_hi_f32(input_packed[4], input_packed[11]), extract_hi_lo_f32(input_packed[4], input_packed[12]), extract_lo_hi_f32(input_packed[5], input_packed[12]), extract_hi_lo_f32(input_packed[5], input_packed[13]), extract_lo_hi_f32(input_packed[6], input_packed[13]), extract_hi_lo_f32(input_packed[6], input_packed[14]), extract_lo_hi_f32(input_packed[7], input_packed[14]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_hi_f32(out[14], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11, 12, 13, 14}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 15]) -> [v128; 15] { // Algorithm: 5x3 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_parallel_fft_direct(values[0], values[3], values[6], values[9], values[12]); let mid1 = self .bf5 .perform_parallel_fft_direct(values[5], values[8], values[11], values[14], values[2]); let mid2 = self .bf5 .perform_parallel_fft_direct(values[10], values[13], values[1], values[4], values[7]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self .bf3 .perform_parallel_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self .bf3 .perform_parallel_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self .bf3 .perform_parallel_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self .bf3 .perform_parallel_fft_direct(mid0[3], mid1[3], mid2[3]); let [output12, output13, output14] = self .bf3 .perform_parallel_fft_direct(mid0[4], mid1[4], mid2[4]); [ output0, output4, output8, output9, output13, output2, output3, output7, output11, output12, output1, output5, output6, output10, output14, ] } } // _ ____ __ _ _ _ _ _ // / | ___| / /_ | || | | |__ (_) |_ // | |___ \ _____ | '_ \| || |_| '_ \| | __| // | |___) | |_____| | (_) |__ _| |_) | | |_ // |_|____/ \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly15 { direction: FftDirection, _phantom: std::marker::PhantomData, bf3: WasmSimdF64Butterfly3, bf5: WasmSimdF64Butterfly5, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly15, 15, |this: &WasmSimdF64Butterfly15<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly15, 15, |this: &WasmSimdF64Butterfly15<_>| 
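// Editor's note (illustrative): the size-15 butterflies use the same
// Good-Thomas construction as size 12, now as 5x3. The column gathers read
// input index (5*c + 3*r) % 15 (giving columns {0,3,6,9,12}, {5,8,11,14,2}
// and {10,13,1,4,7}), and the final reorder writes the (r, c) result to
// k = (6*r + 10*c) % 15, which satisfies k % 5 == r and k % 3 == c.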
this.direction ); impl WasmSimdF64Butterfly15 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf3 = WasmSimdF64Butterfly3::new(direction); let bf5 = WasmSimdF64Butterfly5::new(direction); Self { direction, _phantom: std::marker::PhantomData, bf3, bf5, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 15]) -> [v128; 15] { // Algorithm: 5x3 good-thomas // Size-5 FFTs down the columns of our reordered array let mid0 = self .bf5 .perform_fft_direct(values[0], values[3], values[6], values[9], values[12]); let mid1 = self .bf5 .perform_fft_direct(values[5], values[8], values[11], values[14], values[2]); let mid2 = self .bf5 .perform_fft_direct(values[10], values[13], values[1], values[4], values[7]); // Since this is good-thomas algorithm, we don't need twiddle factors // Transpose the data and do size-3 FFTs down the columns let [output0, output1, output2] = self.bf3.perform_fft_direct(mid0[0], mid1[0], mid2[0]); let [output3, output4, output5] = self.bf3.perform_fft_direct(mid0[1], mid1[1], mid2[1]); let [output6, output7, output8] = self.bf3.perform_fft_direct(mid0[2], mid1[2], mid2[2]); let [output9, output10, output11] = self.bf3.perform_fft_direct(mid0[3], mid1[3], mid2[3]); let [output12, output13, output14] = self.bf3.perform_fft_direct(mid0[4], mid1[4], mid2[4]); [ output0, output4, output8, output9, output13, output2, output3, output7, output11, output12, output1, output5, output6, output10, output14, ] } } // _ __ _________ _ _ _ // / |/ /_ |___ /___ \| |__ (_) |_ // | | '_ \ _____ |_ \ __) | '_ \| | __| // | | (_) | |_____| ___) / __/| |_) | | |_ // |_|\___/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly16 { direction: FftDirection, bf4: WasmSimdF32Butterfly4, bf8: WasmSimdF32Butterfly8, rotate90: Rotate90F32, twiddle01: v128, twiddle23: v128, twiddle01conj: v128, twiddle23conj: v128, twiddle1: v128, twiddle2: v128, twiddle3: v128, twiddle1c: v128, twiddle2c: v128, twiddle3c: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly16, 16, |this: &WasmSimdF32Butterfly16<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly16, 16, |this: &WasmSimdF32Butterfly16<_>| this.direction ); impl WasmSimdF32Butterfly16 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf8 = WasmSimdF32Butterfly8::new(direction); let bf4 = WasmSimdF32Butterfly4::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; let tw1: Complex = twiddles::compute_twiddle(1, 16, direction); let tw2: Complex = twiddles::compute_twiddle(2, 16, direction); let tw3: Complex = twiddles::compute_twiddle(3, 16, direction); let twiddle01 = f32x4(1.0, 0.0, tw1.re, tw1.im); let twiddle23 = f32x4(tw2.re, tw2.im, tw3.re, tw3.im); let twiddle01conj = f32x4(1.0, 0.0, tw1.re, -tw1.im); let twiddle23conj = f32x4(tw2.re, -tw2.im, tw3.re, -tw3.im); let twiddle1 = f32x4(tw1.re, tw1.im, tw1.re, tw1.im); let twiddle2 = f32x4(tw2.re, tw2.im, tw2.re, tw2.im); let twiddle3 = f32x4(tw3.re, tw3.im, tw3.re, tw3.im); let twiddle1c = f32x4(tw1.re, -tw1.im, tw1.re, -tw1.im); let 
twiddle2c = f32x4(tw2.re, -tw2.im, tw2.re, -tw2.im); let twiddle3c = f32x4(tw3.re, -tw3.im, tw3.re, -tw3.im); Self { direction, bf4, bf8, rotate90, twiddle01, twiddle23, twiddle01conj, twiddle23conj, twiddle1, twiddle2, twiddle3, twiddle1c, twiddle2c, twiddle3c, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14 }); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4,5,6,7}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30}); let values = interleave_complex_f32!(input_packed, 8, {0, 1, 2, 3 ,4 ,5 ,6 ,7}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8, 10, 12, 14}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11,12,13,14, 15}); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: [v128; 8]) -> [v128; 8] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch let in0002 = extract_lo_lo_f32(input[0], input[1]); let in0406 = extract_lo_lo_f32(input[2], input[3]); let in0810 = extract_lo_lo_f32(input[4], input[5]); let in1214 = extract_lo_lo_f32(input[6], input[7]); let in0105 = extract_hi_hi_f32(input[0], input[2]); let in0913 = extract_hi_hi_f32(input[4], input[6]); let in1503 = extract_hi_hi_f32(input[7], input[1]); let in0711 = extract_hi_hi_f32(input[3], input[5]); let in_evens = [in0002, in0406, in0810, in1214]; // step 2: column FFTs let evens = self.bf8.perform_fft_direct(in_evens); let mut odds1 = self.bf4.perform_fft_direct(in0105, in0913); let mut odds3 = self.bf4.perform_fft_direct(in1503, in0711); // step 3: apply twiddle factors odds1[0] = mul_complex_f32(odds1[0], self.twiddle01); odds3[0] = mul_complex_f32(odds3[0], self.twiddle01conj); odds1[1] = mul_complex_f32(odds1[1], self.twiddle23); odds3[1] = mul_complex_f32(odds3[1], self.twiddle23conj); // step 4: cross FFTs let mut temp0 = parallel_fft2_interleaved_f32(odds1[0], odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); //step 5: copy/add/subtract data back to buffer [ f32x4_add(evens[0], temp0[0]), f32x4_add(evens[1], temp1[0]), f32x4_add(evens[2], temp0[1]), f32x4_add(evens[3], temp1[1]), f32x4_sub(evens[0], temp0[0]), f32x4_sub(evens[1], temp1[0]), f32x4_sub(evens[2], temp0[1]), f32x4_sub(evens[3], temp1[1]), ] } #[inline(always)] unsafe fn perform_parallel_fft_direct(&self, input: [v128; 16]) -> [v128; 16] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf8.perform_parallel_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], ]); let mut odds1 = self .bf4 .perform_parallel_fft_direct(input[1], input[5], input[9], input[13]); let mut odds3 = self .bf4 .perform_parallel_fft_direct(input[15], input[3], input[7], input[11]); // step 3: apply twiddle factors odds1[1] = mul_complex_f32(odds1[1], self.twiddle1); odds3[1] = mul_complex_f32(odds3[1], self.twiddle1c); 
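// Editor's note (assuming the usual forward convention w_k = exp(-2*pi*i*k/16)):
// the odd stream gathered from indices congruent to -1 mod 4 uses the
// conjugated twiddles, since conj(w_k) = w_{-k}. For example
// w_1 ≈ 0.92388 - 0.38268i, so twiddle1c ≈ 0.92388 + 0.38268i.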
odds1[2] = mul_complex_f32(odds1[2], self.twiddle2); odds3[2] = mul_complex_f32(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f32(odds1[3], self.twiddle3); odds3[3] = mul_complex_f32(odds3[3], self.twiddle3c); // step 4: cross FFTs let mut temp0 = parallel_fft2_interleaved_f32(odds1[0], odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); let mut temp2 = parallel_fft2_interleaved_f32(odds1[2], odds3[2]); let mut temp3 = parallel_fft2_interleaved_f32(odds1[3], odds3[3]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); temp2[1] = self.rotate90.rotate_both(temp2[1]); temp3[1] = self.rotate90.rotate_both(temp3[1]); //step 5: copy/add/subtract data back to buffer [ f32x4_add(evens[0], temp0[0]), f32x4_add(evens[1], temp1[0]), f32x4_add(evens[2], temp2[0]), f32x4_add(evens[3], temp3[0]), f32x4_add(evens[4], temp0[1]), f32x4_add(evens[5], temp1[1]), f32x4_add(evens[6], temp2[1]), f32x4_add(evens[7], temp3[1]), f32x4_sub(evens[0], temp0[0]), f32x4_sub(evens[1], temp1[0]), f32x4_sub(evens[2], temp2[0]), f32x4_sub(evens[3], temp3[0]), f32x4_sub(evens[4], temp0[1]), f32x4_sub(evens[5], temp1[1]), f32x4_sub(evens[6], temp2[1]), f32x4_sub(evens[7], temp3[1]), ] } } // _ __ __ _ _ _ _ _ // / |/ /_ / /_ | || | | |__ (_) |_ // | | '_ \ _____ | '_ \| || |_| '_ \| | __| // | | (_) | |_____| | (_) |__ _| |_) | | |_ // |_|\___/ \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly16 { direction: FftDirection, bf4: WasmSimdF64Butterfly4, bf8: WasmSimdF64Butterfly8, rotate90: Rotate90F64, twiddle1: v128, twiddle2: v128, twiddle3: v128, twiddle1c: v128, twiddle2c: v128, twiddle3c: v128, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly16, 16, |this: &WasmSimdF64Butterfly16<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly16, 16, |this: &WasmSimdF64Butterfly16<_>| this.direction ); impl WasmSimdF64Butterfly16 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf8 = WasmSimdF64Butterfly8::new(direction); let bf4 = WasmSimdF64Butterfly4::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; let twiddle1 = unsafe { v128_load( &twiddles::compute_twiddle::(1, 16, direction) as *const _ as *const v128, ) }; let twiddle2 = unsafe { v128_load( &twiddles::compute_twiddle::(2, 16, direction) as *const _ as *const v128, ) }; let twiddle3 = unsafe { v128_load( &twiddles::compute_twiddle::(3, 16, direction) as *const _ as *const v128, ) }; let twiddle1c = unsafe { v128_load( &twiddles::compute_twiddle::(1, 16, direction).conj() as *const _ as *const v128, ) }; let twiddle2c = unsafe { v128_load( &twiddles::compute_twiddle::(2, 16, direction).conj() as *const _ as *const v128, ) }; let twiddle3c = unsafe { v128_load( &twiddles::compute_twiddle::(3, 16, direction).conj() as *const _ as *const v128, ) }; Self { direction, bf4, bf8, rotate90, twiddle1, twiddle2, twiddle3, twiddle1c, twiddle2c, twiddle3c, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: [v128; 16]) -> [v128; 16] { 
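// Editor's sketch of the recombination hardcoded below (assuming the usual
// conjugate-pair split-radix formulation and the forward convention
// w = exp(-2*pi*i/16)): with E = size-8 FFT of the even samples,
// O1 = size-4 FFT of x[1], x[5], x[9], x[13] and O3 = size-4 FFT of
// x[15], x[3], x[7], x[11] (indices congruent to -1 mod 4), for k in 0..4:
//   X[k]      = E[k]   + (w^k * O1[k] + w^-k * O3[k])
//   X[k + 4]  = E[k+4] - i * (w^k * O1[k] - w^-k * O3[k])
//   X[k + 8]  = E[k]   - (w^k * O1[k] + w^-k * O3[k])
//   X[k + 12] = E[k+4] + i * (w^k * O1[k] - w^-k * O3[k])
// The sum/difference pairs come from the size-2 "cross FFTs", and the -i / +i
// factor is what the Rotate90 helper applies (signs swap for an inverse transform).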
// we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf8.perform_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], ]); let mut odds1 = self .bf4 .perform_fft_direct(input[1], input[5], input[9], input[13]); let mut odds3 = self .bf4 .perform_fft_direct(input[15], input[3], input[7], input[11]); // step 3: apply twiddle factors odds1[1] = mul_complex_f64(odds1[1], self.twiddle1); odds3[1] = mul_complex_f64(odds3[1], self.twiddle1c); odds1[2] = mul_complex_f64(odds1[2], self.twiddle2); odds3[2] = mul_complex_f64(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f64(odds1[3], self.twiddle3); odds3[3] = mul_complex_f64(odds3[3], self.twiddle3c); // step 4: cross FFTs let mut temp0 = solo_fft2_f64(odds1[0], odds3[0]); let mut temp1 = solo_fft2_f64(odds1[1], odds3[1]); let mut temp2 = solo_fft2_f64(odds1[2], odds3[2]); let mut temp3 = solo_fft2_f64(odds1[3], odds3[3]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate(temp0[1]); temp1[1] = self.rotate90.rotate(temp1[1]); temp2[1] = self.rotate90.rotate(temp2[1]); temp3[1] = self.rotate90.rotate(temp3[1]); //step 5: copy/add/subtract data back to buffer [ f64x2_add(evens[0], temp0[0]), f64x2_add(evens[1], temp1[0]), f64x2_add(evens[2], temp2[0]), f64x2_add(evens[3], temp3[0]), f64x2_add(evens[4], temp0[1]), f64x2_add(evens[5], temp1[1]), f64x2_add(evens[6], temp2[1]), f64x2_add(evens[7], temp3[1]), f64x2_sub(evens[0], temp0[0]), f64x2_sub(evens[1], temp1[0]), f64x2_sub(evens[2], temp2[0]), f64x2_sub(evens[3], temp3[0]), f64x2_sub(evens[4], temp0[1]), f64x2_sub(evens[5], temp1[1]), f64x2_sub(evens[6], temp2[1]), f64x2_sub(evens[7], temp3[1]), ] } } // _________ _________ _ _ _ // |___ /___ \ |___ /___ \| |__ (_) |_ // |_ \ __) | _____ |_ \ __) | '_ \| | __| // ___) / __/ |_____| ___) / __/| |_) | | |_ // |____/_____| |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly32 { direction: FftDirection, bf8: WasmSimdF32Butterfly8, bf16: WasmSimdF32Butterfly16, rotate90: Rotate90F32, twiddle01: v128, twiddle23: v128, twiddle45: v128, twiddle67: v128, twiddle01conj: v128, twiddle23conj: v128, twiddle45conj: v128, twiddle67conj: v128, twiddle1: v128, twiddle2: v128, twiddle3: v128, twiddle4: v128, twiddle5: v128, twiddle6: v128, twiddle7: v128, twiddle1c: v128, twiddle2c: v128, twiddle3c: v128, twiddle4c: v128, twiddle5c: v128, twiddle6c: v128, twiddle7c: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly32, 32, |this: &WasmSimdF32Butterfly32<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly32, 32, |this: &WasmSimdF32Butterfly32<_>| this.direction ); impl WasmSimdF32Butterfly32 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let bf8 = WasmSimdF32Butterfly8::new(direction); let bf16 = WasmSimdF32Butterfly16::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F32::new(true) } else { Rotate90F32::new(false) }; let tw1: Complex = twiddles::compute_twiddle(1, 32, direction); let tw2: Complex = twiddles::compute_twiddle(2, 32, direction); let tw3: Complex = twiddles::compute_twiddle(3, 32, direction); let tw4: Complex = twiddles::compute_twiddle(4, 32, direction); let tw5: Complex = twiddles::compute_twiddle(5, 32, direction); let tw6: Complex = twiddles::compute_twiddle(6, 32, direction); let tw7: Complex = twiddles::compute_twiddle(7, 32, 
direction); let twiddle01 = f32x4(1.0, 0.0, tw1.re, tw1.im); let twiddle23 = f32x4(tw2.re, tw2.im, tw3.re, tw3.im); let twiddle45 = f32x4(tw4.re, tw4.im, tw5.re, tw5.im); let twiddle67 = f32x4(tw6.re, tw6.im, tw7.re, tw7.im); let twiddle01conj = f32x4(1.0, 0.0, tw1.re, -tw1.im); let twiddle23conj = f32x4(tw2.re, -tw2.im, tw3.re, -tw3.im); let twiddle45conj = f32x4(tw4.re, -tw4.im, tw5.re, -tw5.im); let twiddle67conj = f32x4(tw6.re, -tw6.im, tw7.re, -tw7.im); let twiddle1 = f32x4(tw1.re, tw1.im, tw1.re, tw1.im); let twiddle2 = f32x4(tw2.re, tw2.im, tw2.re, tw2.im); let twiddle3 = f32x4(tw3.re, tw3.im, tw3.re, tw3.im); let twiddle4 = f32x4(tw4.re, tw4.im, tw4.re, tw4.im); let twiddle5 = f32x4(tw5.re, tw5.im, tw5.re, tw5.im); let twiddle6 = f32x4(tw6.re, tw6.im, tw6.re, tw6.im); let twiddle7 = f32x4(tw7.re, tw7.im, tw7.re, tw7.im); let twiddle1c = f32x4(tw1.re, -tw1.im, tw1.re, -tw1.im); let twiddle2c = f32x4(tw2.re, -tw2.im, tw2.re, -tw2.im); let twiddle3c = f32x4(tw3.re, -tw3.im, tw3.re, -tw3.im); let twiddle4c = f32x4(tw4.re, -tw4.im, tw4.re, -tw4.im); let twiddle5c = f32x4(tw5.re, -tw5.im, tw5.re, -tw5.im); let twiddle6c = f32x4(tw6.re, -tw6.im, tw6.re, -tw6.im); let twiddle7c = f32x4(tw7.re, -tw7.im, tw7.re, -tw7.im); Self { direction, bf8, bf16, rotate90, twiddle01, twiddle23, twiddle45, twiddle67, twiddle01conj, twiddle23conj, twiddle45conj, twiddle67conj, twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, twiddle6, twiddle7, twiddle1c, twiddle2c, twiddle3c, twiddle4c, twiddle5c, twiddle6c, twiddle7c, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 }); let out = self.perform_fft_direct(input_packed); write_complex_to_array_strided!(out, buffer, 2, {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62}); let values = interleave_complex_f32!(input_packed, 16, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}); let out = self.perform_parallel_fft_direct(values); let out_sorted = separate_interleaved_complex_f32!(out, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30}); write_complex_to_array_strided!(out_sorted, buffer, 2, {0,1,2,3,4,5,6,7,8,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 }); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: [v128; 16]) -> [v128; 16] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch let in0002 = extract_lo_lo_f32(input[0], input[1]); let in0406 = extract_lo_lo_f32(input[2], input[3]); let in0810 = extract_lo_lo_f32(input[4], input[5]); let in1214 = extract_lo_lo_f32(input[6], input[7]); let in1618 = extract_lo_lo_f32(input[8], input[9]); let in2022 = extract_lo_lo_f32(input[10], input[11]); let in2426 = extract_lo_lo_f32(input[12], input[13]); let in2830 = extract_lo_lo_f32(input[14], input[15]); let in0105 = extract_hi_hi_f32(input[0], input[2]); let in0913 = extract_hi_hi_f32(input[4], input[6]); let in1721 = extract_hi_hi_f32(input[8], input[10]); let in2529 = extract_hi_hi_f32(input[12], input[14]); let in3103 = extract_hi_hi_f32(input[15], input[1]); let in0711 = extract_hi_hi_f32(input[3], 
input[5]); let in1519 = extract_hi_hi_f32(input[7], input[9]); let in2327 = extract_hi_hi_f32(input[11], input[13]); let in_evens = [ in0002, in0406, in0810, in1214, in1618, in2022, in2426, in2830, ]; // step 2: column FFTs let evens = self.bf16.perform_fft_direct(in_evens); let mut odds1 = self .bf8 .perform_fft_direct([in0105, in0913, in1721, in2529]); let mut odds3 = self .bf8 .perform_fft_direct([in3103, in0711, in1519, in2327]); // step 3: apply twiddle factors odds1[0] = mul_complex_f32(odds1[0], self.twiddle01); odds3[0] = mul_complex_f32(odds3[0], self.twiddle01conj); odds1[1] = mul_complex_f32(odds1[1], self.twiddle23); odds3[1] = mul_complex_f32(odds3[1], self.twiddle23conj); odds1[2] = mul_complex_f32(odds1[2], self.twiddle45); odds3[2] = mul_complex_f32(odds3[2], self.twiddle45conj); odds1[3] = mul_complex_f32(odds1[3], self.twiddle67); odds3[3] = mul_complex_f32(odds3[3], self.twiddle67conj); // step 4: cross FFTs let mut temp0 = parallel_fft2_interleaved_f32(odds1[0], odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); let mut temp2 = parallel_fft2_interleaved_f32(odds1[2], odds3[2]); let mut temp3 = parallel_fft2_interleaved_f32(odds1[3], odds3[3]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); temp2[1] = self.rotate90.rotate_both(temp2[1]); temp3[1] = self.rotate90.rotate_both(temp3[1]); //step 5: copy/add/subtract data back to buffer [ f32x4_add(evens[0], temp0[0]), f32x4_add(evens[1], temp1[0]), f32x4_add(evens[2], temp2[0]), f32x4_add(evens[3], temp3[0]), f32x4_add(evens[4], temp0[1]), f32x4_add(evens[5], temp1[1]), f32x4_add(evens[6], temp2[1]), f32x4_add(evens[7], temp3[1]), f32x4_sub(evens[0], temp0[0]), f32x4_sub(evens[1], temp1[0]), f32x4_sub(evens[2], temp2[0]), f32x4_sub(evens[3], temp3[0]), f32x4_sub(evens[4], temp0[1]), f32x4_sub(evens[5], temp1[1]), f32x4_sub(evens[6], temp2[1]), f32x4_sub(evens[7], temp3[1]), ] } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, input: [v128; 32]) -> [v128; 32] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf16.perform_parallel_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], input[16], input[18], input[20], input[22], input[24], input[26], input[28], input[30], ]); let mut odds1 = self.bf8.perform_parallel_fft_direct([ input[1], input[5], input[9], input[13], input[17], input[21], input[25], input[29], ]); let mut odds3 = self.bf8.perform_parallel_fft_direct([ input[31], input[3], input[7], input[11], input[15], input[19], input[23], input[27], ]); // step 3: apply twiddle factors odds1[1] = mul_complex_f32(odds1[1], self.twiddle1); odds3[1] = mul_complex_f32(odds3[1], self.twiddle1c); odds1[2] = mul_complex_f32(odds1[2], self.twiddle2); odds3[2] = mul_complex_f32(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f32(odds1[3], self.twiddle3); odds3[3] = mul_complex_f32(odds3[3], self.twiddle3c); odds1[4] = mul_complex_f32(odds1[4], self.twiddle4); odds3[4] = mul_complex_f32(odds3[4], self.twiddle4c); odds1[5] = mul_complex_f32(odds1[5], self.twiddle5); odds3[5] = mul_complex_f32(odds3[5], self.twiddle5c); odds1[6] = mul_complex_f32(odds1[6], self.twiddle6); odds3[6] = mul_complex_f32(odds3[6], self.twiddle6c); odds1[7] = mul_complex_f32(odds1[7], self.twiddle7); odds3[7] = mul_complex_f32(odds3[7], 
self.twiddle7c); // step 4: cross FFTs let mut temp0 = parallel_fft2_interleaved_f32(odds1[0], odds3[0]); let mut temp1 = parallel_fft2_interleaved_f32(odds1[1], odds3[1]); let mut temp2 = parallel_fft2_interleaved_f32(odds1[2], odds3[2]); let mut temp3 = parallel_fft2_interleaved_f32(odds1[3], odds3[3]); let mut temp4 = parallel_fft2_interleaved_f32(odds1[4], odds3[4]); let mut temp5 = parallel_fft2_interleaved_f32(odds1[5], odds3[5]); let mut temp6 = parallel_fft2_interleaved_f32(odds1[6], odds3[6]); let mut temp7 = parallel_fft2_interleaved_f32(odds1[7], odds3[7]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate_both(temp0[1]); temp1[1] = self.rotate90.rotate_both(temp1[1]); temp2[1] = self.rotate90.rotate_both(temp2[1]); temp3[1] = self.rotate90.rotate_both(temp3[1]); temp4[1] = self.rotate90.rotate_both(temp4[1]); temp5[1] = self.rotate90.rotate_both(temp5[1]); temp6[1] = self.rotate90.rotate_both(temp6[1]); temp7[1] = self.rotate90.rotate_both(temp7[1]); //step 5: copy/add/subtract data back to buffer [ f32x4_add(evens[0], temp0[0]), f32x4_add(evens[1], temp1[0]), f32x4_add(evens[2], temp2[0]), f32x4_add(evens[3], temp3[0]), f32x4_add(evens[4], temp4[0]), f32x4_add(evens[5], temp5[0]), f32x4_add(evens[6], temp6[0]), f32x4_add(evens[7], temp7[0]), f32x4_add(evens[8], temp0[1]), f32x4_add(evens[9], temp1[1]), f32x4_add(evens[10], temp2[1]), f32x4_add(evens[11], temp3[1]), f32x4_add(evens[12], temp4[1]), f32x4_add(evens[13], temp5[1]), f32x4_add(evens[14], temp6[1]), f32x4_add(evens[15], temp7[1]), f32x4_sub(evens[0], temp0[0]), f32x4_sub(evens[1], temp1[0]), f32x4_sub(evens[2], temp2[0]), f32x4_sub(evens[3], temp3[0]), f32x4_sub(evens[4], temp4[0]), f32x4_sub(evens[5], temp5[0]), f32x4_sub(evens[6], temp6[0]), f32x4_sub(evens[7], temp7[0]), f32x4_sub(evens[8], temp0[1]), f32x4_sub(evens[9], temp1[1]), f32x4_sub(evens[10], temp2[1]), f32x4_sub(evens[11], temp3[1]), f32x4_sub(evens[12], temp4[1]), f32x4_sub(evens[13], temp5[1]), f32x4_sub(evens[14], temp6[1]), f32x4_sub(evens[15], temp7[1]), ] } } // _________ __ _ _ _ _ _ // |___ /___ \ / /_ | || | | |__ (_) |_ // |_ \ __) | _____ | '_ \| || |_| '_ \| | __| // ___) / __/ |_____| | (_) |__ _| |_) | | |_ // |____/_____| \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly32 { direction: FftDirection, bf8: WasmSimdF64Butterfly8, bf16: WasmSimdF64Butterfly16, rotate90: Rotate90F64, twiddle1: v128, twiddle2: v128, twiddle3: v128, twiddle4: v128, twiddle5: v128, twiddle6: v128, twiddle7: v128, twiddle1c: v128, twiddle2c: v128, twiddle3c: v128, twiddle4c: v128, twiddle5c: v128, twiddle6c: v128, twiddle7c: v128, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly32, 32, |this: &WasmSimdF64Butterfly32<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly32, 32, |this: &WasmSimdF64Butterfly32<_>| this.direction ); impl WasmSimdF64Butterfly32 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let bf8 = WasmSimdF64Butterfly8::new(direction); let bf16 = WasmSimdF64Butterfly16::new(direction); let rotate90 = if direction == FftDirection::Inverse { Rotate90F64::new(true) } else { Rotate90F64::new(false) }; let twiddle1 = unsafe { v128_load( &twiddles::compute_twiddle::(1, 32, direction) as *const _ as *const v128, ) }; let twiddle2 = unsafe { v128_load( &twiddles::compute_twiddle::(2, 32, direction) as *const _ as *const v128, ) }; let twiddle3 = unsafe { v128_load( &twiddles::compute_twiddle::(3, 32, direction) as 
*const _ as *const v128, ) }; let twiddle4 = unsafe { v128_load( &twiddles::compute_twiddle::(4, 32, direction) as *const _ as *const v128, ) }; let twiddle5 = unsafe { v128_load( &twiddles::compute_twiddle::(5, 32, direction) as *const _ as *const v128, ) }; let twiddle6 = unsafe { v128_load( &twiddles::compute_twiddle::(6, 32, direction) as *const _ as *const v128, ) }; let twiddle7 = unsafe { v128_load( &twiddles::compute_twiddle::(7, 32, direction) as *const _ as *const v128, ) }; let twiddle1c = unsafe { v128_load( &twiddles::compute_twiddle::(1, 32, direction).conj() as *const _ as *const v128, ) }; let twiddle2c = unsafe { v128_load( &twiddles::compute_twiddle::(2, 32, direction).conj() as *const _ as *const v128, ) }; let twiddle3c = unsafe { v128_load( &twiddles::compute_twiddle::(3, 32, direction).conj() as *const _ as *const v128, ) }; let twiddle4c = unsafe { v128_load( &twiddles::compute_twiddle::(4, 32, direction).conj() as *const _ as *const v128, ) }; let twiddle5c = unsafe { v128_load( &twiddles::compute_twiddle::(5, 32, direction).conj() as *const _ as *const v128, ) }; let twiddle6c = unsafe { v128_load( &twiddles::compute_twiddle::(6, 32, direction).conj() as *const _ as *const v128, ) }; let twiddle7c = unsafe { v128_load( &twiddles::compute_twiddle::(7, 32, direction).conj() as *const _ as *const v128, ) }; Self { direction, bf8, bf16, rotate90, twiddle1, twiddle2, twiddle3, twiddle4, twiddle5, twiddle6, twiddle7, twiddle1c, twiddle2c, twiddle3c, twiddle4c, twiddle5c, twiddle6c, twiddle7c, } } #[inline(always)] unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}); } #[inline(always)] unsafe fn perform_fft_direct(&self, input: [v128; 32]) -> [v128; 32] { // we're going to hardcode a step of split radix // step 1: copy and reorder the input into the scratch // and // step 2: column FFTs let evens = self.bf16.perform_fft_direct([ input[0], input[2], input[4], input[6], input[8], input[10], input[12], input[14], input[16], input[18], input[20], input[22], input[24], input[26], input[28], input[30], ]); let mut odds1 = self.bf8.perform_fft_direct([ input[1], input[5], input[9], input[13], input[17], input[21], input[25], input[29], ]); let mut odds3 = self.bf8.perform_fft_direct([ input[31], input[3], input[7], input[11], input[15], input[19], input[23], input[27], ]); // step 3: apply twiddle factors odds1[1] = mul_complex_f64(odds1[1], self.twiddle1); odds3[1] = mul_complex_f64(odds3[1], self.twiddle1c); odds1[2] = mul_complex_f64(odds1[2], self.twiddle2); odds3[2] = mul_complex_f64(odds3[2], self.twiddle2c); odds1[3] = mul_complex_f64(odds1[3], self.twiddle3); odds3[3] = mul_complex_f64(odds3[3], self.twiddle3c); odds1[4] = mul_complex_f64(odds1[4], self.twiddle4); odds3[4] = mul_complex_f64(odds3[4], self.twiddle4c); odds1[5] = mul_complex_f64(odds1[5], self.twiddle5); odds3[5] = mul_complex_f64(odds3[5], self.twiddle5c); odds1[6] = mul_complex_f64(odds1[6], self.twiddle6); odds3[6] = mul_complex_f64(odds3[6], self.twiddle6c); odds1[7] = mul_complex_f64(odds1[7], self.twiddle7); odds3[7] = mul_complex_f64(odds3[7], self.twiddle7c); // step 4: cross FFTs let mut temp0 = 
solo_fft2_f64(odds1[0], odds3[0]); let mut temp1 = solo_fft2_f64(odds1[1], odds3[1]); let mut temp2 = solo_fft2_f64(odds1[2], odds3[2]); let mut temp3 = solo_fft2_f64(odds1[3], odds3[3]); let mut temp4 = solo_fft2_f64(odds1[4], odds3[4]); let mut temp5 = solo_fft2_f64(odds1[5], odds3[5]); let mut temp6 = solo_fft2_f64(odds1[6], odds3[6]); let mut temp7 = solo_fft2_f64(odds1[7], odds3[7]); // apply the butterfly 4 twiddle factor, which is just a rotation temp0[1] = self.rotate90.rotate(temp0[1]); temp1[1] = self.rotate90.rotate(temp1[1]); temp2[1] = self.rotate90.rotate(temp2[1]); temp3[1] = self.rotate90.rotate(temp3[1]); temp4[1] = self.rotate90.rotate(temp4[1]); temp5[1] = self.rotate90.rotate(temp5[1]); temp6[1] = self.rotate90.rotate(temp6[1]); temp7[1] = self.rotate90.rotate(temp7[1]); //step 5: copy/add/subtract data back to buffer [ f64x2_add(evens[0], temp0[0]), f64x2_add(evens[1], temp1[0]), f64x2_add(evens[2], temp2[0]), f64x2_add(evens[3], temp3[0]), f64x2_add(evens[4], temp4[0]), f64x2_add(evens[5], temp5[0]), f64x2_add(evens[6], temp6[0]), f64x2_add(evens[7], temp7[0]), f64x2_add(evens[8], temp0[1]), f64x2_add(evens[9], temp1[1]), f64x2_add(evens[10], temp2[1]), f64x2_add(evens[11], temp3[1]), f64x2_add(evens[12], temp4[1]), f64x2_add(evens[13], temp5[1]), f64x2_add(evens[14], temp6[1]), f64x2_add(evens[15], temp7[1]), f64x2_sub(evens[0], temp0[0]), f64x2_sub(evens[1], temp1[0]), f64x2_sub(evens[2], temp2[0]), f64x2_sub(evens[3], temp3[0]), f64x2_sub(evens[4], temp4[0]), f64x2_sub(evens[5], temp5[0]), f64x2_sub(evens[6], temp6[0]), f64x2_sub(evens[7], temp7[0]), f64x2_sub(evens[8], temp0[1]), f64x2_sub(evens[9], temp1[1]), f64x2_sub(evens[10], temp2[1]), f64x2_sub(evens[11], temp3[1]), f64x2_sub(evens[12], temp4[1]), f64x2_sub(evens[13], temp5[1]), f64x2_sub(evens[14], temp6[1]), f64x2_sub(evens[15], temp7[1]), ] } } #[cfg(test)] mod unit_tests { use super::*; use crate::algorithm::Dft; use crate::test_utils::{check_fft_algorithm, compare_vectors}; use wasm_bindgen_test::wasm_bindgen_test; //the tests for all butterflies will be identical except for the identifiers used and size //so it's ideal for a macro macro_rules! 
test_butterfly_32_func { ($test_name:ident, $struct_name:ident, $size:expr) => { #[wasm_bindgen_test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward); check_fft_algorithm::(&butterfly, $size, FftDirection::Forward); let butterfly_direction = $struct_name::new(FftDirection::Inverse); check_fft_algorithm::(&butterfly_direction, $size, FftDirection::Inverse); } }; } test_butterfly_32_func!(test_wasm_simdf32_butterfly1, WasmSimdF32Butterfly1, 1); test_butterfly_32_func!(test_wasm_simdf32_butterfly2, WasmSimdF32Butterfly2, 2); test_butterfly_32_func!(test_wasm_simdf32_butterfly3, WasmSimdF32Butterfly3, 3); test_butterfly_32_func!(test_wasm_simdf32_butterfly4, WasmSimdF32Butterfly4, 4); test_butterfly_32_func!(test_wasm_simdf32_butterfly5, WasmSimdF32Butterfly5, 5); test_butterfly_32_func!(test_wasm_simdf32_butterfly6, WasmSimdF32Butterfly6, 6); test_butterfly_32_func!(test_wasm_simdf32_butterfly8, WasmSimdF32Butterfly8, 8); test_butterfly_32_func!(test_wasm_simdf32_butterfly9, WasmSimdF32Butterfly9, 9); test_butterfly_32_func!(test_wasm_simdf32_butterfly10, WasmSimdF32Butterfly10, 10); test_butterfly_32_func!(test_wasm_simdf32_butterfly12, WasmSimdF32Butterfly12, 12); test_butterfly_32_func!(test_wasm_simdf32_butterfly15, WasmSimdF32Butterfly15, 15); test_butterfly_32_func!(test_wasm_simdf32_butterfly16, WasmSimdF32Butterfly16, 16); test_butterfly_32_func!(test_wasm_simdf32_butterfly32, WasmSimdF32Butterfly32, 32); //the tests for all butterflies will be identical except for the identifiers used and size //so it's ideal for a macro macro_rules! test_butterfly_64_func { ($test_name:ident, $struct_name:ident, $size:expr) => { #[wasm_bindgen_test] fn $test_name() { let butterfly = $struct_name::new(FftDirection::Forward); check_fft_algorithm::(&butterfly, $size, FftDirection::Forward); let butterfly_direction = $struct_name::new(FftDirection::Inverse); check_fft_algorithm::(&butterfly_direction, $size, FftDirection::Inverse); } }; } test_butterfly_64_func!(test_wasm_simdf64_butterfly1, WasmSimdF64Butterfly1, 1); test_butterfly_64_func!(test_wasm_simdf64_butterfly2, WasmSimdF64Butterfly2, 2); test_butterfly_64_func!(test_wasm_simdf64_butterfly3, WasmSimdF64Butterfly3, 3); test_butterfly_64_func!(test_wasm_simdf64_butterfly4, WasmSimdF64Butterfly4, 4); test_butterfly_64_func!(test_wasm_simdf64_butterfly5, WasmSimdF64Butterfly5, 5); test_butterfly_64_func!(test_wasm_simdf64_butterfly6, WasmSimdF64Butterfly6, 6); test_butterfly_64_func!(test_wasm_simdf64_butterfly8, WasmSimdF64Butterfly8, 8); test_butterfly_64_func!(test_wasm_simdf64_butterfly9, WasmSimdF64Butterfly9, 9); test_butterfly_64_func!(test_wasm_simdf64_butterfly10, WasmSimdF64Butterfly10, 10); test_butterfly_64_func!(test_wasm_simdf64_butterfly12, WasmSimdF64Butterfly12, 12); test_butterfly_64_func!(test_wasm_simdf64_butterfly15, WasmSimdF64Butterfly15, 15); test_butterfly_64_func!(test_wasm_simdf64_butterfly16, WasmSimdF64Butterfly16, 16); test_butterfly_64_func!(test_wasm_simdf64_butterfly32, WasmSimdF64Butterfly32, 32); #[wasm_bindgen_test] fn test_solo_fft2_32() { unsafe { let val1 = Complex::::new(1.0, 2.5); let val2 = Complex::::new(3.2, 4.2); let mut val = vec![val1, val2]; let in_packed = v128_load(val.as_ptr() as *const v128); let dft = Dft::new(2, FftDirection::Forward); let bf2 = WasmSimdF32Butterfly2::::new(FftDirection::Forward); dft.process(&mut val); let res_packed = bf2.perform_fft_direct(in_packed); let res = std::mem::transmute::; 2]>(res_packed); assert_eq!(val[0], res[0]); assert_eq!(val[1], 
res[1]); } } #[wasm_bindgen_test] fn test_parallel_fft2_32() { unsafe { let val_a1 = Complex::::new(1.0, 2.5); let val_a2 = Complex::::new(3.2, 4.2); let val_b1 = Complex::::new(6.0, 24.5); let val_b2 = Complex::::new(4.3, 34.2); let mut val_a = vec![val_a1, val_a2]; let mut val_b = vec![val_b1, val_b2]; let p1 = v128_load(val_a.as_ptr() as *const v128); let p2 = v128_load(val_b.as_ptr() as *const v128); let dft = Dft::new(2, FftDirection::Forward); let bf2 = WasmSimdF32Butterfly2::::new(FftDirection::Forward); dft.process(&mut val_a); dft.process(&mut val_b); let res_both = bf2.perform_parallel_fft_direct(p1, p2); let res = std::mem::transmute::<[v128; 2], [Complex; 4]>(res_both); let wasmsimd_res_a = [res[0], res[2]]; let wasmsimd_res_b = [res[1], res[3]]; assert!(compare_vectors(&val_a, &wasmsimd_res_a)); assert!(compare_vectors(&val_b, &wasmsimd_res_b)); } } } rustfft-6.2.0/src/wasm_simd/wasm_simd_common.rs000064400000000000000000000230670072674642500200410ustar 00000000000000use std::any::TypeId; /// Calculate the sum of an expression consisting of just plus and minus, like `value = a + b - c + d`. /// The expression is rewritten to `value = a + (b - (c - d))` (note the flipped sign on d). /// After this the `$add` and `$sub` functions are used to make the calculation. /// For f32 using `f32x4_add` and `f32x4_sub`, the expression `value = a + b - c + d` becomes: /// ``` /// let value = f32x4_add(a, f32x4_sub(b, f32x4_sub(c, d))); /// ``` /// Only plus and minus are supported, and all the terms must be plain scalar variables. /// Using array indices, like `value = temp[0] + temp[1]` is not supported. macro_rules! calc_sum { ($add:ident, $sub:ident, + $acc:tt + $($rest:tt)*)=> { $add($acc, calc_sum!($add, $sub, + $($rest)*)) }; ($add:ident, $sub:ident, + $acc:tt - $($rest:tt)*)=> { $sub($acc, calc_sum!($add, $sub, - $($rest)*)) }; ($add:ident, $sub:ident, - $acc:tt + $($rest:tt)*)=> { $sub($acc, calc_sum!($add, $sub, + $($rest)*)) }; ($add:ident, $sub:ident, - $acc:tt - $($rest:tt)*)=> { $add($acc, calc_sum!($add, $sub, - $($rest)*)) }; ($add:ident, $sub:ident, $acc:tt + $($rest:tt)*)=> { $add($acc, calc_sum!($add, $sub, + $($rest)*)) }; ($add:ident, $sub:ident, $acc:tt - $($rest:tt)*)=> { $sub($acc, calc_sum!($add, $sub, - $($rest)*)) }; ($add:ident, $sub:ident, + $val:tt) => {$val}; ($add:ident, $sub:ident, - $val:tt) => {$val}; } /// Calculate the sum of an expression consisting of just plus and minus, like a + b - c + d macro_rules! calc_f32 { ($($tokens:tt)*) => { calc_sum!(f32x4_add, f32x4_sub, $($tokens)*)}; } /// Calculate the sum of an expression consisting of just plus and minus, like a + b - c + d macro_rules! calc_f64 { ($($tokens:tt)*) => { calc_sum!(f64x2_add, f64x2_sub, $($tokens)*)}; } /// Helper function to assert we have the right float type pub fn assert_f32() { let id_f32 = TypeId::of::(); let id_t = TypeId::of::(); assert!(id_t == id_f32, "Wrong float type, must be f32"); } /// Helper function to assert we have the right float type pub fn assert_f64() { let id_f64 = TypeId::of::(); let id_t = TypeId::of::(); assert!(id_t == id_f64, "Wrong float type, must be f64"); } /// Shuffle elements to interleave two contiguous sets of f32, from an array of simd vectors to a new array of simd vectors macro_rules! 
interleave_complex_f32 { ($input:ident, $offset:literal, { $($idx:literal),* }) => { [ $( extract_lo_lo_f32($input[$idx], $input[$idx+$offset]), extract_hi_hi_f32($input[$idx], $input[$idx+$offset]), )* ] } } /// Shuffle elements to interleave two contiguous sets of f32, from an array of simd vectors to a new array of simd vectors /// This statement: /// ``` /// let values = separate_interleaved_complex_f32!(input, {0, 2, 4}); /// ``` /// is equivalent to: /// ``` /// let values = [ /// extract_lo_lo_f32(input[0], input[1]), /// extract_lo_lo_f32(input[2], input[3]), /// extract_lo_lo_f32(input[4], input[5]), /// extract_hi_hi_f32(input[0], input[1]), /// extract_hi_hi_f32(input[2], input[3]), /// extract_hi_hi_f32(input[4], input[5]), /// ]; macro_rules! separate_interleaved_complex_f32 { ($input:ident, { $($idx:literal),* }) => { [ $( extract_lo_lo_f32($input[$idx], $input[$idx+1]), )* $( extract_hi_hi_f32($input[$idx], $input[$idx+1]), )* ] } } macro_rules! boilerplate_fft_wasm_simd_oop { ($struct_name:ident, $len_fn:expr) => { impl Fft for $struct_name { fn process_outofplace_with_scratch( &self, input: &mut [Complex], output: &mut [Complex], _scratch: &mut [Complex], ) { if self.len() == 0 { return; } if input.len() < self.len() || output.len() != input.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here } let result = unsafe { array_utils::iter_chunks_zipped( input, output, self.len(), |in_chunk, out_chunk| { self.perform_fft_out_of_place(in_chunk, out_chunk, &mut []) }, ) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); } } fn process_with_scratch(&self, buffer: &mut [Complex], scratch: &mut [Complex]) { if self.len() == 0 { return; } let required_scratch = self.get_inplace_scratch_len(); if scratch.len() < required_scratch || buffer.len() < self.len() { // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); return; // Unreachable, because fft_error_inplace asserts, but it helps codegen to put it here } let scratch = &mut scratch[..required_scratch]; let result = unsafe { array_utils::iter_chunks(buffer, self.len(), |chunk| { self.perform_fft_out_of_place(chunk, scratch, &mut []); chunk.copy_from_slice(scratch); }) }; if result.is_err() { // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us fft_error_inplace( self.len(), buffer.len(), self.get_inplace_scratch_len(), scratch.len(), ); } } #[inline(always)] fn get_inplace_scratch_len(&self) -> usize { self.len() } #[inline(always)] fn get_outofplace_scratch_len(&self) -> usize { 0 } } impl Length for $struct_name { #[inline(always)] fn len(&self) -> usize { $len_fn(self) } } 
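// Editor's note on the Fft impl generated above: get_inplace_scratch_len()
// returns self.len() because process_with_scratch computes each chunk out of
// place into the scratch buffer and then copies the result back over the
// input chunk, so the scratch must hold one full FFT of data.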
impl Direction for $struct_name { #[inline(always)] fn fft_direction(&self) -> FftDirection { self.direction } } }; } #[cfg(test)] mod unit_tests { use core::arch::wasm32::*; use wasm_bindgen_test::wasm_bindgen_test; #[wasm_bindgen_test] fn test_calc_f32() { unsafe { let a = f32x4(1.0, 1.0, 1.0, 1.0); let b = f32x4(2.0, 2.0, 2.0, 2.0); let c = f32x4(3.0, 3.0, 3.0, 3.0); let d = f32x4(4.0, 4.0, 4.0, 4.0); let e = f32x4(5.0, 5.0, 5.0, 5.0); let f = f32x4(6.0, 6.0, 6.0, 6.0); let g = f32x4(7.0, 7.0, 7.0, 7.0); let h = f32x4(8.0, 8.0, 8.0, 8.0); let i = f32x4(9.0, 9.0, 9.0, 9.0); let expected: f32 = 1.0 + 2.0 - 3.0 + 4.0 - 5.0 + 6.0 - 7.0 - 8.0 + 9.0; let res = calc_f32!(a + b - c + d - e + f - g - h + i); let sum = std::mem::transmute::(res); assert_eq!(sum[0], expected); assert_eq!(sum[1], expected); assert_eq!(sum[2], expected); assert_eq!(sum[3], expected); } } #[wasm_bindgen_test] fn test_calc_f64() { unsafe { let a = f64x2(1.0, 1.0); let b = f64x2(2.0, 2.0); let c = f64x2(3.0, 3.0); let d = f64x2(4.0, 4.0); let e = f64x2(5.0, 5.0); let f = f64x2(6.0, 6.0); let g = f64x2(7.0, 7.0); let h = f64x2(8.0, 8.0); let i = f64x2(9.0, 9.0); let expected: f64 = 1.0 + 2.0 - 3.0 + 4.0 - 5.0 + 6.0 - 7.0 - 8.0 + 9.0; let res = calc_f64!(a + b - c + d - e + f - g - h + i); let sum = std::mem::transmute::(res); assert_eq!(sum[0], expected); assert_eq!(sum[1], expected); } } } rustfft-6.2.0/src/wasm_simd/wasm_simd_planner.rs000064400000000000000000001031620072674642500202030ustar 00000000000000use num_integer::gcd; use crate::algorithm::{ BluesteinsAlgorithm, Dft, GoodThomasAlgorithm, GoodThomasAlgorithmSmall, MixedRadix, MixedRadixSmall, RadersAlgorithm, }; use crate::math_utils::PrimeFactor; use crate::wasm_simd::*; use crate::{fft_cache::FftCache, math_utils::PrimeFactors, Fft, FftDirection, FftNum}; use std::{any::TypeId, collections::HashMap, sync::Arc}; const MIN_RADIX4_BITS: u32 = 6; // smallest size to consider radix 4 an option is 2^6 = 64 const MAX_RADER_PRIME_FACTOR: usize = 23; // don't use Raders if the inner fft length has prime factor larger than this const MIN_BLUESTEIN_MIXED_RADIX_LEN: usize = 90; // only use mixed radix for the inner fft of Bluestein if length is larger than this /// A Recipe is a structure that describes the design of a FFT, without actually creating it. /// It is used as a middle step in the planning process. 
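// Editor's note (worked example, traced from design_fft_with_factors below):
// a length-120 FFT is first described as
//   Recipe::MixedRadixSmall { left_fft: Butterfly10, right_fft: Butterfly12 }
// and only turned into actual Fft instances later by build_fft / build_new_fft.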
#[derive(Debug, PartialEq, Clone)] pub enum Recipe { Dft(usize), MixedRadix { left_fft: Arc, right_fft: Arc, }, #[allow(dead_code)] GoodThomasAlgorithm { left_fft: Arc, right_fft: Arc, }, MixedRadixSmall { left_fft: Arc, right_fft: Arc, }, GoodThomasAlgorithmSmall { left_fft: Arc, right_fft: Arc, }, RadersAlgorithm { inner_fft: Arc, }, BluesteinsAlgorithm { len: usize, inner_fft: Arc, }, Radix4(usize), Butterfly1, Butterfly2, Butterfly3, Butterfly4, Butterfly5, Butterfly6, Butterfly7, Butterfly8, Butterfly9, Butterfly10, Butterfly11, Butterfly12, Butterfly13, Butterfly15, Butterfly16, Butterfly17, Butterfly19, Butterfly23, Butterfly29, Butterfly31, Butterfly32, } impl Recipe { pub fn len(&self) -> usize { match self { Recipe::Dft(length) => *length, Recipe::Radix4(length) => *length, Recipe::Butterfly1 => 1, Recipe::Butterfly2 => 2, Recipe::Butterfly3 => 3, Recipe::Butterfly4 => 4, Recipe::Butterfly5 => 5, Recipe::Butterfly6 => 6, Recipe::Butterfly7 => 7, Recipe::Butterfly8 => 8, Recipe::Butterfly9 => 9, Recipe::Butterfly10 => 10, Recipe::Butterfly11 => 11, Recipe::Butterfly12 => 12, Recipe::Butterfly13 => 13, Recipe::Butterfly15 => 15, Recipe::Butterfly16 => 16, Recipe::Butterfly17 => 17, Recipe::Butterfly19 => 19, Recipe::Butterfly23 => 23, Recipe::Butterfly29 => 29, Recipe::Butterfly31 => 31, Recipe::Butterfly32 => 32, Recipe::MixedRadix { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::GoodThomasAlgorithm { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::MixedRadixSmall { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, } => left_fft.len() * right_fft.len(), Recipe::RadersAlgorithm { inner_fft } => inner_fft.len() + 1, Recipe::BluesteinsAlgorithm { len, .. } => *len, } } } /// The WASM FFT planner creates new FFT algorithm instances using a mix of scalar and WASM SIMD accelerated algorithms. /// WASM SIMD is supported when using fairly recent browser versions as outlined in [the WebAssembly roadmap](https://webassembly.org/roadmap/). /// /// RustFFT has several FFT algorithms available. For a given FFT size, `FftPlannerWasmSimd` decides which of the /// available FFT algorithms to use and then initializes them. /// /// ~~~ /// // Perform a forward Fft of size 1234 /// use std::sync::Arc; /// use rustfft::{FftPlannerWasmSimd, num_complex::Complex}; /// /// if let Ok(mut planner) = FftPlannerWasmSimd::new() { /// let fft = planner.plan_fft_forward(1234); /// /// let mut buffer = vec![Complex{ re: 0.0f32, im: 0.0f32 }; 1234]; /// fft.process(&mut buffer); /// /// // The FFT instance returned by the planner has the type `Arc>`, /// // where T is the numeric type, ie f32 or f64, so it's cheap to clone /// let fft_clone = Arc::clone(&fft); /// } /// ~~~ /// /// If you plan on creating multiple FFT instances, it is recommended to re-use the same planner for all of them. This /// is because the planner re-uses internal data across FFT instances wherever possible, saving memory and reducing /// setup time. (FFT instances created with one planner will never re-use data and buffers with FFT instances created /// by a different planner) /// /// Each FFT instance owns [`Arc`s](std::sync::Arc) to its internal data, rather than borrowing it from the planner, so it's perfectly /// safe to drop the planner after creating Fft instances. pub struct FftPlannerWasmSimd { algorithm_cache: FftCache, recipe_cache: HashMap>, } impl FftPlannerWasmSimd { /// Creates a new `FftPlannerWasmSimd` instance. 
/// /// Returns `Ok(planner_instance)` if we're compiling for the WASM target and WASM SIMD was enabled in feature flags. /// Returns `Err(())` if WASM SIMD support is not available. pub fn new() -> Result { let id_f32 = TypeId::of::(); let id_f64 = TypeId::of::(); let id_t = TypeId::of::(); if id_t != id_f32 && id_t != id_f64 { return Err(()); } Ok(Self { algorithm_cache: FftCache::new(), recipe_cache: HashMap::new(), }) } /// Returns a `Fft` instance which uses WebAssembly SIMD instructions to compute FFTs of size `len`. /// /// If the provided `direction` is `FftDirection::Forward`, the returned instance will compute forward FFTs. If it's `FftDirection::Inverse`, it will compute inverse FFTs. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft(&mut self, len: usize, direction: FftDirection) -> Arc> { let recipe = self.design_fft_for_len(len); self.build_fft(&recipe, direction) } /// Returns a `Fft` instance which uses WebAssembly SIMD instructions to compute forward FFTs of size `len`. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_forward(&mut self, len: usize) -> Arc> { self.plan_fft(len, FftDirection::Forward) } /// Returns a `Fft` instance which uses WebAssembly SIMD instructions to compute inverse FFTs of size `len. /// /// If this is called multiple times, the planner will attempt to re-use internal data between calls, reducing memory usage and FFT initialization time. pub fn plan_fft_inverse(&mut self, _len: usize) -> Arc> { self.plan_fft(_len, FftDirection::Inverse) } } impl FftPlannerWasmSimd { fn design_fft_for_len(&mut self, len: usize) -> Arc { if len < 1 { Arc::new(Recipe::Dft(len)) } else if let Some(recipe) = self.recipe_cache.get(&len) { Arc::clone(&recipe) } else { let factors = PrimeFactors::compute(len); let recipe = self.design_fft_with_factors(len, factors); self.recipe_cache.insert(len, Arc::clone(&recipe)); recipe } } fn build_fft(&mut self, recipe: &Recipe, direction: FftDirection) -> Arc> { let len = recipe.len(); if let Some(instance) = self.algorithm_cache.get(len, direction) { instance } else { let fft = self.build_new_fft(recipe, direction); self.algorithm_cache.insert(&fft); fft } } fn build_new_fft(&mut self, recipe: &Recipe, direction: FftDirection) -> Arc> { let id_f32 = TypeId::of::(); let id_f64 = TypeId::of::(); let id_t = TypeId::of::(); match recipe { Recipe::Dft(len) => Arc::new(Dft::new(*len, direction)) as Arc>, Recipe::Radix4(len) => { if id_t == id_f32 { Arc::new(WasmSimd32Radix4::new(*len, direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimd64Radix4::new(*len, direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly1 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly1::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly1::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly2 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly2::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly2::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly3 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly3::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly3::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly4 
=> { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly4::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly4::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly5 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly5::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly5::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly6 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly6::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly6::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly7 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly7::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly7::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly8 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly8::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly8::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly9 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly9::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly9::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly10 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly10::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly10::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly11 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly11::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly11::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly12 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly12::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly12::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly13 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly13::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly13::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly15 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly15::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly15::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly16 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly16::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly16::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly17 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly17::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly17::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly19 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly19::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly19::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly23 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly23::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly23::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly29 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly29::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly29::new(direction)) as Arc> } else { panic!("Not f32 or 
f64"); } } Recipe::Butterfly31 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly31::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly31::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::Butterfly32 => { if id_t == id_f32 { Arc::new(WasmSimdF32Butterfly32::new(direction)) as Arc> } else if id_t == id_f64 { Arc::new(WasmSimdF64Butterfly32::new(direction)) as Arc> } else { panic!("Not f32 or f64"); } } Recipe::MixedRadix { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(MixedRadix::new(left_fft, right_fft)) as Arc> } Recipe::GoodThomasAlgorithm { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(GoodThomasAlgorithm::new(left_fft, right_fft)) as Arc> } Recipe::MixedRadixSmall { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(MixedRadixSmall::new(left_fft, right_fft)) as Arc> } Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, } => { let left_fft = self.build_fft(&left_fft, direction); let right_fft = self.build_fft(&right_fft, direction); Arc::new(GoodThomasAlgorithmSmall::new(left_fft, right_fft)) as Arc> } Recipe::RadersAlgorithm { inner_fft } => { let inner_fft = self.build_fft(&inner_fft, direction); Arc::new(RadersAlgorithm::new(inner_fft)) as Arc> } Recipe::BluesteinsAlgorithm { len, inner_fft } => { let inner_fft = self.build_fft(&inner_fft, direction); Arc::new(BluesteinsAlgorithm::new(*len, inner_fft)) as Arc> } } } fn design_fft_with_factors(&mut self, len: usize, factors: PrimeFactors) -> Arc { if let Some(fft_instance) = self.design_butterfly_algorithm(len) { fft_instance } else if factors.is_prime() { self.design_prime(len) } else if len.trailing_zeros() >= MIN_RADIX4_BITS { if len.is_power_of_two() { Arc::new(Recipe::Radix4(len)) } else { let non_power_of_two = factors .remove_factors(PrimeFactor { value: 2, count: len.trailing_zeros(), }) .unwrap(); let power_of_two = PrimeFactors::compute(1 << len.trailing_zeros()); self.design_mixed_radix(power_of_two, non_power_of_two) } } else { // Can we do this as a mixed radix with just two butterflies? // Loop through and find all combinations // If more than one is found, keep the one where the factors are closer together. // For example length 20 where 10x2 and 5x4 are possible, we use 5x4. let butterflies: [usize; 20] = [ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 19, 23, 29, 31, 32, ]; let mut bf_left = 0; let mut bf_right = 0; // If the length is below 14, or over 1024 we don't need to try this. if len > 13 && len <= 1024 { for (n, bf_l) in butterflies.iter().enumerate() { if len % bf_l == 0 { let bf_r = len / bf_l; if butterflies.iter().skip(n).any(|&m| m == bf_r) { bf_left = *bf_l; bf_right = bf_r; } } } if bf_left > 0 { let fact_l = PrimeFactors::compute(bf_left); let fact_r = PrimeFactors::compute(bf_right); return self.design_mixed_radix(fact_l, fact_r); } } // Not possible with just butterflies, go with the general solution. 
let (left_factors, right_factors) = factors.partition_factors(); self.design_mixed_radix(left_factors, right_factors) } } fn design_mixed_radix( &mut self, left_factors: PrimeFactors, right_factors: PrimeFactors, ) -> Arc { let left_len = left_factors.get_product(); let right_len = right_factors.get_product(); //neither size is a butterfly, so go with the normal algorithm let left_fft = self.design_fft_with_factors(left_len, left_factors); let right_fft = self.design_fft_with_factors(right_len, right_factors); //if both left_len and right_len are small, use algorithms optimized for small FFTs if left_len < 33 && right_len < 33 { // for small FFTs, if gcd is 1, good-thomas is faster if gcd(left_len, right_len) == 1 { Arc::new(Recipe::GoodThomasAlgorithmSmall { left_fft, right_fft, }) } else { Arc::new(Recipe::MixedRadixSmall { left_fft, right_fft, }) } } else { Arc::new(Recipe::MixedRadix { left_fft, right_fft, }) } } /// Returns Some(instance) if we have a butterfly available for this size. Returns None if there is no butterfly available for this size fn design_butterfly_algorithm(&mut self, len: usize) -> Option> { match len { 1 => Some(Arc::new(Recipe::Butterfly1)), 2 => Some(Arc::new(Recipe::Butterfly2)), 3 => Some(Arc::new(Recipe::Butterfly3)), 4 => Some(Arc::new(Recipe::Butterfly4)), 5 => Some(Arc::new(Recipe::Butterfly5)), 6 => Some(Arc::new(Recipe::Butterfly6)), 7 => Some(Arc::new(Recipe::Butterfly7)), 8 => Some(Arc::new(Recipe::Butterfly8)), 9 => Some(Arc::new(Recipe::Butterfly9)), 10 => Some(Arc::new(Recipe::Butterfly10)), 11 => Some(Arc::new(Recipe::Butterfly11)), 12 => Some(Arc::new(Recipe::Butterfly12)), 13 => Some(Arc::new(Recipe::Butterfly13)), 15 => Some(Arc::new(Recipe::Butterfly15)), 16 => Some(Arc::new(Recipe::Butterfly16)), 17 => Some(Arc::new(Recipe::Butterfly17)), 19 => Some(Arc::new(Recipe::Butterfly19)), 23 => Some(Arc::new(Recipe::Butterfly23)), 29 => Some(Arc::new(Recipe::Butterfly29)), 31 => Some(Arc::new(Recipe::Butterfly31)), 32 => Some(Arc::new(Recipe::Butterfly32)), _ => None, } } fn design_prime(&mut self, len: usize) -> Arc { let inner_fft_len_rader = len - 1; let raders_factors = PrimeFactors::compute(inner_fft_len_rader); // If any of the prime factors is too large, Rader's gets slow and Bluestein's is the better choice if raders_factors .get_other_factors() .iter() .any(|val| val.value > MAX_RADER_PRIME_FACTOR) { let inner_fft_len_pow2 = (2 * len - 1).checked_next_power_of_two().unwrap(); // for long ffts a mixed radix inner fft is faster than a longer radix4 let min_inner_len = 2 * len - 1; let mixed_radix_len = 3 * inner_fft_len_pow2 / 4; let inner_fft = if mixed_radix_len >= min_inner_len && len >= MIN_BLUESTEIN_MIXED_RADIX_LEN { let mixed_radix_factors = PrimeFactors::compute(mixed_radix_len); self.design_fft_with_factors(mixed_radix_len, mixed_radix_factors) } else { Arc::new(Recipe::Radix4(inner_fft_len_pow2)) }; Arc::new(Recipe::BluesteinsAlgorithm { len, inner_fft }) } else { let inner_fft = self.design_fft_with_factors(inner_fft_len_rader, raders_factors); Arc::new(Recipe::RadersAlgorithm { inner_fft }) } } } #[cfg(test)] mod unit_tests { use super::*; use wasm_bindgen_test::*; fn is_mixedradix(plan: &Recipe) -> bool { match plan { &Recipe::MixedRadix { .. } => true, _ => false, } } fn is_mixedradixsmall(plan: &Recipe) -> bool { match plan { &Recipe::MixedRadixSmall { .. } => true, _ => false, } } fn is_goodthomassmall(plan: &Recipe) -> bool { match plan { &Recipe::GoodThomasAlgorithmSmall { .. 
} => true, _ => false, } } fn is_raders(plan: &Recipe) -> bool { match plan { &Recipe::RadersAlgorithm { .. } => true, _ => false, } } fn is_bluesteins(plan: &Recipe) -> bool { match plan { &Recipe::BluesteinsAlgorithm { .. } => true, _ => false, } } #[wasm_bindgen_test] fn test_plan_sse_trivial() { // Length 0 and 1 should use Dft let mut planner = FftPlannerWasmSimd::::new().unwrap(); for len in 0..1 { let plan = planner.design_fft_for_len(len); assert_eq!(*plan, Recipe::Dft(len)); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } #[wasm_bindgen_test] fn test_plan_sse_largepoweroftwo() { // Powers of 2 above 6 should use Radix4 let mut planner = FftPlannerWasmSimd::::new().unwrap(); for pow in 6..32 { let len = 1 << pow; let plan = planner.design_fft_for_len(len); assert_eq!(*plan, Recipe::Radix4(len)); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } #[wasm_bindgen_test] fn test_plan_sse_butterflies() { // Check that all butterflies are used let mut planner = FftPlannerWasmSimd::::new().unwrap(); assert_eq!(*planner.design_fft_for_len(2), Recipe::Butterfly2); assert_eq!(*planner.design_fft_for_len(3), Recipe::Butterfly3); assert_eq!(*planner.design_fft_for_len(4), Recipe::Butterfly4); assert_eq!(*planner.design_fft_for_len(5), Recipe::Butterfly5); assert_eq!(*planner.design_fft_for_len(6), Recipe::Butterfly6); assert_eq!(*planner.design_fft_for_len(7), Recipe::Butterfly7); assert_eq!(*planner.design_fft_for_len(8), Recipe::Butterfly8); assert_eq!(*planner.design_fft_for_len(9), Recipe::Butterfly9); assert_eq!(*planner.design_fft_for_len(10), Recipe::Butterfly10); assert_eq!(*planner.design_fft_for_len(11), Recipe::Butterfly11); assert_eq!(*planner.design_fft_for_len(12), Recipe::Butterfly12); assert_eq!(*planner.design_fft_for_len(13), Recipe::Butterfly13); assert_eq!(*planner.design_fft_for_len(15), Recipe::Butterfly15); assert_eq!(*planner.design_fft_for_len(16), Recipe::Butterfly16); assert_eq!(*planner.design_fft_for_len(17), Recipe::Butterfly17); assert_eq!(*planner.design_fft_for_len(19), Recipe::Butterfly19); assert_eq!(*planner.design_fft_for_len(23), Recipe::Butterfly23); assert_eq!(*planner.design_fft_for_len(29), Recipe::Butterfly29); assert_eq!(*planner.design_fft_for_len(31), Recipe::Butterfly31); assert_eq!(*planner.design_fft_for_len(32), Recipe::Butterfly32); } #[wasm_bindgen_test] fn test_plan_sse_mixedradix() { // Products of several different primes should become MixedRadix let mut planner = FftPlannerWasmSimd::::new().unwrap(); for pow2 in 2..5 { for pow3 in 2..5 { for pow5 in 2..5 { for pow7 in 2..5 { let len = 2usize.pow(pow2) * 3usize.pow(pow3) * 5usize.pow(pow5) * 7usize.pow(pow7); let plan = planner.design_fft_for_len(len); assert!(is_mixedradix(&plan), "Expected MixedRadix, got {:?}", plan); assert_eq!(plan.len(), len, "Recipe reports wrong length"); } } } } } #[wasm_bindgen_test] fn test_plan_sse_mixedradixsmall() { // Products of two "small" lengths < 31 that have a common divisor >1, and isn't a power of 2 should be MixedRadixSmall let mut planner = FftPlannerWasmSimd::::new().unwrap(); for len in [5 * 20, 5 * 25].iter() { let plan = planner.design_fft_for_len(*len); assert!( is_mixedradixsmall(&plan), "Expected MixedRadixSmall, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[wasm_bindgen_test] fn test_plan_sse_goodthomasbutterfly() { let mut planner = FftPlannerWasmSimd::::new().unwrap(); for len in [3 * 7, 5 * 7, 11 * 13, 2 * 29].iter() { let plan = planner.design_fft_for_len(*len); 
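            // Each pair above (3*7, 5*7, 11*13, 2*29) multiplies two butterfly lengths that are
            // coprime, so design_mixed_radix() should pick Good-Thomas (gcd == 1) rather than
            // plain MixedRadixSmall.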
assert!( is_goodthomassmall(&plan), "Expected GoodThomasAlgorithmSmall, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[wasm_bindgen_test] fn test_plan_sse_bluestein_vs_rader() { let difficultprimes: [usize; 11] = [59, 83, 107, 149, 167, 173, 179, 359, 719, 1439, 2879]; let easyprimes: [usize; 24] = [ 53, 61, 67, 71, 73, 79, 89, 97, 101, 103, 109, 113, 127, 131, 137, 139, 151, 157, 163, 181, 191, 193, 197, 199, ]; let mut planner = FftPlannerWasmSimd::::new().unwrap(); for len in difficultprimes.iter() { let plan = planner.design_fft_for_len(*len); assert!( is_bluesteins(&plan), "Expected BluesteinsAlgorithm, got {:?}", plan ); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } for len in easyprimes.iter() { let plan = planner.design_fft_for_len(*len); assert!(is_raders(&plan), "Expected RadersAlgorithm, got {:?}", plan); assert_eq!(plan.len(), *len, "Recipe reports wrong length"); } } #[wasm_bindgen_test] fn test_sse_fft_cache() { { // Check that FFTs are reused if they're both forward let mut planner = FftPlannerWasmSimd::::new().unwrap(); let fft_a = planner.plan_fft(1234, FftDirection::Forward); let fft_b = planner.plan_fft(1234, FftDirection::Forward); assert!(Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was not reused"); } { // Check that FFTs are reused if they're both inverse let mut planner = FftPlannerWasmSimd::::new().unwrap(); let fft_a = planner.plan_fft(1234, FftDirection::Inverse); let fft_b = planner.plan_fft(1234, FftDirection::Inverse); assert!(Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was not reused"); } { // Check that FFTs are NOT resued if they don't both have the same direction let mut planner = FftPlannerWasmSimd::::new().unwrap(); let fft_a = planner.plan_fft(1234, FftDirection::Forward); let fft_b = planner.plan_fft(1234, FftDirection::Inverse); assert!( !Arc::ptr_eq(&fft_a, &fft_b), "Existing fft was reused, even though directions don't match" ); } } #[wasm_bindgen_test] fn test_sse_recipe_cache() { // Check that all butterflies are used let mut planner = FftPlannerWasmSimd::::new().unwrap(); let fft_a = planner.design_fft_for_len(1234); let fft_b = planner.design_fft_for_len(1234); assert!( Arc::ptr_eq(&fft_a, &fft_b), "Existing recipe was not reused" ); } } rustfft-6.2.0/src/wasm_simd/wasm_simd_prime_butterflies.rs000064400000000000000000013124740072674642500223010ustar 00000000000000/// Auto-generated prime length butterflies /// The code here is mostly autogenerated by the python script tools/gen_sse_butterflies.py, and then translated from SSE to WASM SIMD. /// /// The algorithm is derived directly from the definition of the DFT, by eliminating any repeated calculations. /// See the comments in src/algorithm/butterflies.rs for a detailed description. /// /// The script generates the code for performing a single f64 fft, as well as dual f32 fft. /// It also generates the code for reading and writing the input and output. /// The single 32-bit ffts reuse the dual ffts. 
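// Sketch of the shared structure, written out as a plain scalar reference (illustrative only,
// kept as a comment; `twiddles[m]` is assumed to hold exp(-2*pi*i*m/n) for a forward transform
// of odd prime length n). Each generated butterfly below computes the same quantities, with the
// twiddles splatted into SIMD registers and the +/- signs folded in at generation time:
//
// fn prime_dft_reference(input: &[Complex<f32>], twiddles: &[Complex<f32>]) -> Vec<Complex<f32>> {
//     let n = input.len();
//     let half = (n - 1) / 2;
//     let mut output = vec![Complex::new(0.0, 0.0); n];
//     output[0] = input.iter().fold(Complex::new(0.0, 0.0), |acc, x| acc + *x);
//     for k in 1..=half {
//         let mut t_a = input[0];
//         let mut t_b = Complex::new(0.0, 0.0);
//         for j in 1..=half {
//             let w = twiddles[(j * k) % n];
//             t_a = t_a + (input[j] + input[n - j]) * w.re; // the "t_aK_J" terms
//             t_b = t_b + (input[j] - input[n - j]) * w.im; // the "t_bK_J" terms
//         }
//         let t_b_rot = Complex::new(-t_b.im, t_b.re); // multiply by i (the Rotate90 step)
//         output[k] = t_a + t_b_rot;
//         output[n - k] = t_a - t_b_rot;
//     }
//     output
// }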
use core::arch::wasm32::*; use num_complex::Complex; use crate::{common::FftNum, FftDirection}; use crate::array_utils; use crate::array_utils::workaround_transmute_mut; use crate::array_utils::DoubleBuf; use crate::common::{fft_error_inplace, fft_error_outofplace}; use crate::twiddles; use crate::{Direction, Fft, Length}; use super::wasm_simd_butterflies::{parallel_fft2_interleaved_f32, solo_fft2_f64}; use super::wasm_simd_common::{assert_f32, assert_f64}; use super::wasm_simd_utils::*; use super::wasm_simd_vector::WasmSimdArrayMut; // _____ _________ _ _ _ // |___ | |___ /___ \| |__ (_) |_ // / / _____ |_ \ __) | '_ \| | __| // / / |_____| ___) / __/| |_) | | |_ // /_/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly7 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly7, 7, |this: &WasmSimdF32Butterfly7<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly7, 7, |this: &WasmSimdF32Butterfly7<_>| { this.direction } ); impl WasmSimdF32Butterfly7 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 7, direction); let tw2: Complex = twiddles::compute_twiddle(2, 7, direction); let tw3: Complex = twiddles::compute_twiddle(3, 7, direction); let twiddle1re = f32x4_splat(tw1.re); let twiddle1im = f32x4_splat(tw1.im); let twiddle2re = f32x4_splat(tw2.re); let twiddle2im = f32x4_splat(tw2.im); let twiddle3re = f32x4_splat(tw3.re); let twiddle3im = f32x4_splat(tw3.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[3]), extract_hi_lo_f32(input_packed[0], input_packed[4]), extract_lo_hi_f32(input_packed[1], input_packed[4]), extract_hi_lo_f32(input_packed[1], input_packed[5]), extract_lo_hi_f32(input_packed[2], input_packed[5]), extract_hi_lo_f32(input_packed[2], input_packed[6]), extract_lo_hi_f32(input_packed[3], input_packed[6]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_hi_f32(out[6], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0,1,2,3,4,5,6}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 7]) -> [v128; 7] { let [x1p6, x1m6] = parallel_fft2_interleaved_f32(values[1], values[6]); let [x2p5, x2m5] = parallel_fft2_interleaved_f32(values[2], values[5]); let [x3p4, x3m4] = parallel_fft2_interleaved_f32(values[3], values[4]); let t_a1_1 = f32x4_mul(self.twiddle1re, x1p6); let t_a1_2 = 
f32x4_mul(self.twiddle2re, x2p5); let t_a1_3 = f32x4_mul(self.twiddle3re, x3p4); let t_a2_1 = f32x4_mul(self.twiddle2re, x1p6); let t_a2_2 = f32x4_mul(self.twiddle3re, x2p5); let t_a2_3 = f32x4_mul(self.twiddle1re, x3p4); let t_a3_1 = f32x4_mul(self.twiddle3re, x1p6); let t_a3_2 = f32x4_mul(self.twiddle1re, x2p5); let t_a3_3 = f32x4_mul(self.twiddle2re, x3p4); let t_b1_1 = f32x4_mul(self.twiddle1im, x1m6); let t_b1_2 = f32x4_mul(self.twiddle2im, x2m5); let t_b1_3 = f32x4_mul(self.twiddle3im, x3m4); let t_b2_1 = f32x4_mul(self.twiddle2im, x1m6); let t_b2_2 = f32x4_mul(self.twiddle3im, x2m5); let t_b2_3 = f32x4_mul(self.twiddle1im, x3m4); let t_b3_1 = f32x4_mul(self.twiddle3im, x1m6); let t_b3_2 = f32x4_mul(self.twiddle1im, x2m5); let t_b3_3 = f32x4_mul(self.twiddle2im, x3m4); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3); let t_b2 = calc_f32!(t_b2_1 - t_b2_2 - t_b2_3); let t_b3 = calc_f32!(t_b3_1 - t_b3_2 + t_b3_3); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let y0 = calc_f32!(x0 + x1p6 + x2p5 + x3p4); let [y1, y6] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y5] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y4] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); [y0, y1, y2, y3, y4, y5, y6] } } // _____ __ _ _ _ _ _ // |___ | / /_ | || | | |__ (_) |_ // / / _____ | '_ \| || |_| '_ \| | __| // / / |_____| | (_) |__ _| |_) | | |_ // /_/ \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly7 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly7, 7, |this: &WasmSimdF64Butterfly7<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly7, 7, |this: &WasmSimdF64Butterfly7<_>| { this.direction } ); impl WasmSimdF64Butterfly7 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 7, direction); let tw2: Complex = twiddles::compute_twiddle(2, 7, direction); let tw3: Complex = twiddles::compute_twiddle(3, 7, direction); let twiddle1re = f64x2_splat(tw1.re); let twiddle1im = f64x2_splat(tw1.im); let twiddle2re = f64x2_splat(tw2.re); let twiddle2im = f64x2_splat(tw2.im); let twiddle3re = f64x2_splat(tw3.re); let twiddle3im = f64x2_splat(tw3.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 7]) -> [v128; 7] { let [x1p6, x1m6] = solo_fft2_f64(values[1], values[6]); let [x2p5, x2m5] = solo_fft2_f64(values[2], values[5]); let [x3p4, x3m4] = solo_fft2_f64(values[3], values[4]); let t_a1_1 = f64x2_mul(self.twiddle1re, x1p6); let t_a1_2 = f64x2_mul(self.twiddle2re, x2p5); let t_a1_3 = f64x2_mul(self.twiddle3re, x3p4); let t_a2_1 
= f64x2_mul(self.twiddle2re, x1p6); let t_a2_2 = f64x2_mul(self.twiddle3re, x2p5); let t_a2_3 = f64x2_mul(self.twiddle1re, x3p4); let t_a3_1 = f64x2_mul(self.twiddle3re, x1p6); let t_a3_2 = f64x2_mul(self.twiddle1re, x2p5); let t_a3_3 = f64x2_mul(self.twiddle2re, x3p4); let t_b1_1 = f64x2_mul(self.twiddle1im, x1m6); let t_b1_2 = f64x2_mul(self.twiddle2im, x2m5); let t_b1_3 = f64x2_mul(self.twiddle3im, x3m4); let t_b2_1 = f64x2_mul(self.twiddle2im, x1m6); let t_b2_2 = f64x2_mul(self.twiddle3im, x2m5); let t_b2_3 = f64x2_mul(self.twiddle1im, x3m4); let t_b3_1 = f64x2_mul(self.twiddle3im, x1m6); let t_b3_2 = f64x2_mul(self.twiddle1im, x2m5); let t_b3_3 = f64x2_mul(self.twiddle2im, x3m4); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3); let t_b2 = calc_f64!(t_b2_1 - t_b2_2 - t_b2_3); let t_b3 = calc_f64!(t_b3_1 - t_b3_2 + t_b3_3); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let y0 = calc_f64!(x0 + x1p6 + x2p5 + x3p4); let [y1, y6] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y5] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y4] = solo_fft2_f64(t_a3, t_b3_rot); [y0, y1, y2, y3, y4, y5, y6] } } // _ _ _________ _ _ _ // / / | |___ /___ \| |__ (_) |_ // | | | _____ |_ \ __) | '_ \| | __| // | | | |_____| ___) / __/| |_) | | |_ // |_|_| |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly11 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly11, 11, |this: &WasmSimdF32Butterfly11<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly11, 11, |this: &WasmSimdF32Butterfly11<_>| this.direction ); impl WasmSimdF32Butterfly11 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 11, direction); let tw2: Complex = twiddles::compute_twiddle(2, 11, direction); let tw3: Complex = twiddles::compute_twiddle(3, 11, direction); let tw4: Complex = twiddles::compute_twiddle(4, 11, direction); let tw5: Complex = twiddles::compute_twiddle(5, 11, direction); let twiddle1re = f32x4_splat(tw1.re); let twiddle1im = f32x4_splat(tw1.im); let twiddle2re = f32x4_splat(tw2.re); let twiddle2im = f32x4_splat(tw2.im); let twiddle3re = f32x4_splat(tw3.re); let twiddle3im = f32x4_splat(tw3.im); let twiddle4re = f32x4_splat(tw4.re); let twiddle4im = f32x4_splat(tw4.im); let twiddle5re = f32x4_splat(tw5.re); let twiddle5im = f32x4_splat(tw5.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl 
WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[5]), extract_hi_lo_f32(input_packed[0], input_packed[6]), extract_lo_hi_f32(input_packed[1], input_packed[6]), extract_hi_lo_f32(input_packed[1], input_packed[7]), extract_lo_hi_f32(input_packed[2], input_packed[7]), extract_hi_lo_f32(input_packed[2], input_packed[8]), extract_lo_hi_f32(input_packed[3], input_packed[8]), extract_hi_lo_f32(input_packed[3], input_packed[9]), extract_lo_hi_f32(input_packed[4], input_packed[9]), extract_hi_lo_f32(input_packed[4], input_packed[10]), extract_lo_hi_f32(input_packed[5], input_packed[10]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_hi_f32(out[10], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 11]) -> [v128; 11] { let [x1p10, x1m10] = parallel_fft2_interleaved_f32(values[1], values[10]); let [x2p9, x2m9] = parallel_fft2_interleaved_f32(values[2], values[9]); let [x3p8, x3m8] = parallel_fft2_interleaved_f32(values[3], values[8]); let [x4p7, x4m7] = parallel_fft2_interleaved_f32(values[4], values[7]); let [x5p6, x5m6] = parallel_fft2_interleaved_f32(values[5], values[6]); let t_a1_1 = f32x4_mul(self.twiddle1re, x1p10); let t_a1_2 = f32x4_mul(self.twiddle2re, x2p9); let t_a1_3 = f32x4_mul(self.twiddle3re, x3p8); let t_a1_4 = f32x4_mul(self.twiddle4re, x4p7); let t_a1_5 = f32x4_mul(self.twiddle5re, x5p6); let t_a2_1 = f32x4_mul(self.twiddle2re, x1p10); let t_a2_2 = f32x4_mul(self.twiddle4re, x2p9); let t_a2_3 = f32x4_mul(self.twiddle5re, x3p8); let t_a2_4 = f32x4_mul(self.twiddle3re, x4p7); let t_a2_5 = f32x4_mul(self.twiddle1re, x5p6); let t_a3_1 = f32x4_mul(self.twiddle3re, x1p10); let t_a3_2 = f32x4_mul(self.twiddle5re, x2p9); let t_a3_3 = f32x4_mul(self.twiddle2re, x3p8); let t_a3_4 = f32x4_mul(self.twiddle1re, x4p7); let t_a3_5 = f32x4_mul(self.twiddle4re, x5p6); let t_a4_1 = f32x4_mul(self.twiddle4re, x1p10); let t_a4_2 = f32x4_mul(self.twiddle3re, x2p9); let t_a4_3 = f32x4_mul(self.twiddle1re, x3p8); let t_a4_4 = f32x4_mul(self.twiddle5re, x4p7); let t_a4_5 = f32x4_mul(self.twiddle2re, x5p6); let t_a5_1 = f32x4_mul(self.twiddle5re, x1p10); let t_a5_2 = f32x4_mul(self.twiddle1re, x2p9); let t_a5_3 = f32x4_mul(self.twiddle4re, x3p8); let t_a5_4 = f32x4_mul(self.twiddle2re, x4p7); let t_a5_5 = f32x4_mul(self.twiddle3re, x5p6); let t_b1_1 = f32x4_mul(self.twiddle1im, x1m10); let t_b1_2 = f32x4_mul(self.twiddle2im, x2m9); let t_b1_3 = f32x4_mul(self.twiddle3im, x3m8); let t_b1_4 = f32x4_mul(self.twiddle4im, x4m7); let t_b1_5 = f32x4_mul(self.twiddle5im, x5m6); let t_b2_1 = f32x4_mul(self.twiddle2im, x1m10); let t_b2_2 = f32x4_mul(self.twiddle4im, x2m9); let t_b2_3 = f32x4_mul(self.twiddle5im, x3m8); let t_b2_4 = f32x4_mul(self.twiddle3im, x4m7); let t_b2_5 = f32x4_mul(self.twiddle1im, x5m6); let t_b3_1 = f32x4_mul(self.twiddle3im, x1m10); let t_b3_2 = f32x4_mul(self.twiddle5im, x2m9); let t_b3_3 = f32x4_mul(self.twiddle2im, x3m8); let t_b3_4 = 
f32x4_mul(self.twiddle1im, x4m7); let t_b3_5 = f32x4_mul(self.twiddle4im, x5m6); let t_b4_1 = f32x4_mul(self.twiddle4im, x1m10); let t_b4_2 = f32x4_mul(self.twiddle3im, x2m9); let t_b4_3 = f32x4_mul(self.twiddle1im, x3m8); let t_b4_4 = f32x4_mul(self.twiddle5im, x4m7); let t_b4_5 = f32x4_mul(self.twiddle2im, x5m6); let t_b5_1 = f32x4_mul(self.twiddle5im, x1m10); let t_b5_2 = f32x4_mul(self.twiddle1im, x2m9); let t_b5_3 = f32x4_mul(self.twiddle4im, x3m8); let t_b5_4 = f32x4_mul(self.twiddle2im, x4m7); let t_b5_5 = f32x4_mul(self.twiddle3im, x5m6); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 - t_b2_3 - t_b2_4 - t_b2_5); let t_b3 = calc_f32!(t_b3_1 - t_b3_2 - t_b3_3 + t_b3_4 + t_b3_5); let t_b4 = calc_f32!(t_b4_1 - t_b4_2 + t_b4_3 + t_b4_4 - t_b4_5); let t_b5 = calc_f32!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 + t_b5_5); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let y0 = calc_f32!(x0 + x1p10 + x2p9 + x3p8 + x4p7 + x5p6); let [y1, y10] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y9] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y8] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y7] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y6] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10] } } // _ _ __ _ _ _ _ _ // / / | / /_ | || | | |__ (_) |_ // | | | _____ | '_ \| || |_| '_ \| | __| // | | | |_____| | (_) |__ _| |_) | | |_ // |_|_| \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly11 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly11, 11, |this: &WasmSimdF64Butterfly11<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly11, 11, |this: &WasmSimdF64Butterfly11<_>| this.direction ); impl WasmSimdF64Butterfly11 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 11, direction); let tw2: Complex = twiddles::compute_twiddle(2, 11, direction); let tw3: Complex = twiddles::compute_twiddle(3, 11, direction); let tw4: Complex = twiddles::compute_twiddle(4, 11, direction); let tw5: Complex = twiddles::compute_twiddle(5, 11, direction); let twiddle1re = f64x2_splat(tw1.re); let twiddle1im = f64x2_splat(tw1.im); let twiddle2re = f64x2_splat(tw2.re); let twiddle2im = f64x2_splat(tw2.im); let twiddle3re = f64x2_splat(tw3.re); let twiddle3im = f64x2_splat(tw3.im); let twiddle4re = f64x2_splat(tw4.re); let twiddle4im = f64x2_splat(tw4.im); let twiddle5re = f64x2_splat(tw5.re); let twiddle5im = f64x2_splat(tw5.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, 
twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 11]) -> [v128; 11] { let [x1p10, x1m10] = solo_fft2_f64(values[1], values[10]); let [x2p9, x2m9] = solo_fft2_f64(values[2], values[9]); let [x3p8, x3m8] = solo_fft2_f64(values[3], values[8]); let [x4p7, x4m7] = solo_fft2_f64(values[4], values[7]); let [x5p6, x5m6] = solo_fft2_f64(values[5], values[6]); let t_a1_1 = f64x2_mul(self.twiddle1re, x1p10); let t_a1_2 = f64x2_mul(self.twiddle2re, x2p9); let t_a1_3 = f64x2_mul(self.twiddle3re, x3p8); let t_a1_4 = f64x2_mul(self.twiddle4re, x4p7); let t_a1_5 = f64x2_mul(self.twiddle5re, x5p6); let t_a2_1 = f64x2_mul(self.twiddle2re, x1p10); let t_a2_2 = f64x2_mul(self.twiddle4re, x2p9); let t_a2_3 = f64x2_mul(self.twiddle5re, x3p8); let t_a2_4 = f64x2_mul(self.twiddle3re, x4p7); let t_a2_5 = f64x2_mul(self.twiddle1re, x5p6); let t_a3_1 = f64x2_mul(self.twiddle3re, x1p10); let t_a3_2 = f64x2_mul(self.twiddle5re, x2p9); let t_a3_3 = f64x2_mul(self.twiddle2re, x3p8); let t_a3_4 = f64x2_mul(self.twiddle1re, x4p7); let t_a3_5 = f64x2_mul(self.twiddle4re, x5p6); let t_a4_1 = f64x2_mul(self.twiddle4re, x1p10); let t_a4_2 = f64x2_mul(self.twiddle3re, x2p9); let t_a4_3 = f64x2_mul(self.twiddle1re, x3p8); let t_a4_4 = f64x2_mul(self.twiddle5re, x4p7); let t_a4_5 = f64x2_mul(self.twiddle2re, x5p6); let t_a5_1 = f64x2_mul(self.twiddle5re, x1p10); let t_a5_2 = f64x2_mul(self.twiddle1re, x2p9); let t_a5_3 = f64x2_mul(self.twiddle4re, x3p8); let t_a5_4 = f64x2_mul(self.twiddle2re, x4p7); let t_a5_5 = f64x2_mul(self.twiddle3re, x5p6); let t_b1_1 = f64x2_mul(self.twiddle1im, x1m10); let t_b1_2 = f64x2_mul(self.twiddle2im, x2m9); let t_b1_3 = f64x2_mul(self.twiddle3im, x3m8); let t_b1_4 = f64x2_mul(self.twiddle4im, x4m7); let t_b1_5 = f64x2_mul(self.twiddle5im, x5m6); let t_b2_1 = f64x2_mul(self.twiddle2im, x1m10); let t_b2_2 = f64x2_mul(self.twiddle4im, x2m9); let t_b2_3 = f64x2_mul(self.twiddle5im, x3m8); let t_b2_4 = f64x2_mul(self.twiddle3im, x4m7); let t_b2_5 = f64x2_mul(self.twiddle1im, x5m6); let t_b3_1 = f64x2_mul(self.twiddle3im, x1m10); let t_b3_2 = f64x2_mul(self.twiddle5im, x2m9); let t_b3_3 = f64x2_mul(self.twiddle2im, x3m8); let t_b3_4 = f64x2_mul(self.twiddle1im, x4m7); let t_b3_5 = f64x2_mul(self.twiddle4im, x5m6); let t_b4_1 = f64x2_mul(self.twiddle4im, x1m10); let t_b4_2 = f64x2_mul(self.twiddle3im, x2m9); let t_b4_3 = f64x2_mul(self.twiddle1im, x3m8); let t_b4_4 = f64x2_mul(self.twiddle5im, x4m7); let t_b4_5 = f64x2_mul(self.twiddle2im, x5m6); let t_b5_1 = f64x2_mul(self.twiddle5im, x1m10); let t_b5_2 = f64x2_mul(self.twiddle1im, x2m9); let t_b5_3 = f64x2_mul(self.twiddle4im, x3m8); let t_b5_4 = f64x2_mul(self.twiddle2im, x4m7); let t_b5_5 = f64x2_mul(self.twiddle3im, x5m6); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5); let t_b1 = 
calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 - t_b2_3 - t_b2_4 - t_b2_5); let t_b3 = calc_f64!(t_b3_1 - t_b3_2 - t_b3_3 + t_b3_4 + t_b3_5); let t_b4 = calc_f64!(t_b4_1 - t_b4_2 + t_b4_3 + t_b4_4 - t_b4_5); let t_b5 = calc_f64!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 + t_b5_5); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let y0 = calc_f64!(x0 + x1p10 + x2p9 + x3p8 + x4p7 + x5p6); let [y1, y10] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y9] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y8] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y7] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y6] = solo_fft2_f64(t_a5, t_b5_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10] } } // _ _____ _________ _ _ _ // / |___ / |___ /___ \| |__ (_) |_ // | | |_ \ _____ |_ \ __) | '_ \| | __| // | |___) | |_____| ___) / __/| |_) | | |_ // |_|____/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly13 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, twiddle6re: v128, twiddle6im: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly13, 13, |this: &WasmSimdF32Butterfly13<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly13, 13, |this: &WasmSimdF32Butterfly13<_>| this.direction ); impl WasmSimdF32Butterfly13 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 13, direction); let tw2: Complex = twiddles::compute_twiddle(2, 13, direction); let tw3: Complex = twiddles::compute_twiddle(3, 13, direction); let tw4: Complex = twiddles::compute_twiddle(4, 13, direction); let tw5: Complex = twiddles::compute_twiddle(5, 13, direction); let tw6: Complex = twiddles::compute_twiddle(6, 13, direction); let twiddle1re = f32x4_splat(tw1.re); let twiddle1im = f32x4_splat(tw1.im); let twiddle2re = f32x4_splat(tw2.re); let twiddle2im = f32x4_splat(tw2.im); let twiddle3re = f32x4_splat(tw3.re); let twiddle3im = f32x4_splat(tw3.im); let twiddle4re = f32x4_splat(tw4.re); let twiddle4im = f32x4_splat(tw4.im); let twiddle5re = f32x4_splat(tw5.re); let twiddle5im = f32x4_splat(tw5.im); let twiddle6re = f32x4_splat(tw6.re); let twiddle6im = f32x4_splat(tw6.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[6]), extract_hi_lo_f32(input_packed[0], input_packed[7]), 
extract_lo_hi_f32(input_packed[1], input_packed[7]), extract_hi_lo_f32(input_packed[1], input_packed[8]), extract_lo_hi_f32(input_packed[2], input_packed[8]), extract_hi_lo_f32(input_packed[2], input_packed[9]), extract_lo_hi_f32(input_packed[3], input_packed[9]), extract_hi_lo_f32(input_packed[3], input_packed[10]), extract_lo_hi_f32(input_packed[4], input_packed[10]), extract_hi_lo_f32(input_packed[4], input_packed[11]), extract_lo_hi_f32(input_packed[5], input_packed[11]), extract_hi_lo_f32(input_packed[5], input_packed[12]), extract_lo_hi_f32(input_packed[6], input_packed[12]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_hi_f32(out[12], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 13]) -> [v128; 13] { let [x1p12, x1m12] = parallel_fft2_interleaved_f32(values[1], values[12]); let [x2p11, x2m11] = parallel_fft2_interleaved_f32(values[2], values[11]); let [x3p10, x3m10] = parallel_fft2_interleaved_f32(values[3], values[10]); let [x4p9, x4m9] = parallel_fft2_interleaved_f32(values[4], values[9]); let [x5p8, x5m8] = parallel_fft2_interleaved_f32(values[5], values[8]); let [x6p7, x6m7] = parallel_fft2_interleaved_f32(values[6], values[7]); let t_a1_1 = f32x4_mul(self.twiddle1re, x1p12); let t_a1_2 = f32x4_mul(self.twiddle2re, x2p11); let t_a1_3 = f32x4_mul(self.twiddle3re, x3p10); let t_a1_4 = f32x4_mul(self.twiddle4re, x4p9); let t_a1_5 = f32x4_mul(self.twiddle5re, x5p8); let t_a1_6 = f32x4_mul(self.twiddle6re, x6p7); let t_a2_1 = f32x4_mul(self.twiddle2re, x1p12); let t_a2_2 = f32x4_mul(self.twiddle4re, x2p11); let t_a2_3 = f32x4_mul(self.twiddle6re, x3p10); let t_a2_4 = f32x4_mul(self.twiddle5re, x4p9); let t_a2_5 = f32x4_mul(self.twiddle3re, x5p8); let t_a2_6 = f32x4_mul(self.twiddle1re, x6p7); let t_a3_1 = f32x4_mul(self.twiddle3re, x1p12); let t_a3_2 = f32x4_mul(self.twiddle6re, x2p11); let t_a3_3 = f32x4_mul(self.twiddle4re, x3p10); let t_a3_4 = f32x4_mul(self.twiddle1re, x4p9); let t_a3_5 = f32x4_mul(self.twiddle2re, x5p8); let t_a3_6 = f32x4_mul(self.twiddle5re, x6p7); let t_a4_1 = f32x4_mul(self.twiddle4re, x1p12); let t_a4_2 = f32x4_mul(self.twiddle5re, x2p11); let t_a4_3 = f32x4_mul(self.twiddle1re, x3p10); let t_a4_4 = f32x4_mul(self.twiddle3re, x4p9); let t_a4_5 = f32x4_mul(self.twiddle6re, x5p8); let t_a4_6 = f32x4_mul(self.twiddle2re, x6p7); let t_a5_1 = f32x4_mul(self.twiddle5re, x1p12); let t_a5_2 = f32x4_mul(self.twiddle3re, x2p11); let t_a5_3 = f32x4_mul(self.twiddle2re, x3p10); let t_a5_4 = f32x4_mul(self.twiddle6re, x4p9); let t_a5_5 = f32x4_mul(self.twiddle1re, x5p8); let t_a5_6 = f32x4_mul(self.twiddle4re, x6p7); let t_a6_1 = f32x4_mul(self.twiddle6re, x1p12); let t_a6_2 = f32x4_mul(self.twiddle1re, x2p11); let t_a6_3 = f32x4_mul(self.twiddle5re, x3p10); let t_a6_4 = f32x4_mul(self.twiddle2re, x4p9); let t_a6_5 = f32x4_mul(self.twiddle4re, x5p8); let t_a6_6 = f32x4_mul(self.twiddle3re, x6p7); let t_b1_1 = f32x4_mul(self.twiddle1im, x1m12); let t_b1_2 = 
f32x4_mul(self.twiddle2im, x2m11); let t_b1_3 = f32x4_mul(self.twiddle3im, x3m10); let t_b1_4 = f32x4_mul(self.twiddle4im, x4m9); let t_b1_5 = f32x4_mul(self.twiddle5im, x5m8); let t_b1_6 = f32x4_mul(self.twiddle6im, x6m7); let t_b2_1 = f32x4_mul(self.twiddle2im, x1m12); let t_b2_2 = f32x4_mul(self.twiddle4im, x2m11); let t_b2_3 = f32x4_mul(self.twiddle6im, x3m10); let t_b2_4 = f32x4_mul(self.twiddle5im, x4m9); let t_b2_5 = f32x4_mul(self.twiddle3im, x5m8); let t_b2_6 = f32x4_mul(self.twiddle1im, x6m7); let t_b3_1 = f32x4_mul(self.twiddle3im, x1m12); let t_b3_2 = f32x4_mul(self.twiddle6im, x2m11); let t_b3_3 = f32x4_mul(self.twiddle4im, x3m10); let t_b3_4 = f32x4_mul(self.twiddle1im, x4m9); let t_b3_5 = f32x4_mul(self.twiddle2im, x5m8); let t_b3_6 = f32x4_mul(self.twiddle5im, x6m7); let t_b4_1 = f32x4_mul(self.twiddle4im, x1m12); let t_b4_2 = f32x4_mul(self.twiddle5im, x2m11); let t_b4_3 = f32x4_mul(self.twiddle1im, x3m10); let t_b4_4 = f32x4_mul(self.twiddle3im, x4m9); let t_b4_5 = f32x4_mul(self.twiddle6im, x5m8); let t_b4_6 = f32x4_mul(self.twiddle2im, x6m7); let t_b5_1 = f32x4_mul(self.twiddle5im, x1m12); let t_b5_2 = f32x4_mul(self.twiddle3im, x2m11); let t_b5_3 = f32x4_mul(self.twiddle2im, x3m10); let t_b5_4 = f32x4_mul(self.twiddle6im, x4m9); let t_b5_5 = f32x4_mul(self.twiddle1im, x5m8); let t_b5_6 = f32x4_mul(self.twiddle4im, x6m7); let t_b6_1 = f32x4_mul(self.twiddle6im, x1m12); let t_b6_2 = f32x4_mul(self.twiddle1im, x2m11); let t_b6_3 = f32x4_mul(self.twiddle5im, x3m10); let t_b6_4 = f32x4_mul(self.twiddle2im, x4m9); let t_b6_5 = f32x4_mul(self.twiddle4im, x5m8); let t_b6_6 = f32x4_mul(self.twiddle3im, x6m7); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 - t_b2_4 - t_b2_5 - t_b2_6); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 + t_b3_5 + t_b3_6); let t_b4 = calc_f32!(t_b4_1 - t_b4_2 - t_b4_3 + t_b4_4 - t_b4_5 - t_b4_6); let t_b5 = calc_f32!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6); let t_b6 = calc_f32!(t_b6_1 - t_b6_2 + t_b6_3 - t_b6_4 + t_b6_5 - t_b6_6); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let y0 = calc_f32!(x0 + x1p12 + x2p11 + x3p10 + x4p9 + x5p8 + x6p7); let [y1, y12] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y11] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y10] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y9] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y8] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y7] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12] } } // _ _____ __ _ _ _ _ _ // / |___ / / /_ | || | | |__ (_) |_ // | | |_ \ _____ | '_ \| || |_| '_ \| | __| // | |___) | |_____| | (_) |__ _| |_) | | |_ // |_|____/ \___/ |_| 
|_.__/|_|\__| // pub struct WasmSimdF64Butterfly13 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, twiddle6re: v128, twiddle6im: v128, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly13, 13, |this: &WasmSimdF64Butterfly13<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly13, 13, |this: &WasmSimdF64Butterfly13<_>| this.direction ); impl WasmSimdF64Butterfly13 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 13, direction); let tw2: Complex = twiddles::compute_twiddle(2, 13, direction); let tw3: Complex = twiddles::compute_twiddle(3, 13, direction); let tw4: Complex = twiddles::compute_twiddle(4, 13, direction); let tw5: Complex = twiddles::compute_twiddle(5, 13, direction); let tw6: Complex = twiddles::compute_twiddle(6, 13, direction); let twiddle1re = f64x2_splat(tw1.re); let twiddle1im = f64x2_splat(tw1.im); let twiddle2re = f64x2_splat(tw2.re); let twiddle2im = f64x2_splat(tw2.im); let twiddle3re = f64x2_splat(tw3.re); let twiddle3im = f64x2_splat(tw3.im); let twiddle4re = f64x2_splat(tw4.re); let twiddle4im = f64x2_splat(tw4.im); let twiddle5re = f64x2_splat(tw5.re); let twiddle5im = f64x2_splat(tw5.im); let twiddle6re = f64x2_splat(tw6.re); let twiddle6im = f64x2_splat(tw6.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 13]) -> [v128; 13] { let [x1p12, x1m12] = solo_fft2_f64(values[1], values[12]); let [x2p11, x2m11] = solo_fft2_f64(values[2], values[11]); let [x3p10, x3m10] = solo_fft2_f64(values[3], values[10]); let [x4p9, x4m9] = solo_fft2_f64(values[4], values[9]); let [x5p8, x5m8] = solo_fft2_f64(values[5], values[8]); let [x6p7, x6m7] = solo_fft2_f64(values[6], values[7]); let t_a1_1 = f64x2_mul(self.twiddle1re, x1p12); let t_a1_2 = f64x2_mul(self.twiddle2re, x2p11); let t_a1_3 = f64x2_mul(self.twiddle3re, x3p10); let t_a1_4 = f64x2_mul(self.twiddle4re, x4p9); let t_a1_5 = f64x2_mul(self.twiddle5re, x5p8); let t_a1_6 = f64x2_mul(self.twiddle6re, x6p7); let t_a2_1 = f64x2_mul(self.twiddle2re, x1p12); let t_a2_2 = f64x2_mul(self.twiddle4re, x2p11); let t_a2_3 = f64x2_mul(self.twiddle6re, x3p10); let t_a2_4 = f64x2_mul(self.twiddle5re, x4p9); let t_a2_5 = f64x2_mul(self.twiddle3re, x5p8); let t_a2_6 = f64x2_mul(self.twiddle1re, x6p7); let t_a3_1 = f64x2_mul(self.twiddle3re, x1p12); let t_a3_2 = f64x2_mul(self.twiddle6re, x2p11); let t_a3_3 = f64x2_mul(self.twiddle4re, x3p10); let t_a3_4 = f64x2_mul(self.twiddle1re, x4p9); let t_a3_5 = f64x2_mul(self.twiddle2re, x5p8); let t_a3_6 = f64x2_mul(self.twiddle5re, x6p7); let t_a4_1 = f64x2_mul(self.twiddle4re, x1p12); let t_a4_2 = f64x2_mul(self.twiddle5re, x2p11); let t_a4_3 = 
f64x2_mul(self.twiddle1re, x3p10); let t_a4_4 = f64x2_mul(self.twiddle3re, x4p9); let t_a4_5 = f64x2_mul(self.twiddle6re, x5p8); let t_a4_6 = f64x2_mul(self.twiddle2re, x6p7); let t_a5_1 = f64x2_mul(self.twiddle5re, x1p12); let t_a5_2 = f64x2_mul(self.twiddle3re, x2p11); let t_a5_3 = f64x2_mul(self.twiddle2re, x3p10); let t_a5_4 = f64x2_mul(self.twiddle6re, x4p9); let t_a5_5 = f64x2_mul(self.twiddle1re, x5p8); let t_a5_6 = f64x2_mul(self.twiddle4re, x6p7); let t_a6_1 = f64x2_mul(self.twiddle6re, x1p12); let t_a6_2 = f64x2_mul(self.twiddle1re, x2p11); let t_a6_3 = f64x2_mul(self.twiddle5re, x3p10); let t_a6_4 = f64x2_mul(self.twiddle2re, x4p9); let t_a6_5 = f64x2_mul(self.twiddle4re, x5p8); let t_a6_6 = f64x2_mul(self.twiddle3re, x6p7); let t_b1_1 = f64x2_mul(self.twiddle1im, x1m12); let t_b1_2 = f64x2_mul(self.twiddle2im, x2m11); let t_b1_3 = f64x2_mul(self.twiddle3im, x3m10); let t_b1_4 = f64x2_mul(self.twiddle4im, x4m9); let t_b1_5 = f64x2_mul(self.twiddle5im, x5m8); let t_b1_6 = f64x2_mul(self.twiddle6im, x6m7); let t_b2_1 = f64x2_mul(self.twiddle2im, x1m12); let t_b2_2 = f64x2_mul(self.twiddle4im, x2m11); let t_b2_3 = f64x2_mul(self.twiddle6im, x3m10); let t_b2_4 = f64x2_mul(self.twiddle5im, x4m9); let t_b2_5 = f64x2_mul(self.twiddle3im, x5m8); let t_b2_6 = f64x2_mul(self.twiddle1im, x6m7); let t_b3_1 = f64x2_mul(self.twiddle3im, x1m12); let t_b3_2 = f64x2_mul(self.twiddle6im, x2m11); let t_b3_3 = f64x2_mul(self.twiddle4im, x3m10); let t_b3_4 = f64x2_mul(self.twiddle1im, x4m9); let t_b3_5 = f64x2_mul(self.twiddle2im, x5m8); let t_b3_6 = f64x2_mul(self.twiddle5im, x6m7); let t_b4_1 = f64x2_mul(self.twiddle4im, x1m12); let t_b4_2 = f64x2_mul(self.twiddle5im, x2m11); let t_b4_3 = f64x2_mul(self.twiddle1im, x3m10); let t_b4_4 = f64x2_mul(self.twiddle3im, x4m9); let t_b4_5 = f64x2_mul(self.twiddle6im, x5m8); let t_b4_6 = f64x2_mul(self.twiddle2im, x6m7); let t_b5_1 = f64x2_mul(self.twiddle5im, x1m12); let t_b5_2 = f64x2_mul(self.twiddle3im, x2m11); let t_b5_3 = f64x2_mul(self.twiddle2im, x3m10); let t_b5_4 = f64x2_mul(self.twiddle6im, x4m9); let t_b5_5 = f64x2_mul(self.twiddle1im, x5m8); let t_b5_6 = f64x2_mul(self.twiddle4im, x6m7); let t_b6_1 = f64x2_mul(self.twiddle6im, x1m12); let t_b6_2 = f64x2_mul(self.twiddle1im, x2m11); let t_b6_3 = f64x2_mul(self.twiddle5im, x3m10); let t_b6_4 = f64x2_mul(self.twiddle2im, x4m9); let t_b6_5 = f64x2_mul(self.twiddle4im, x5m8); let t_b6_6 = f64x2_mul(self.twiddle3im, x6m7); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 - t_b2_4 - t_b2_5 - t_b2_6); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 + t_b3_5 + t_b3_6); let t_b4 = calc_f64!(t_b4_1 - t_b4_2 - t_b4_3 + t_b4_4 - t_b4_5 - t_b4_6); let t_b5 = calc_f64!(t_b5_1 - t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6); let t_b6 = calc_f64!(t_b6_1 - t_b6_2 + t_b6_3 - t_b6_4 + t_b6_5 - t_b6_6); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = 
self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let y0 = calc_f64!(x0 + x1p12 + x2p11 + x3p10 + x4p9 + x5p8 + x6p7); let [y1, y12] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y11] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y10] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y9] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y8] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y7] = solo_fft2_f64(t_a6, t_b6_rot); [y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12] } } // _ _____ _________ _ _ _ // / |___ | |___ /___ \| |__ (_) |_ // | | / / _____ |_ \ __) | '_ \| | __| // | | / / |_____| ___) / __/| |_) | | |_ // |_|/_/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly17 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, twiddle6re: v128, twiddle6im: v128, twiddle7re: v128, twiddle7im: v128, twiddle8re: v128, twiddle8im: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly17, 17, |this: &WasmSimdF32Butterfly17<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly17, 17, |this: &WasmSimdF32Butterfly17<_>| this.direction ); impl WasmSimdF32Butterfly17 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 17, direction); let tw2: Complex = twiddles::compute_twiddle(2, 17, direction); let tw3: Complex = twiddles::compute_twiddle(3, 17, direction); let tw4: Complex = twiddles::compute_twiddle(4, 17, direction); let tw5: Complex = twiddles::compute_twiddle(5, 17, direction); let tw6: Complex = twiddles::compute_twiddle(6, 17, direction); let tw7: Complex = twiddles::compute_twiddle(7, 17, direction); let tw8: Complex = twiddles::compute_twiddle(8, 17, direction); let twiddle1re = f32x4_splat(tw1.re); let twiddle1im = f32x4_splat(tw1.im); let twiddle2re = f32x4_splat(tw2.re); let twiddle2im = f32x4_splat(tw2.im); let twiddle3re = f32x4_splat(tw3.re); let twiddle3im = f32x4_splat(tw3.im); let twiddle4re = f32x4_splat(tw4.re); let twiddle4im = f32x4_splat(tw4.im); let twiddle5re = f32x4_splat(tw5.re); let twiddle5im = f32x4_splat(tw5.im); let twiddle6re = f32x4_splat(tw6.re); let twiddle6im = f32x4_splat(tw6.im); let twiddle7re = f32x4_splat(tw7.re); let twiddle7im = f32x4_splat(tw7.im); let twiddle8re = f32x4_splat(tw8.re); let twiddle8im = f32x4_splat(tw8.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32}); let values = [ extract_lo_hi_f32(input_packed[0], 
input_packed[8]), extract_hi_lo_f32(input_packed[0], input_packed[9]), extract_lo_hi_f32(input_packed[1], input_packed[9]), extract_hi_lo_f32(input_packed[1], input_packed[10]), extract_lo_hi_f32(input_packed[2], input_packed[10]), extract_hi_lo_f32(input_packed[2], input_packed[11]), extract_lo_hi_f32(input_packed[3], input_packed[11]), extract_hi_lo_f32(input_packed[3], input_packed[12]), extract_lo_hi_f32(input_packed[4], input_packed[12]), extract_hi_lo_f32(input_packed[4], input_packed[13]), extract_lo_hi_f32(input_packed[5], input_packed[13]), extract_hi_lo_f32(input_packed[5], input_packed[14]), extract_lo_hi_f32(input_packed[6], input_packed[14]), extract_hi_lo_f32(input_packed[6], input_packed[15]), extract_lo_hi_f32(input_packed[7], input_packed[15]), extract_hi_lo_f32(input_packed[7], input_packed[16]), extract_lo_hi_f32(input_packed[8], input_packed[16]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_hi_f32(out[16], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 17]) -> [v128; 17] { let [x1p16, x1m16] = parallel_fft2_interleaved_f32(values[1], values[16]); let [x2p15, x2m15] = parallel_fft2_interleaved_f32(values[2], values[15]); let [x3p14, x3m14] = parallel_fft2_interleaved_f32(values[3], values[14]); let [x4p13, x4m13] = parallel_fft2_interleaved_f32(values[4], values[13]); let [x5p12, x5m12] = parallel_fft2_interleaved_f32(values[5], values[12]); let [x6p11, x6m11] = parallel_fft2_interleaved_f32(values[6], values[11]); let [x7p10, x7m10] = parallel_fft2_interleaved_f32(values[7], values[10]); let [x8p9, x8m9] = parallel_fft2_interleaved_f32(values[8], values[9]); let t_a1_1 = f32x4_mul(self.twiddle1re, x1p16); let t_a1_2 = f32x4_mul(self.twiddle2re, x2p15); let t_a1_3 = f32x4_mul(self.twiddle3re, x3p14); let t_a1_4 = f32x4_mul(self.twiddle4re, x4p13); let t_a1_5 = f32x4_mul(self.twiddle5re, x5p12); let t_a1_6 = f32x4_mul(self.twiddle6re, x6p11); let t_a1_7 = f32x4_mul(self.twiddle7re, x7p10); let t_a1_8 = f32x4_mul(self.twiddle8re, x8p9); let t_a2_1 = f32x4_mul(self.twiddle2re, x1p16); let t_a2_2 = f32x4_mul(self.twiddle4re, x2p15); let t_a2_3 = f32x4_mul(self.twiddle6re, x3p14); let t_a2_4 = f32x4_mul(self.twiddle8re, x4p13); let t_a2_5 = f32x4_mul(self.twiddle7re, x5p12); let t_a2_6 = f32x4_mul(self.twiddle5re, x6p11); let t_a2_7 = f32x4_mul(self.twiddle3re, x7p10); let t_a2_8 = f32x4_mul(self.twiddle1re, x8p9); let t_a3_1 = f32x4_mul(self.twiddle3re, x1p16); let t_a3_2 = f32x4_mul(self.twiddle6re, x2p15); let t_a3_3 = f32x4_mul(self.twiddle8re, x3p14); let t_a3_4 = f32x4_mul(self.twiddle5re, x4p13); let t_a3_5 = f32x4_mul(self.twiddle2re, x5p12); let t_a3_6 = f32x4_mul(self.twiddle1re, x6p11); let t_a3_7 = f32x4_mul(self.twiddle4re, x7p10); let t_a3_8 = f32x4_mul(self.twiddle7re, x8p9); let t_a4_1 = 
f32x4_mul(self.twiddle4re, x1p16); let t_a4_2 = f32x4_mul(self.twiddle8re, x2p15); let t_a4_3 = f32x4_mul(self.twiddle5re, x3p14); let t_a4_4 = f32x4_mul(self.twiddle1re, x4p13); let t_a4_5 = f32x4_mul(self.twiddle3re, x5p12); let t_a4_6 = f32x4_mul(self.twiddle7re, x6p11); let t_a4_7 = f32x4_mul(self.twiddle6re, x7p10); let t_a4_8 = f32x4_mul(self.twiddle2re, x8p9); let t_a5_1 = f32x4_mul(self.twiddle5re, x1p16); let t_a5_2 = f32x4_mul(self.twiddle7re, x2p15); let t_a5_3 = f32x4_mul(self.twiddle2re, x3p14); let t_a5_4 = f32x4_mul(self.twiddle3re, x4p13); let t_a5_5 = f32x4_mul(self.twiddle8re, x5p12); let t_a5_6 = f32x4_mul(self.twiddle4re, x6p11); let t_a5_7 = f32x4_mul(self.twiddle1re, x7p10); let t_a5_8 = f32x4_mul(self.twiddle6re, x8p9); let t_a6_1 = f32x4_mul(self.twiddle6re, x1p16); let t_a6_2 = f32x4_mul(self.twiddle5re, x2p15); let t_a6_3 = f32x4_mul(self.twiddle1re, x3p14); let t_a6_4 = f32x4_mul(self.twiddle7re, x4p13); let t_a6_5 = f32x4_mul(self.twiddle4re, x5p12); let t_a6_6 = f32x4_mul(self.twiddle2re, x6p11); let t_a6_7 = f32x4_mul(self.twiddle8re, x7p10); let t_a6_8 = f32x4_mul(self.twiddle3re, x8p9); let t_a7_1 = f32x4_mul(self.twiddle7re, x1p16); let t_a7_2 = f32x4_mul(self.twiddle3re, x2p15); let t_a7_3 = f32x4_mul(self.twiddle4re, x3p14); let t_a7_4 = f32x4_mul(self.twiddle6re, x4p13); let t_a7_5 = f32x4_mul(self.twiddle1re, x5p12); let t_a7_6 = f32x4_mul(self.twiddle8re, x6p11); let t_a7_7 = f32x4_mul(self.twiddle2re, x7p10); let t_a7_8 = f32x4_mul(self.twiddle5re, x8p9); let t_a8_1 = f32x4_mul(self.twiddle8re, x1p16); let t_a8_2 = f32x4_mul(self.twiddle1re, x2p15); let t_a8_3 = f32x4_mul(self.twiddle7re, x3p14); let t_a8_4 = f32x4_mul(self.twiddle2re, x4p13); let t_a8_5 = f32x4_mul(self.twiddle6re, x5p12); let t_a8_6 = f32x4_mul(self.twiddle3re, x6p11); let t_a8_7 = f32x4_mul(self.twiddle5re, x7p10); let t_a8_8 = f32x4_mul(self.twiddle4re, x8p9); let t_b1_1 = f32x4_mul(self.twiddle1im, x1m16); let t_b1_2 = f32x4_mul(self.twiddle2im, x2m15); let t_b1_3 = f32x4_mul(self.twiddle3im, x3m14); let t_b1_4 = f32x4_mul(self.twiddle4im, x4m13); let t_b1_5 = f32x4_mul(self.twiddle5im, x5m12); let t_b1_6 = f32x4_mul(self.twiddle6im, x6m11); let t_b1_7 = f32x4_mul(self.twiddle7im, x7m10); let t_b1_8 = f32x4_mul(self.twiddle8im, x8m9); let t_b2_1 = f32x4_mul(self.twiddle2im, x1m16); let t_b2_2 = f32x4_mul(self.twiddle4im, x2m15); let t_b2_3 = f32x4_mul(self.twiddle6im, x3m14); let t_b2_4 = f32x4_mul(self.twiddle8im, x4m13); let t_b2_5 = f32x4_mul(self.twiddle7im, x5m12); let t_b2_6 = f32x4_mul(self.twiddle5im, x6m11); let t_b2_7 = f32x4_mul(self.twiddle3im, x7m10); let t_b2_8 = f32x4_mul(self.twiddle1im, x8m9); let t_b3_1 = f32x4_mul(self.twiddle3im, x1m16); let t_b3_2 = f32x4_mul(self.twiddle6im, x2m15); let t_b3_3 = f32x4_mul(self.twiddle8im, x3m14); let t_b3_4 = f32x4_mul(self.twiddle5im, x4m13); let t_b3_5 = f32x4_mul(self.twiddle2im, x5m12); let t_b3_6 = f32x4_mul(self.twiddle1im, x6m11); let t_b3_7 = f32x4_mul(self.twiddle4im, x7m10); let t_b3_8 = f32x4_mul(self.twiddle7im, x8m9); let t_b4_1 = f32x4_mul(self.twiddle4im, x1m16); let t_b4_2 = f32x4_mul(self.twiddle8im, x2m15); let t_b4_3 = f32x4_mul(self.twiddle5im, x3m14); let t_b4_4 = f32x4_mul(self.twiddle1im, x4m13); let t_b4_5 = f32x4_mul(self.twiddle3im, x5m12); let t_b4_6 = f32x4_mul(self.twiddle7im, x6m11); let t_b4_7 = f32x4_mul(self.twiddle6im, x7m10); let t_b4_8 = f32x4_mul(self.twiddle2im, x8m9); let t_b5_1 = f32x4_mul(self.twiddle5im, x1m16); let t_b5_2 = f32x4_mul(self.twiddle7im, x2m15); let t_b5_3 = 
f32x4_mul(self.twiddle2im, x3m14); let t_b5_4 = f32x4_mul(self.twiddle3im, x4m13); let t_b5_5 = f32x4_mul(self.twiddle8im, x5m12); let t_b5_6 = f32x4_mul(self.twiddle4im, x6m11); let t_b5_7 = f32x4_mul(self.twiddle1im, x7m10); let t_b5_8 = f32x4_mul(self.twiddle6im, x8m9); let t_b6_1 = f32x4_mul(self.twiddle6im, x1m16); let t_b6_2 = f32x4_mul(self.twiddle5im, x2m15); let t_b6_3 = f32x4_mul(self.twiddle1im, x3m14); let t_b6_4 = f32x4_mul(self.twiddle7im, x4m13); let t_b6_5 = f32x4_mul(self.twiddle4im, x5m12); let t_b6_6 = f32x4_mul(self.twiddle2im, x6m11); let t_b6_7 = f32x4_mul(self.twiddle8im, x7m10); let t_b6_8 = f32x4_mul(self.twiddle3im, x8m9); let t_b7_1 = f32x4_mul(self.twiddle7im, x1m16); let t_b7_2 = f32x4_mul(self.twiddle3im, x2m15); let t_b7_3 = f32x4_mul(self.twiddle4im, x3m14); let t_b7_4 = f32x4_mul(self.twiddle6im, x4m13); let t_b7_5 = f32x4_mul(self.twiddle1im, x5m12); let t_b7_6 = f32x4_mul(self.twiddle8im, x6m11); let t_b7_7 = f32x4_mul(self.twiddle2im, x7m10); let t_b7_8 = f32x4_mul(self.twiddle5im, x8m9); let t_b8_1 = f32x4_mul(self.twiddle8im, x1m16); let t_b8_2 = f32x4_mul(self.twiddle1im, x2m15); let t_b8_3 = f32x4_mul(self.twiddle7im, x3m14); let t_b8_4 = f32x4_mul(self.twiddle2im, x4m13); let t_b8_5 = f32x4_mul(self.twiddle6im, x5m12); let t_b8_6 = f32x4_mul(self.twiddle3im, x6m11); let t_b8_7 = f32x4_mul(self.twiddle5im, x7m10); let t_b8_8 = f32x4_mul(self.twiddle4im, x8m9); let x0 = values[0]; let t_a1 = calc_f32!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8); let t_a2 = calc_f32!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8); let t_a3 = calc_f32!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8); let t_a4 = calc_f32!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8); let t_a5 = calc_f32!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8); let t_a6 = calc_f32!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8); let t_a7 = calc_f32!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8); let t_a8 = calc_f32!(x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8); let t_b1 = calc_f32!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8); let t_b2 = calc_f32!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8); let t_b3 = calc_f32!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 - t_b3_5 + t_b3_6 + t_b3_7 + t_b3_8); let t_b4 = calc_f32!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 - t_b4_7 - t_b4_8); let t_b5 = calc_f32!(t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8); let t_b6 = calc_f32!(t_b6_1 - t_b6_2 + t_b6_3 + t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8); let t_b7 = calc_f32!(t_b7_1 - t_b7_2 + t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 + t_b7_8); let t_b8 = calc_f32!(t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 + t_b8_7 - t_b8_8); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let y0 = calc_f32!(x0 + x1p16 + x2p15 + x3p14 + x4p13 + x5p12 + x6p11 + x7p10 + x8p9); let [y1, y16] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y15] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); 
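        // Descriptive note on the recombination step, inferred from the statements above:
        // t_a_k = x0 + sum_j Re(w^{kj}) * (x_j + x_{17-j}) and t_b_k = sum_j Im(w^{kj}) * (x_j - x_{17-j}),
        // so each remaining output pair (y_k, y_{17-k}) falls out of a 2-point butterfly of t_a_k
        // with the quarter-turn-rotated (i.e. multiplied by +/-i) difference term t_b_k_rot.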
let [y3, y14] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot);
        let [y4, y13] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot);
        let [y5, y12] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot);
        let [y6, y11] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot);
        let [y7, y10] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot);
        let [y8, y9] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot);

        [
            y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16,
        ]
    }
}

// _ _____ __ _ _ _ _ _
// / |___ | / /_ | || | | |__ (_) |_
// | | / / _____ | '_ \| || |_| '_ \| | __|
// | | / / |_____| | (_) |__ _| |_) | | |_
// |_|/_/ \___/ |_| |_.__/|_|\__|
//

pub struct WasmSimdF64Butterfly17<T> {
    direction: FftDirection,
    _phantom: std::marker::PhantomData<T>,
    rotate: Rotate90F64,
    twiddle1re: v128,
    twiddle1im: v128,
    twiddle2re: v128,
    twiddle2im: v128,
    twiddle3re: v128,
    twiddle3im: v128,
    twiddle4re: v128,
    twiddle4im: v128,
    twiddle5re: v128,
    twiddle5im: v128,
    twiddle6re: v128,
    twiddle6im: v128,
    twiddle7re: v128,
    twiddle7im: v128,
    twiddle8re: v128,
    twiddle8im: v128,
}

boilerplate_fft_wasm_simd_f64_butterfly!(
    WasmSimdF64Butterfly17,
    17,
    |this: &WasmSimdF64Butterfly17<_>| this.direction
);
boilerplate_fft_wasm_simd_common_butterfly!(
    WasmSimdF64Butterfly17,
    17,
    |this: &WasmSimdF64Butterfly17<_>| this.direction
);
impl<T: FftNum> WasmSimdF64Butterfly17<T> {
    #[inline(always)]
    pub fn new(direction: FftDirection) -> Self {
        assert_f64::<T>();
        let rotate = Rotate90F64::new(true);
        let tw1: Complex<f64> = twiddles::compute_twiddle(1, 17, direction);
        let tw2: Complex<f64> = twiddles::compute_twiddle(2, 17, direction);
        let tw3: Complex<f64> = twiddles::compute_twiddle(3, 17, direction);
        let tw4: Complex<f64> = twiddles::compute_twiddle(4, 17, direction);
        let tw5: Complex<f64> = twiddles::compute_twiddle(5, 17, direction);
        let tw6: Complex<f64> = twiddles::compute_twiddle(6, 17, direction);
        let tw7: Complex<f64> = twiddles::compute_twiddle(7, 17, direction);
        let tw8: Complex<f64> = twiddles::compute_twiddle(8, 17, direction);
        let twiddle1re = f64x2_splat(tw1.re);
        let twiddle1im = f64x2_splat(tw1.im);
        let twiddle2re = f64x2_splat(tw2.re);
        let twiddle2im = f64x2_splat(tw2.im);
        let twiddle3re = f64x2_splat(tw3.re);
        let twiddle3im = f64x2_splat(tw3.im);
        let twiddle4re = f64x2_splat(tw4.re);
        let twiddle4im = f64x2_splat(tw4.im);
        let twiddle5re = f64x2_splat(tw5.re);
        let twiddle5im = f64x2_splat(tw5.im);
        let twiddle6re = f64x2_splat(tw6.re);
        let twiddle6im = f64x2_splat(tw6.im);
        let twiddle7re = f64x2_splat(tw7.re);
        let twiddle7im = f64x2_splat(tw7.im);
        let twiddle8re = f64x2_splat(tw8.re);
        let twiddle8im = f64x2_splat(tw8.im);
        Self {
            direction,
            _phantom: std::marker::PhantomData,
            rotate,
            twiddle1re,
            twiddle1im,
            twiddle2re,
            twiddle2im,
            twiddle3re,
            twiddle3im,
            twiddle4re,
            twiddle4im,
            twiddle5re,
            twiddle5im,
            twiddle6re,
            twiddle6im,
            twiddle7re,
            twiddle7im,
            twiddle8re,
            twiddle8im,
        }
    }

    #[inline(always)]
    pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut<f64>) {
        let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16});
        let out = self.perform_fft_direct(values);
        write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16});
    }

    #[inline(always)]
    pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 17]) -> [v128; 17] {
        let [x1p16, x1m16] = solo_fft2_f64(values[1], values[16]);
        let [x2p15, x2m15] = solo_fft2_f64(values[2], values[15]);
        let [x3p14, x3m14] = solo_fft2_f64(values[3], values[14]);
        let [x4p13, x4m13] = solo_fft2_f64(values[4], values[13]);
        let [x5p12, x5m12] =
solo_fft2_f64(values[5], values[12]); let [x6p11, x6m11] = solo_fft2_f64(values[6], values[11]); let [x7p10, x7m10] = solo_fft2_f64(values[7], values[10]); let [x8p9, x8m9] = solo_fft2_f64(values[8], values[9]); let t_a1_1 = f64x2_mul(self.twiddle1re, x1p16); let t_a1_2 = f64x2_mul(self.twiddle2re, x2p15); let t_a1_3 = f64x2_mul(self.twiddle3re, x3p14); let t_a1_4 = f64x2_mul(self.twiddle4re, x4p13); let t_a1_5 = f64x2_mul(self.twiddle5re, x5p12); let t_a1_6 = f64x2_mul(self.twiddle6re, x6p11); let t_a1_7 = f64x2_mul(self.twiddle7re, x7p10); let t_a1_8 = f64x2_mul(self.twiddle8re, x8p9); let t_a2_1 = f64x2_mul(self.twiddle2re, x1p16); let t_a2_2 = f64x2_mul(self.twiddle4re, x2p15); let t_a2_3 = f64x2_mul(self.twiddle6re, x3p14); let t_a2_4 = f64x2_mul(self.twiddle8re, x4p13); let t_a2_5 = f64x2_mul(self.twiddle7re, x5p12); let t_a2_6 = f64x2_mul(self.twiddle5re, x6p11); let t_a2_7 = f64x2_mul(self.twiddle3re, x7p10); let t_a2_8 = f64x2_mul(self.twiddle1re, x8p9); let t_a3_1 = f64x2_mul(self.twiddle3re, x1p16); let t_a3_2 = f64x2_mul(self.twiddle6re, x2p15); let t_a3_3 = f64x2_mul(self.twiddle8re, x3p14); let t_a3_4 = f64x2_mul(self.twiddle5re, x4p13); let t_a3_5 = f64x2_mul(self.twiddle2re, x5p12); let t_a3_6 = f64x2_mul(self.twiddle1re, x6p11); let t_a3_7 = f64x2_mul(self.twiddle4re, x7p10); let t_a3_8 = f64x2_mul(self.twiddle7re, x8p9); let t_a4_1 = f64x2_mul(self.twiddle4re, x1p16); let t_a4_2 = f64x2_mul(self.twiddle8re, x2p15); let t_a4_3 = f64x2_mul(self.twiddle5re, x3p14); let t_a4_4 = f64x2_mul(self.twiddle1re, x4p13); let t_a4_5 = f64x2_mul(self.twiddle3re, x5p12); let t_a4_6 = f64x2_mul(self.twiddle7re, x6p11); let t_a4_7 = f64x2_mul(self.twiddle6re, x7p10); let t_a4_8 = f64x2_mul(self.twiddle2re, x8p9); let t_a5_1 = f64x2_mul(self.twiddle5re, x1p16); let t_a5_2 = f64x2_mul(self.twiddle7re, x2p15); let t_a5_3 = f64x2_mul(self.twiddle2re, x3p14); let t_a5_4 = f64x2_mul(self.twiddle3re, x4p13); let t_a5_5 = f64x2_mul(self.twiddle8re, x5p12); let t_a5_6 = f64x2_mul(self.twiddle4re, x6p11); let t_a5_7 = f64x2_mul(self.twiddle1re, x7p10); let t_a5_8 = f64x2_mul(self.twiddle6re, x8p9); let t_a6_1 = f64x2_mul(self.twiddle6re, x1p16); let t_a6_2 = f64x2_mul(self.twiddle5re, x2p15); let t_a6_3 = f64x2_mul(self.twiddle1re, x3p14); let t_a6_4 = f64x2_mul(self.twiddle7re, x4p13); let t_a6_5 = f64x2_mul(self.twiddle4re, x5p12); let t_a6_6 = f64x2_mul(self.twiddle2re, x6p11); let t_a6_7 = f64x2_mul(self.twiddle8re, x7p10); let t_a6_8 = f64x2_mul(self.twiddle3re, x8p9); let t_a7_1 = f64x2_mul(self.twiddle7re, x1p16); let t_a7_2 = f64x2_mul(self.twiddle3re, x2p15); let t_a7_3 = f64x2_mul(self.twiddle4re, x3p14); let t_a7_4 = f64x2_mul(self.twiddle6re, x4p13); let t_a7_5 = f64x2_mul(self.twiddle1re, x5p12); let t_a7_6 = f64x2_mul(self.twiddle8re, x6p11); let t_a7_7 = f64x2_mul(self.twiddle2re, x7p10); let t_a7_8 = f64x2_mul(self.twiddle5re, x8p9); let t_a8_1 = f64x2_mul(self.twiddle8re, x1p16); let t_a8_2 = f64x2_mul(self.twiddle1re, x2p15); let t_a8_3 = f64x2_mul(self.twiddle7re, x3p14); let t_a8_4 = f64x2_mul(self.twiddle2re, x4p13); let t_a8_5 = f64x2_mul(self.twiddle6re, x5p12); let t_a8_6 = f64x2_mul(self.twiddle3re, x6p11); let t_a8_7 = f64x2_mul(self.twiddle5re, x7p10); let t_a8_8 = f64x2_mul(self.twiddle4re, x8p9); let t_b1_1 = f64x2_mul(self.twiddle1im, x1m16); let t_b1_2 = f64x2_mul(self.twiddle2im, x2m15); let t_b1_3 = f64x2_mul(self.twiddle3im, x3m14); let t_b1_4 = f64x2_mul(self.twiddle4im, x4m13); let t_b1_5 = f64x2_mul(self.twiddle5im, x5m12); let t_b1_6 = 
f64x2_mul(self.twiddle6im, x6m11); let t_b1_7 = f64x2_mul(self.twiddle7im, x7m10); let t_b1_8 = f64x2_mul(self.twiddle8im, x8m9); let t_b2_1 = f64x2_mul(self.twiddle2im, x1m16); let t_b2_2 = f64x2_mul(self.twiddle4im, x2m15); let t_b2_3 = f64x2_mul(self.twiddle6im, x3m14); let t_b2_4 = f64x2_mul(self.twiddle8im, x4m13); let t_b2_5 = f64x2_mul(self.twiddle7im, x5m12); let t_b2_6 = f64x2_mul(self.twiddle5im, x6m11); let t_b2_7 = f64x2_mul(self.twiddle3im, x7m10); let t_b2_8 = f64x2_mul(self.twiddle1im, x8m9); let t_b3_1 = f64x2_mul(self.twiddle3im, x1m16); let t_b3_2 = f64x2_mul(self.twiddle6im, x2m15); let t_b3_3 = f64x2_mul(self.twiddle8im, x3m14); let t_b3_4 = f64x2_mul(self.twiddle5im, x4m13); let t_b3_5 = f64x2_mul(self.twiddle2im, x5m12); let t_b3_6 = f64x2_mul(self.twiddle1im, x6m11); let t_b3_7 = f64x2_mul(self.twiddle4im, x7m10); let t_b3_8 = f64x2_mul(self.twiddle7im, x8m9); let t_b4_1 = f64x2_mul(self.twiddle4im, x1m16); let t_b4_2 = f64x2_mul(self.twiddle8im, x2m15); let t_b4_3 = f64x2_mul(self.twiddle5im, x3m14); let t_b4_4 = f64x2_mul(self.twiddle1im, x4m13); let t_b4_5 = f64x2_mul(self.twiddle3im, x5m12); let t_b4_6 = f64x2_mul(self.twiddle7im, x6m11); let t_b4_7 = f64x2_mul(self.twiddle6im, x7m10); let t_b4_8 = f64x2_mul(self.twiddle2im, x8m9); let t_b5_1 = f64x2_mul(self.twiddle5im, x1m16); let t_b5_2 = f64x2_mul(self.twiddle7im, x2m15); let t_b5_3 = f64x2_mul(self.twiddle2im, x3m14); let t_b5_4 = f64x2_mul(self.twiddle3im, x4m13); let t_b5_5 = f64x2_mul(self.twiddle8im, x5m12); let t_b5_6 = f64x2_mul(self.twiddle4im, x6m11); let t_b5_7 = f64x2_mul(self.twiddle1im, x7m10); let t_b5_8 = f64x2_mul(self.twiddle6im, x8m9); let t_b6_1 = f64x2_mul(self.twiddle6im, x1m16); let t_b6_2 = f64x2_mul(self.twiddle5im, x2m15); let t_b6_3 = f64x2_mul(self.twiddle1im, x3m14); let t_b6_4 = f64x2_mul(self.twiddle7im, x4m13); let t_b6_5 = f64x2_mul(self.twiddle4im, x5m12); let t_b6_6 = f64x2_mul(self.twiddle2im, x6m11); let t_b6_7 = f64x2_mul(self.twiddle8im, x7m10); let t_b6_8 = f64x2_mul(self.twiddle3im, x8m9); let t_b7_1 = f64x2_mul(self.twiddle7im, x1m16); let t_b7_2 = f64x2_mul(self.twiddle3im, x2m15); let t_b7_3 = f64x2_mul(self.twiddle4im, x3m14); let t_b7_4 = f64x2_mul(self.twiddle6im, x4m13); let t_b7_5 = f64x2_mul(self.twiddle1im, x5m12); let t_b7_6 = f64x2_mul(self.twiddle8im, x6m11); let t_b7_7 = f64x2_mul(self.twiddle2im, x7m10); let t_b7_8 = f64x2_mul(self.twiddle5im, x8m9); let t_b8_1 = f64x2_mul(self.twiddle8im, x1m16); let t_b8_2 = f64x2_mul(self.twiddle1im, x2m15); let t_b8_3 = f64x2_mul(self.twiddle7im, x3m14); let t_b8_4 = f64x2_mul(self.twiddle2im, x4m13); let t_b8_5 = f64x2_mul(self.twiddle6im, x5m12); let t_b8_6 = f64x2_mul(self.twiddle3im, x6m11); let t_b8_7 = f64x2_mul(self.twiddle5im, x7m10); let t_b8_8 = f64x2_mul(self.twiddle4im, x8m9); let x0 = values[0]; let t_a1 = calc_f64!(x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8); let t_a2 = calc_f64!(x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8); let t_a3 = calc_f64!(x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8); let t_a4 = calc_f64!(x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8); let t_a5 = calc_f64!(x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8); let t_a6 = calc_f64!(x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8); let t_a7 = calc_f64!(x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8); let t_a8 = calc_f64!(x0 + t_a8_1 + 
t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8); let t_b1 = calc_f64!(t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8); let t_b2 = calc_f64!(t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8); let t_b3 = calc_f64!(t_b3_1 + t_b3_2 - t_b3_3 - t_b3_4 - t_b3_5 + t_b3_6 + t_b3_7 + t_b3_8); let t_b4 = calc_f64!(t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 - t_b4_7 - t_b4_8); let t_b5 = calc_f64!(t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8); let t_b6 = calc_f64!(t_b6_1 - t_b6_2 + t_b6_3 + t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8); let t_b7 = calc_f64!(t_b7_1 - t_b7_2 + t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 + t_b7_8); let t_b8 = calc_f64!(t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 + t_b8_7 - t_b8_8); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let y0 = calc_f64!(x0 + x1p16 + x2p15 + x3p14 + x4p13 + x5p12 + x6p11 + x7p10 + x8p9); let [y1, y16] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y15] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y14] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y13] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y12] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y11] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y10] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y9] = solo_fft2_f64(t_a8, t_b8_rot); [ y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, ] } } // _ ___ _________ _ _ _ // / |/ _ \ |___ /___ \| |__ (_) |_ // | | (_) | _____ |_ \ __) | '_ \| | __| // | |\__, | |_____| ___) / __/| |_) | | |_ // |_| /_/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly19 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, twiddle6re: v128, twiddle6im: v128, twiddle7re: v128, twiddle7im: v128, twiddle8re: v128, twiddle8im: v128, twiddle9re: v128, twiddle9im: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly19, 19, |this: &WasmSimdF32Butterfly19<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly19, 19, |this: &WasmSimdF32Butterfly19<_>| this.direction ); impl WasmSimdF32Butterfly19 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 19, direction); let tw2: Complex = twiddles::compute_twiddle(2, 19, direction); let tw3: Complex = twiddles::compute_twiddle(3, 19, direction); let tw4: Complex = twiddles::compute_twiddle(4, 19, direction); let tw5: Complex = twiddles::compute_twiddle(5, 19, direction); let tw6: Complex = twiddles::compute_twiddle(6, 19, direction); let tw7: Complex = twiddles::compute_twiddle(7, 19, direction); let tw8: Complex = twiddles::compute_twiddle(8, 19, direction); let tw9: Complex = twiddles::compute_twiddle(9, 19, direction); let twiddle1re = f32x4_splat(tw1.re); let twiddle1im = f32x4_splat(tw1.im); let twiddle2re = f32x4_splat(tw2.re); let twiddle2im = f32x4_splat(tw2.im); let twiddle3re = f32x4_splat(tw3.re); let twiddle3im = f32x4_splat(tw3.im); let twiddle4re = f32x4_splat(tw4.re); let 
twiddle4im = f32x4_splat(tw4.im); let twiddle5re = f32x4_splat(tw5.re); let twiddle5im = f32x4_splat(tw5.im); let twiddle6re = f32x4_splat(tw6.re); let twiddle6im = f32x4_splat(tw6.im); let twiddle7re = f32x4_splat(tw7.re); let twiddle7im = f32x4_splat(tw7.im); let twiddle8re = f32x4_splat(tw8.re); let twiddle8im = f32x4_splat(tw8.im); let twiddle9re = f32x4_splat(tw9.re); let twiddle9im = f32x4_splat(tw9.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[9]), extract_hi_lo_f32(input_packed[0], input_packed[10]), extract_lo_hi_f32(input_packed[1], input_packed[10]), extract_hi_lo_f32(input_packed[1], input_packed[11]), extract_lo_hi_f32(input_packed[2], input_packed[11]), extract_hi_lo_f32(input_packed[2], input_packed[12]), extract_lo_hi_f32(input_packed[3], input_packed[12]), extract_hi_lo_f32(input_packed[3], input_packed[13]), extract_lo_hi_f32(input_packed[4], input_packed[13]), extract_hi_lo_f32(input_packed[4], input_packed[14]), extract_lo_hi_f32(input_packed[5], input_packed[14]), extract_hi_lo_f32(input_packed[5], input_packed[15]), extract_lo_hi_f32(input_packed[6], input_packed[15]), extract_hi_lo_f32(input_packed[6], input_packed[16]), extract_lo_hi_f32(input_packed[7], input_packed[16]), extract_hi_lo_f32(input_packed[7], input_packed[17]), extract_lo_hi_f32(input_packed[8], input_packed[17]), extract_hi_lo_f32(input_packed[8], input_packed[18]), extract_lo_hi_f32(input_packed[9], input_packed[18]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_lo_f32(out[16], out[17]), extract_lo_hi_f32(out[18], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 19]) -> [v128; 19] { let [x1p18, x1m18] = parallel_fft2_interleaved_f32(values[1], values[18]); let [x2p17, x2m17] = parallel_fft2_interleaved_f32(values[2], values[17]); let [x3p16, x3m16] = 
parallel_fft2_interleaved_f32(values[3], values[16]); let [x4p15, x4m15] = parallel_fft2_interleaved_f32(values[4], values[15]); let [x5p14, x5m14] = parallel_fft2_interleaved_f32(values[5], values[14]); let [x6p13, x6m13] = parallel_fft2_interleaved_f32(values[6], values[13]); let [x7p12, x7m12] = parallel_fft2_interleaved_f32(values[7], values[12]); let [x8p11, x8m11] = parallel_fft2_interleaved_f32(values[8], values[11]); let [x9p10, x9m10] = parallel_fft2_interleaved_f32(values[9], values[10]); let t_a1_1 = f32x4_mul(self.twiddle1re, x1p18); let t_a1_2 = f32x4_mul(self.twiddle2re, x2p17); let t_a1_3 = f32x4_mul(self.twiddle3re, x3p16); let t_a1_4 = f32x4_mul(self.twiddle4re, x4p15); let t_a1_5 = f32x4_mul(self.twiddle5re, x5p14); let t_a1_6 = f32x4_mul(self.twiddle6re, x6p13); let t_a1_7 = f32x4_mul(self.twiddle7re, x7p12); let t_a1_8 = f32x4_mul(self.twiddle8re, x8p11); let t_a1_9 = f32x4_mul(self.twiddle9re, x9p10); let t_a2_1 = f32x4_mul(self.twiddle2re, x1p18); let t_a2_2 = f32x4_mul(self.twiddle4re, x2p17); let t_a2_3 = f32x4_mul(self.twiddle6re, x3p16); let t_a2_4 = f32x4_mul(self.twiddle8re, x4p15); let t_a2_5 = f32x4_mul(self.twiddle9re, x5p14); let t_a2_6 = f32x4_mul(self.twiddle7re, x6p13); let t_a2_7 = f32x4_mul(self.twiddle5re, x7p12); let t_a2_8 = f32x4_mul(self.twiddle3re, x8p11); let t_a2_9 = f32x4_mul(self.twiddle1re, x9p10); let t_a3_1 = f32x4_mul(self.twiddle3re, x1p18); let t_a3_2 = f32x4_mul(self.twiddle6re, x2p17); let t_a3_3 = f32x4_mul(self.twiddle9re, x3p16); let t_a3_4 = f32x4_mul(self.twiddle7re, x4p15); let t_a3_5 = f32x4_mul(self.twiddle4re, x5p14); let t_a3_6 = f32x4_mul(self.twiddle1re, x6p13); let t_a3_7 = f32x4_mul(self.twiddle2re, x7p12); let t_a3_8 = f32x4_mul(self.twiddle5re, x8p11); let t_a3_9 = f32x4_mul(self.twiddle8re, x9p10); let t_a4_1 = f32x4_mul(self.twiddle4re, x1p18); let t_a4_2 = f32x4_mul(self.twiddle8re, x2p17); let t_a4_3 = f32x4_mul(self.twiddle7re, x3p16); let t_a4_4 = f32x4_mul(self.twiddle3re, x4p15); let t_a4_5 = f32x4_mul(self.twiddle1re, x5p14); let t_a4_6 = f32x4_mul(self.twiddle5re, x6p13); let t_a4_7 = f32x4_mul(self.twiddle9re, x7p12); let t_a4_8 = f32x4_mul(self.twiddle6re, x8p11); let t_a4_9 = f32x4_mul(self.twiddle2re, x9p10); let t_a5_1 = f32x4_mul(self.twiddle5re, x1p18); let t_a5_2 = f32x4_mul(self.twiddle9re, x2p17); let t_a5_3 = f32x4_mul(self.twiddle4re, x3p16); let t_a5_4 = f32x4_mul(self.twiddle1re, x4p15); let t_a5_5 = f32x4_mul(self.twiddle6re, x5p14); let t_a5_6 = f32x4_mul(self.twiddle8re, x6p13); let t_a5_7 = f32x4_mul(self.twiddle3re, x7p12); let t_a5_8 = f32x4_mul(self.twiddle2re, x8p11); let t_a5_9 = f32x4_mul(self.twiddle7re, x9p10); let t_a6_1 = f32x4_mul(self.twiddle6re, x1p18); let t_a6_2 = f32x4_mul(self.twiddle7re, x2p17); let t_a6_3 = f32x4_mul(self.twiddle1re, x3p16); let t_a6_4 = f32x4_mul(self.twiddle5re, x4p15); let t_a6_5 = f32x4_mul(self.twiddle8re, x5p14); let t_a6_6 = f32x4_mul(self.twiddle2re, x6p13); let t_a6_7 = f32x4_mul(self.twiddle4re, x7p12); let t_a6_8 = f32x4_mul(self.twiddle9re, x8p11); let t_a6_9 = f32x4_mul(self.twiddle3re, x9p10); let t_a7_1 = f32x4_mul(self.twiddle7re, x1p18); let t_a7_2 = f32x4_mul(self.twiddle5re, x2p17); let t_a7_3 = f32x4_mul(self.twiddle2re, x3p16); let t_a7_4 = f32x4_mul(self.twiddle9re, x4p15); let t_a7_5 = f32x4_mul(self.twiddle3re, x5p14); let t_a7_6 = f32x4_mul(self.twiddle4re, x6p13); let t_a7_7 = f32x4_mul(self.twiddle8re, x7p12); let t_a7_8 = f32x4_mul(self.twiddle1re, x8p11); let t_a7_9 = f32x4_mul(self.twiddle6re, x9p10); let t_a8_1 = 
f32x4_mul(self.twiddle8re, x1p18); let t_a8_2 = f32x4_mul(self.twiddle3re, x2p17); let t_a8_3 = f32x4_mul(self.twiddle5re, x3p16); let t_a8_4 = f32x4_mul(self.twiddle6re, x4p15); let t_a8_5 = f32x4_mul(self.twiddle2re, x5p14); let t_a8_6 = f32x4_mul(self.twiddle9re, x6p13); let t_a8_7 = f32x4_mul(self.twiddle1re, x7p12); let t_a8_8 = f32x4_mul(self.twiddle7re, x8p11); let t_a8_9 = f32x4_mul(self.twiddle4re, x9p10); let t_a9_1 = f32x4_mul(self.twiddle9re, x1p18); let t_a9_2 = f32x4_mul(self.twiddle1re, x2p17); let t_a9_3 = f32x4_mul(self.twiddle8re, x3p16); let t_a9_4 = f32x4_mul(self.twiddle2re, x4p15); let t_a9_5 = f32x4_mul(self.twiddle7re, x5p14); let t_a9_6 = f32x4_mul(self.twiddle3re, x6p13); let t_a9_7 = f32x4_mul(self.twiddle6re, x7p12); let t_a9_8 = f32x4_mul(self.twiddle4re, x8p11); let t_a9_9 = f32x4_mul(self.twiddle5re, x9p10); let t_b1_1 = f32x4_mul(self.twiddle1im, x1m18); let t_b1_2 = f32x4_mul(self.twiddle2im, x2m17); let t_b1_3 = f32x4_mul(self.twiddle3im, x3m16); let t_b1_4 = f32x4_mul(self.twiddle4im, x4m15); let t_b1_5 = f32x4_mul(self.twiddle5im, x5m14); let t_b1_6 = f32x4_mul(self.twiddle6im, x6m13); let t_b1_7 = f32x4_mul(self.twiddle7im, x7m12); let t_b1_8 = f32x4_mul(self.twiddle8im, x8m11); let t_b1_9 = f32x4_mul(self.twiddle9im, x9m10); let t_b2_1 = f32x4_mul(self.twiddle2im, x1m18); let t_b2_2 = f32x4_mul(self.twiddle4im, x2m17); let t_b2_3 = f32x4_mul(self.twiddle6im, x3m16); let t_b2_4 = f32x4_mul(self.twiddle8im, x4m15); let t_b2_5 = f32x4_mul(self.twiddle9im, x5m14); let t_b2_6 = f32x4_mul(self.twiddle7im, x6m13); let t_b2_7 = f32x4_mul(self.twiddle5im, x7m12); let t_b2_8 = f32x4_mul(self.twiddle3im, x8m11); let t_b2_9 = f32x4_mul(self.twiddle1im, x9m10); let t_b3_1 = f32x4_mul(self.twiddle3im, x1m18); let t_b3_2 = f32x4_mul(self.twiddle6im, x2m17); let t_b3_3 = f32x4_mul(self.twiddle9im, x3m16); let t_b3_4 = f32x4_mul(self.twiddle7im, x4m15); let t_b3_5 = f32x4_mul(self.twiddle4im, x5m14); let t_b3_6 = f32x4_mul(self.twiddle1im, x6m13); let t_b3_7 = f32x4_mul(self.twiddle2im, x7m12); let t_b3_8 = f32x4_mul(self.twiddle5im, x8m11); let t_b3_9 = f32x4_mul(self.twiddle8im, x9m10); let t_b4_1 = f32x4_mul(self.twiddle4im, x1m18); let t_b4_2 = f32x4_mul(self.twiddle8im, x2m17); let t_b4_3 = f32x4_mul(self.twiddle7im, x3m16); let t_b4_4 = f32x4_mul(self.twiddle3im, x4m15); let t_b4_5 = f32x4_mul(self.twiddle1im, x5m14); let t_b4_6 = f32x4_mul(self.twiddle5im, x6m13); let t_b4_7 = f32x4_mul(self.twiddle9im, x7m12); let t_b4_8 = f32x4_mul(self.twiddle6im, x8m11); let t_b4_9 = f32x4_mul(self.twiddle2im, x9m10); let t_b5_1 = f32x4_mul(self.twiddle5im, x1m18); let t_b5_2 = f32x4_mul(self.twiddle9im, x2m17); let t_b5_3 = f32x4_mul(self.twiddle4im, x3m16); let t_b5_4 = f32x4_mul(self.twiddle1im, x4m15); let t_b5_5 = f32x4_mul(self.twiddle6im, x5m14); let t_b5_6 = f32x4_mul(self.twiddle8im, x6m13); let t_b5_7 = f32x4_mul(self.twiddle3im, x7m12); let t_b5_8 = f32x4_mul(self.twiddle2im, x8m11); let t_b5_9 = f32x4_mul(self.twiddle7im, x9m10); let t_b6_1 = f32x4_mul(self.twiddle6im, x1m18); let t_b6_2 = f32x4_mul(self.twiddle7im, x2m17); let t_b6_3 = f32x4_mul(self.twiddle1im, x3m16); let t_b6_4 = f32x4_mul(self.twiddle5im, x4m15); let t_b6_5 = f32x4_mul(self.twiddle8im, x5m14); let t_b6_6 = f32x4_mul(self.twiddle2im, x6m13); let t_b6_7 = f32x4_mul(self.twiddle4im, x7m12); let t_b6_8 = f32x4_mul(self.twiddle9im, x8m11); let t_b6_9 = f32x4_mul(self.twiddle3im, x9m10); let t_b7_1 = f32x4_mul(self.twiddle7im, x1m18); let t_b7_2 = f32x4_mul(self.twiddle5im, x2m17); let t_b7_3 = 
f32x4_mul(self.twiddle2im, x3m16); let t_b7_4 = f32x4_mul(self.twiddle9im, x4m15); let t_b7_5 = f32x4_mul(self.twiddle3im, x5m14); let t_b7_6 = f32x4_mul(self.twiddle4im, x6m13); let t_b7_7 = f32x4_mul(self.twiddle8im, x7m12); let t_b7_8 = f32x4_mul(self.twiddle1im, x8m11); let t_b7_9 = f32x4_mul(self.twiddle6im, x9m10); let t_b8_1 = f32x4_mul(self.twiddle8im, x1m18); let t_b8_2 = f32x4_mul(self.twiddle3im, x2m17); let t_b8_3 = f32x4_mul(self.twiddle5im, x3m16); let t_b8_4 = f32x4_mul(self.twiddle6im, x4m15); let t_b8_5 = f32x4_mul(self.twiddle2im, x5m14); let t_b8_6 = f32x4_mul(self.twiddle9im, x6m13); let t_b8_7 = f32x4_mul(self.twiddle1im, x7m12); let t_b8_8 = f32x4_mul(self.twiddle7im, x8m11); let t_b8_9 = f32x4_mul(self.twiddle4im, x9m10); let t_b9_1 = f32x4_mul(self.twiddle9im, x1m18); let t_b9_2 = f32x4_mul(self.twiddle1im, x2m17); let t_b9_3 = f32x4_mul(self.twiddle8im, x3m16); let t_b9_4 = f32x4_mul(self.twiddle2im, x4m15); let t_b9_5 = f32x4_mul(self.twiddle7im, x5m14); let t_b9_6 = f32x4_mul(self.twiddle3im, x6m13); let t_b9_7 = f32x4_mul(self.twiddle6im, x7m12); let t_b9_8 = f32x4_mul(self.twiddle4im, x8m11); let t_b9_9 = f32x4_mul(self.twiddle5im, x9m10); let x0 = values[0]; let t_a1 = calc_f32!( x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 ); let t_a2 = calc_f32!( x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 ); let t_a3 = calc_f32!( x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 ); let t_a4 = calc_f32!( x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 ); let t_a5 = calc_f32!( x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 ); let t_a6 = calc_f32!( x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 ); let t_a7 = calc_f32!( x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 ); let t_a8 = calc_f32!( x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 ); let t_a9 = calc_f32!( x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 ); let t_b1 = calc_f32!( t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 ); let t_b2 = calc_f32!( t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9 ); let t_b3 = calc_f32!( t_b3_1 + t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 + t_b3_7 + t_b3_8 + t_b3_9 ); let t_b4 = calc_f32!( t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 + t_b4_7 - t_b4_8 - t_b4_9 ); let t_b5 = calc_f32!( t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 - t_b5_7 + t_b5_8 + t_b5_9 ); let t_b6 = calc_f32!( t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 - t_b6_5 - t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 ); let t_b7 = calc_f32!( t_b7_1 - t_b7_2 + t_b7_3 + t_b7_4 - t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 ); let t_b8 = calc_f32!( t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 - t_b8_9 ); let t_b9 = calc_f32!( t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 - t_b9_8 + t_b9_9 ); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let t_b9_rot = self.rotate.rotate_both(t_b9); let y0 = 
calc_f32!(x0 + x1p18 + x2p17 + x3p16 + x4p15 + x5p14 + x6p13 + x7p12 + x8p11 + x9p10); let [y1, y18] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y17] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y16] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y15] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y14] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y13] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); let [y7, y12] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot); let [y8, y11] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot); let [y9, y10] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot); [ y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, ] } } // _ ___ __ _ _ _ _ _ // / |/ _ \ / /_ | || | | |__ (_) |_ // | | (_) | _____ | '_ \| || |_| '_ \| | __| // | |\__, | |_____| | (_) |__ _| |_) | | |_ // |_| /_/ \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly19 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, twiddle6re: v128, twiddle6im: v128, twiddle7re: v128, twiddle7im: v128, twiddle8re: v128, twiddle8im: v128, twiddle9re: v128, twiddle9im: v128, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly19, 19, |this: &WasmSimdF64Butterfly19<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly19, 19, |this: &WasmSimdF64Butterfly19<_>| this.direction ); impl WasmSimdF64Butterfly19 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 19, direction); let tw2: Complex = twiddles::compute_twiddle(2, 19, direction); let tw3: Complex = twiddles::compute_twiddle(3, 19, direction); let tw4: Complex = twiddles::compute_twiddle(4, 19, direction); let tw5: Complex = twiddles::compute_twiddle(5, 19, direction); let tw6: Complex = twiddles::compute_twiddle(6, 19, direction); let tw7: Complex = twiddles::compute_twiddle(7, 19, direction); let tw8: Complex = twiddles::compute_twiddle(8, 19, direction); let tw9: Complex = twiddles::compute_twiddle(9, 19, direction); let twiddle1re = f64x2_splat(tw1.re); let twiddle1im = f64x2_splat(tw1.im); let twiddle2re = f64x2_splat(tw2.re); let twiddle2im = f64x2_splat(tw2.im); let twiddle3re = f64x2_splat(tw3.re); let twiddle3im = f64x2_splat(tw3.im); let twiddle4re = f64x2_splat(tw4.re); let twiddle4im = f64x2_splat(tw4.im); let twiddle5re = f64x2_splat(tw5.re); let twiddle5im = f64x2_splat(tw5.im); let twiddle6re = f64x2_splat(tw6.re); let twiddle6im = f64x2_splat(tw6.im); let twiddle7re = f64x2_splat(tw7.re); let twiddle7im = f64x2_splat(tw7.im); let twiddle8re = f64x2_splat(tw8.re); let twiddle8im = f64x2_splat(tw8.im); let twiddle9re = f64x2_splat(tw9.re); let twiddle9im = f64x2_splat(tw9.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); let out = 
self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 19]) -> [v128; 19] { let [x1p18, x1m18] = solo_fft2_f64(values[1], values[18]); let [x2p17, x2m17] = solo_fft2_f64(values[2], values[17]); let [x3p16, x3m16] = solo_fft2_f64(values[3], values[16]); let [x4p15, x4m15] = solo_fft2_f64(values[4], values[15]); let [x5p14, x5m14] = solo_fft2_f64(values[5], values[14]); let [x6p13, x6m13] = solo_fft2_f64(values[6], values[13]); let [x7p12, x7m12] = solo_fft2_f64(values[7], values[12]); let [x8p11, x8m11] = solo_fft2_f64(values[8], values[11]); let [x9p10, x9m10] = solo_fft2_f64(values[9], values[10]); let t_a1_1 = f64x2_mul(self.twiddle1re, x1p18); let t_a1_2 = f64x2_mul(self.twiddle2re, x2p17); let t_a1_3 = f64x2_mul(self.twiddle3re, x3p16); let t_a1_4 = f64x2_mul(self.twiddle4re, x4p15); let t_a1_5 = f64x2_mul(self.twiddle5re, x5p14); let t_a1_6 = f64x2_mul(self.twiddle6re, x6p13); let t_a1_7 = f64x2_mul(self.twiddle7re, x7p12); let t_a1_8 = f64x2_mul(self.twiddle8re, x8p11); let t_a1_9 = f64x2_mul(self.twiddle9re, x9p10); let t_a2_1 = f64x2_mul(self.twiddle2re, x1p18); let t_a2_2 = f64x2_mul(self.twiddle4re, x2p17); let t_a2_3 = f64x2_mul(self.twiddle6re, x3p16); let t_a2_4 = f64x2_mul(self.twiddle8re, x4p15); let t_a2_5 = f64x2_mul(self.twiddle9re, x5p14); let t_a2_6 = f64x2_mul(self.twiddle7re, x6p13); let t_a2_7 = f64x2_mul(self.twiddle5re, x7p12); let t_a2_8 = f64x2_mul(self.twiddle3re, x8p11); let t_a2_9 = f64x2_mul(self.twiddle1re, x9p10); let t_a3_1 = f64x2_mul(self.twiddle3re, x1p18); let t_a3_2 = f64x2_mul(self.twiddle6re, x2p17); let t_a3_3 = f64x2_mul(self.twiddle9re, x3p16); let t_a3_4 = f64x2_mul(self.twiddle7re, x4p15); let t_a3_5 = f64x2_mul(self.twiddle4re, x5p14); let t_a3_6 = f64x2_mul(self.twiddle1re, x6p13); let t_a3_7 = f64x2_mul(self.twiddle2re, x7p12); let t_a3_8 = f64x2_mul(self.twiddle5re, x8p11); let t_a3_9 = f64x2_mul(self.twiddle8re, x9p10); let t_a4_1 = f64x2_mul(self.twiddle4re, x1p18); let t_a4_2 = f64x2_mul(self.twiddle8re, x2p17); let t_a4_3 = f64x2_mul(self.twiddle7re, x3p16); let t_a4_4 = f64x2_mul(self.twiddle3re, x4p15); let t_a4_5 = f64x2_mul(self.twiddle1re, x5p14); let t_a4_6 = f64x2_mul(self.twiddle5re, x6p13); let t_a4_7 = f64x2_mul(self.twiddle9re, x7p12); let t_a4_8 = f64x2_mul(self.twiddle6re, x8p11); let t_a4_9 = f64x2_mul(self.twiddle2re, x9p10); let t_a5_1 = f64x2_mul(self.twiddle5re, x1p18); let t_a5_2 = f64x2_mul(self.twiddle9re, x2p17); let t_a5_3 = f64x2_mul(self.twiddle4re, x3p16); let t_a5_4 = f64x2_mul(self.twiddle1re, x4p15); let t_a5_5 = f64x2_mul(self.twiddle6re, x5p14); let t_a5_6 = f64x2_mul(self.twiddle8re, x6p13); let t_a5_7 = f64x2_mul(self.twiddle3re, x7p12); let t_a5_8 = f64x2_mul(self.twiddle2re, x8p11); let t_a5_9 = f64x2_mul(self.twiddle7re, x9p10); let t_a6_1 = f64x2_mul(self.twiddle6re, x1p18); let t_a6_2 = f64x2_mul(self.twiddle7re, x2p17); let t_a6_3 = f64x2_mul(self.twiddle1re, x3p16); let t_a6_4 = f64x2_mul(self.twiddle5re, x4p15); let t_a6_5 = f64x2_mul(self.twiddle8re, x5p14); let t_a6_6 = f64x2_mul(self.twiddle2re, x6p13); let t_a6_7 = f64x2_mul(self.twiddle4re, x7p12); let t_a6_8 = f64x2_mul(self.twiddle9re, x8p11); let t_a6_9 = f64x2_mul(self.twiddle3re, x9p10); let t_a7_1 = f64x2_mul(self.twiddle7re, x1p18); let t_a7_2 = f64x2_mul(self.twiddle5re, x2p17); let t_a7_3 = f64x2_mul(self.twiddle2re, x3p16); let t_a7_4 = f64x2_mul(self.twiddle9re, 
x4p15); let t_a7_5 = f64x2_mul(self.twiddle3re, x5p14); let t_a7_6 = f64x2_mul(self.twiddle4re, x6p13); let t_a7_7 = f64x2_mul(self.twiddle8re, x7p12); let t_a7_8 = f64x2_mul(self.twiddle1re, x8p11); let t_a7_9 = f64x2_mul(self.twiddle6re, x9p10); let t_a8_1 = f64x2_mul(self.twiddle8re, x1p18); let t_a8_2 = f64x2_mul(self.twiddle3re, x2p17); let t_a8_3 = f64x2_mul(self.twiddle5re, x3p16); let t_a8_4 = f64x2_mul(self.twiddle6re, x4p15); let t_a8_5 = f64x2_mul(self.twiddle2re, x5p14); let t_a8_6 = f64x2_mul(self.twiddle9re, x6p13); let t_a8_7 = f64x2_mul(self.twiddle1re, x7p12); let t_a8_8 = f64x2_mul(self.twiddle7re, x8p11); let t_a8_9 = f64x2_mul(self.twiddle4re, x9p10); let t_a9_1 = f64x2_mul(self.twiddle9re, x1p18); let t_a9_2 = f64x2_mul(self.twiddle1re, x2p17); let t_a9_3 = f64x2_mul(self.twiddle8re, x3p16); let t_a9_4 = f64x2_mul(self.twiddle2re, x4p15); let t_a9_5 = f64x2_mul(self.twiddle7re, x5p14); let t_a9_6 = f64x2_mul(self.twiddle3re, x6p13); let t_a9_7 = f64x2_mul(self.twiddle6re, x7p12); let t_a9_8 = f64x2_mul(self.twiddle4re, x8p11); let t_a9_9 = f64x2_mul(self.twiddle5re, x9p10); let t_b1_1 = f64x2_mul(self.twiddle1im, x1m18); let t_b1_2 = f64x2_mul(self.twiddle2im, x2m17); let t_b1_3 = f64x2_mul(self.twiddle3im, x3m16); let t_b1_4 = f64x2_mul(self.twiddle4im, x4m15); let t_b1_5 = f64x2_mul(self.twiddle5im, x5m14); let t_b1_6 = f64x2_mul(self.twiddle6im, x6m13); let t_b1_7 = f64x2_mul(self.twiddle7im, x7m12); let t_b1_8 = f64x2_mul(self.twiddle8im, x8m11); let t_b1_9 = f64x2_mul(self.twiddle9im, x9m10); let t_b2_1 = f64x2_mul(self.twiddle2im, x1m18); let t_b2_2 = f64x2_mul(self.twiddle4im, x2m17); let t_b2_3 = f64x2_mul(self.twiddle6im, x3m16); let t_b2_4 = f64x2_mul(self.twiddle8im, x4m15); let t_b2_5 = f64x2_mul(self.twiddle9im, x5m14); let t_b2_6 = f64x2_mul(self.twiddle7im, x6m13); let t_b2_7 = f64x2_mul(self.twiddle5im, x7m12); let t_b2_8 = f64x2_mul(self.twiddle3im, x8m11); let t_b2_9 = f64x2_mul(self.twiddle1im, x9m10); let t_b3_1 = f64x2_mul(self.twiddle3im, x1m18); let t_b3_2 = f64x2_mul(self.twiddle6im, x2m17); let t_b3_3 = f64x2_mul(self.twiddle9im, x3m16); let t_b3_4 = f64x2_mul(self.twiddle7im, x4m15); let t_b3_5 = f64x2_mul(self.twiddle4im, x5m14); let t_b3_6 = f64x2_mul(self.twiddle1im, x6m13); let t_b3_7 = f64x2_mul(self.twiddle2im, x7m12); let t_b3_8 = f64x2_mul(self.twiddle5im, x8m11); let t_b3_9 = f64x2_mul(self.twiddle8im, x9m10); let t_b4_1 = f64x2_mul(self.twiddle4im, x1m18); let t_b4_2 = f64x2_mul(self.twiddle8im, x2m17); let t_b4_3 = f64x2_mul(self.twiddle7im, x3m16); let t_b4_4 = f64x2_mul(self.twiddle3im, x4m15); let t_b4_5 = f64x2_mul(self.twiddle1im, x5m14); let t_b4_6 = f64x2_mul(self.twiddle5im, x6m13); let t_b4_7 = f64x2_mul(self.twiddle9im, x7m12); let t_b4_8 = f64x2_mul(self.twiddle6im, x8m11); let t_b4_9 = f64x2_mul(self.twiddle2im, x9m10); let t_b5_1 = f64x2_mul(self.twiddle5im, x1m18); let t_b5_2 = f64x2_mul(self.twiddle9im, x2m17); let t_b5_3 = f64x2_mul(self.twiddle4im, x3m16); let t_b5_4 = f64x2_mul(self.twiddle1im, x4m15); let t_b5_5 = f64x2_mul(self.twiddle6im, x5m14); let t_b5_6 = f64x2_mul(self.twiddle8im, x6m13); let t_b5_7 = f64x2_mul(self.twiddle3im, x7m12); let t_b5_8 = f64x2_mul(self.twiddle2im, x8m11); let t_b5_9 = f64x2_mul(self.twiddle7im, x9m10); let t_b6_1 = f64x2_mul(self.twiddle6im, x1m18); let t_b6_2 = f64x2_mul(self.twiddle7im, x2m17); let t_b6_3 = f64x2_mul(self.twiddle1im, x3m16); let t_b6_4 = f64x2_mul(self.twiddle5im, x4m15); let t_b6_5 = f64x2_mul(self.twiddle8im, x5m14); let t_b6_6 = f64x2_mul(self.twiddle2im, 
x6m13); let t_b6_7 = f64x2_mul(self.twiddle4im, x7m12); let t_b6_8 = f64x2_mul(self.twiddle9im, x8m11); let t_b6_9 = f64x2_mul(self.twiddle3im, x9m10); let t_b7_1 = f64x2_mul(self.twiddle7im, x1m18); let t_b7_2 = f64x2_mul(self.twiddle5im, x2m17); let t_b7_3 = f64x2_mul(self.twiddle2im, x3m16); let t_b7_4 = f64x2_mul(self.twiddle9im, x4m15); let t_b7_5 = f64x2_mul(self.twiddle3im, x5m14); let t_b7_6 = f64x2_mul(self.twiddle4im, x6m13); let t_b7_7 = f64x2_mul(self.twiddle8im, x7m12); let t_b7_8 = f64x2_mul(self.twiddle1im, x8m11); let t_b7_9 = f64x2_mul(self.twiddle6im, x9m10); let t_b8_1 = f64x2_mul(self.twiddle8im, x1m18); let t_b8_2 = f64x2_mul(self.twiddle3im, x2m17); let t_b8_3 = f64x2_mul(self.twiddle5im, x3m16); let t_b8_4 = f64x2_mul(self.twiddle6im, x4m15); let t_b8_5 = f64x2_mul(self.twiddle2im, x5m14); let t_b8_6 = f64x2_mul(self.twiddle9im, x6m13); let t_b8_7 = f64x2_mul(self.twiddle1im, x7m12); let t_b8_8 = f64x2_mul(self.twiddle7im, x8m11); let t_b8_9 = f64x2_mul(self.twiddle4im, x9m10); let t_b9_1 = f64x2_mul(self.twiddle9im, x1m18); let t_b9_2 = f64x2_mul(self.twiddle1im, x2m17); let t_b9_3 = f64x2_mul(self.twiddle8im, x3m16); let t_b9_4 = f64x2_mul(self.twiddle2im, x4m15); let t_b9_5 = f64x2_mul(self.twiddle7im, x5m14); let t_b9_6 = f64x2_mul(self.twiddle3im, x6m13); let t_b9_7 = f64x2_mul(self.twiddle6im, x7m12); let t_b9_8 = f64x2_mul(self.twiddle4im, x8m11); let t_b9_9 = f64x2_mul(self.twiddle5im, x9m10); let x0 = values[0]; let t_a1 = calc_f64!( x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 ); let t_a2 = calc_f64!( x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 ); let t_a3 = calc_f64!( x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 ); let t_a4 = calc_f64!( x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 ); let t_a5 = calc_f64!( x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 ); let t_a6 = calc_f64!( x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 ); let t_a7 = calc_f64!( x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 ); let t_a8 = calc_f64!( x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 ); let t_a9 = calc_f64!( x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 ); let t_b1 = calc_f64!( t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 ); let t_b2 = calc_f64!( t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 - t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9 ); let t_b3 = calc_f64!( t_b3_1 + t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 + t_b3_7 + t_b3_8 + t_b3_9 ); let t_b4 = calc_f64!( t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 + t_b4_5 + t_b4_6 + t_b4_7 - t_b4_8 - t_b4_9 ); let t_b5 = calc_f64!( t_b5_1 - t_b5_2 - t_b5_3 + t_b5_4 + t_b5_5 - t_b5_6 - t_b5_7 + t_b5_8 + t_b5_9 ); let t_b6 = calc_f64!( t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 - t_b6_5 - t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 ); let t_b7 = calc_f64!( t_b7_1 - t_b7_2 + t_b7_3 + t_b7_4 - t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 ); let t_b8 = calc_f64!( t_b8_1 - t_b8_2 + t_b8_3 - t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 - t_b8_9 ); let t_b9 = calc_f64!( t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 - t_b9_8 + t_b9_9 ); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let 
t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let y0 = calc_f64!(x0 + x1p18 + x2p17 + x3p16 + x4p15 + x5p14 + x6p13 + x7p12 + x8p11 + x9p10); let [y1, y18] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y17] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y16] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y15] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y14] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y13] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y12] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y11] = solo_fft2_f64(t_a8, t_b8_rot); let [y9, y10] = solo_fft2_f64(t_a9, t_b9_rot); [ y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, ] } } // ____ _____ _________ _ _ _ // |___ \|___ / |___ /___ \| |__ (_) |_ // __) | |_ \ _____ |_ \ __) | '_ \| | __| // / __/ ___) | |_____| ___) / __/| |_) | | |_ // |_____|____/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly23 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, twiddle6re: v128, twiddle6im: v128, twiddle7re: v128, twiddle7im: v128, twiddle8re: v128, twiddle8im: v128, twiddle9re: v128, twiddle9im: v128, twiddle10re: v128, twiddle10im: v128, twiddle11re: v128, twiddle11im: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly23, 23, |this: &WasmSimdF32Butterfly23<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly23, 23, |this: &WasmSimdF32Butterfly23<_>| this.direction ); impl WasmSimdF32Butterfly23 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 23, direction); let tw2: Complex = twiddles::compute_twiddle(2, 23, direction); let tw3: Complex = twiddles::compute_twiddle(3, 23, direction); let tw4: Complex = twiddles::compute_twiddle(4, 23, direction); let tw5: Complex = twiddles::compute_twiddle(5, 23, direction); let tw6: Complex = twiddles::compute_twiddle(6, 23, direction); let tw7: Complex = twiddles::compute_twiddle(7, 23, direction); let tw8: Complex = twiddles::compute_twiddle(8, 23, direction); let tw9: Complex = twiddles::compute_twiddle(9, 23, direction); let tw10: Complex = twiddles::compute_twiddle(10, 23, direction); let tw11: Complex = twiddles::compute_twiddle(11, 23, direction); let twiddle1re = f32x4_splat(tw1.re); let twiddle1im = f32x4_splat(tw1.im); let twiddle2re = f32x4_splat(tw2.re); let twiddle2im = f32x4_splat(tw2.im); let twiddle3re = f32x4_splat(tw3.re); let twiddle3im = f32x4_splat(tw3.im); let twiddle4re = f32x4_splat(tw4.re); let twiddle4im = f32x4_splat(tw4.im); let twiddle5re = f32x4_splat(tw5.re); let twiddle5im = f32x4_splat(tw5.im); let twiddle6re = f32x4_splat(tw6.re); let twiddle6im = f32x4_splat(tw6.im); let twiddle7re = f32x4_splat(tw7.re); let twiddle7im = f32x4_splat(tw7.im); let twiddle8re = f32x4_splat(tw8.re); let twiddle8im = f32x4_splat(tw8.im); let twiddle9re = f32x4_splat(tw9.re); let twiddle9im = f32x4_splat(tw9.im); let twiddle10re = f32x4_splat(tw10.re); let twiddle10im = f32x4_splat(tw10.im); let twiddle11re = f32x4_splat(tw11.re); let twiddle11im = f32x4_splat(tw11.im); Self { direction, _phantom: std::marker::PhantomData, rotate, 
twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[11]), extract_hi_lo_f32(input_packed[0], input_packed[12]), extract_lo_hi_f32(input_packed[1], input_packed[12]), extract_hi_lo_f32(input_packed[1], input_packed[13]), extract_lo_hi_f32(input_packed[2], input_packed[13]), extract_hi_lo_f32(input_packed[2], input_packed[14]), extract_lo_hi_f32(input_packed[3], input_packed[14]), extract_hi_lo_f32(input_packed[3], input_packed[15]), extract_lo_hi_f32(input_packed[4], input_packed[15]), extract_hi_lo_f32(input_packed[4], input_packed[16]), extract_lo_hi_f32(input_packed[5], input_packed[16]), extract_hi_lo_f32(input_packed[5], input_packed[17]), extract_lo_hi_f32(input_packed[6], input_packed[17]), extract_hi_lo_f32(input_packed[6], input_packed[18]), extract_lo_hi_f32(input_packed[7], input_packed[18]), extract_hi_lo_f32(input_packed[7], input_packed[19]), extract_lo_hi_f32(input_packed[8], input_packed[19]), extract_hi_lo_f32(input_packed[8], input_packed[20]), extract_lo_hi_f32(input_packed[9], input_packed[20]), extract_hi_lo_f32(input_packed[9], input_packed[21]), extract_lo_hi_f32(input_packed[10], input_packed[21]), extract_hi_lo_f32(input_packed[10], input_packed[22]), extract_lo_hi_f32(input_packed[11], input_packed[22]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_lo_f32(out[16], out[17]), extract_lo_lo_f32(out[18], out[19]), extract_lo_lo_f32(out[20], out[21]), extract_lo_hi_f32(out[22], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), extract_hi_hi_f32(out[19], out[20]), extract_hi_hi_f32(out[21], out[22]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 23]) -> [v128; 23] { let [x1p22, x1m22] = parallel_fft2_interleaved_f32(values[1], values[22]); let [x2p21, x2m21] = parallel_fft2_interleaved_f32(values[2], values[21]); let [x3p20, x3m20] = 
parallel_fft2_interleaved_f32(values[3], values[20]); let [x4p19, x4m19] = parallel_fft2_interleaved_f32(values[4], values[19]); let [x5p18, x5m18] = parallel_fft2_interleaved_f32(values[5], values[18]); let [x6p17, x6m17] = parallel_fft2_interleaved_f32(values[6], values[17]); let [x7p16, x7m16] = parallel_fft2_interleaved_f32(values[7], values[16]); let [x8p15, x8m15] = parallel_fft2_interleaved_f32(values[8], values[15]); let [x9p14, x9m14] = parallel_fft2_interleaved_f32(values[9], values[14]); let [x10p13, x10m13] = parallel_fft2_interleaved_f32(values[10], values[13]); let [x11p12, x11m12] = parallel_fft2_interleaved_f32(values[11], values[12]); let t_a1_1 = f32x4_mul(self.twiddle1re, x1p22); let t_a1_2 = f32x4_mul(self.twiddle2re, x2p21); let t_a1_3 = f32x4_mul(self.twiddle3re, x3p20); let t_a1_4 = f32x4_mul(self.twiddle4re, x4p19); let t_a1_5 = f32x4_mul(self.twiddle5re, x5p18); let t_a1_6 = f32x4_mul(self.twiddle6re, x6p17); let t_a1_7 = f32x4_mul(self.twiddle7re, x7p16); let t_a1_8 = f32x4_mul(self.twiddle8re, x8p15); let t_a1_9 = f32x4_mul(self.twiddle9re, x9p14); let t_a1_10 = f32x4_mul(self.twiddle10re, x10p13); let t_a1_11 = f32x4_mul(self.twiddle11re, x11p12); let t_a2_1 = f32x4_mul(self.twiddle2re, x1p22); let t_a2_2 = f32x4_mul(self.twiddle4re, x2p21); let t_a2_3 = f32x4_mul(self.twiddle6re, x3p20); let t_a2_4 = f32x4_mul(self.twiddle8re, x4p19); let t_a2_5 = f32x4_mul(self.twiddle10re, x5p18); let t_a2_6 = f32x4_mul(self.twiddle11re, x6p17); let t_a2_7 = f32x4_mul(self.twiddle9re, x7p16); let t_a2_8 = f32x4_mul(self.twiddle7re, x8p15); let t_a2_9 = f32x4_mul(self.twiddle5re, x9p14); let t_a2_10 = f32x4_mul(self.twiddle3re, x10p13); let t_a2_11 = f32x4_mul(self.twiddle1re, x11p12); let t_a3_1 = f32x4_mul(self.twiddle3re, x1p22); let t_a3_2 = f32x4_mul(self.twiddle6re, x2p21); let t_a3_3 = f32x4_mul(self.twiddle9re, x3p20); let t_a3_4 = f32x4_mul(self.twiddle11re, x4p19); let t_a3_5 = f32x4_mul(self.twiddle8re, x5p18); let t_a3_6 = f32x4_mul(self.twiddle5re, x6p17); let t_a3_7 = f32x4_mul(self.twiddle2re, x7p16); let t_a3_8 = f32x4_mul(self.twiddle1re, x8p15); let t_a3_9 = f32x4_mul(self.twiddle4re, x9p14); let t_a3_10 = f32x4_mul(self.twiddle7re, x10p13); let t_a3_11 = f32x4_mul(self.twiddle10re, x11p12); let t_a4_1 = f32x4_mul(self.twiddle4re, x1p22); let t_a4_2 = f32x4_mul(self.twiddle8re, x2p21); let t_a4_3 = f32x4_mul(self.twiddle11re, x3p20); let t_a4_4 = f32x4_mul(self.twiddle7re, x4p19); let t_a4_5 = f32x4_mul(self.twiddle3re, x5p18); let t_a4_6 = f32x4_mul(self.twiddle1re, x6p17); let t_a4_7 = f32x4_mul(self.twiddle5re, x7p16); let t_a4_8 = f32x4_mul(self.twiddle9re, x8p15); let t_a4_9 = f32x4_mul(self.twiddle10re, x9p14); let t_a4_10 = f32x4_mul(self.twiddle6re, x10p13); let t_a4_11 = f32x4_mul(self.twiddle2re, x11p12); let t_a5_1 = f32x4_mul(self.twiddle5re, x1p22); let t_a5_2 = f32x4_mul(self.twiddle10re, x2p21); let t_a5_3 = f32x4_mul(self.twiddle8re, x3p20); let t_a5_4 = f32x4_mul(self.twiddle3re, x4p19); let t_a5_5 = f32x4_mul(self.twiddle2re, x5p18); let t_a5_6 = f32x4_mul(self.twiddle7re, x6p17); let t_a5_7 = f32x4_mul(self.twiddle11re, x7p16); let t_a5_8 = f32x4_mul(self.twiddle6re, x8p15); let t_a5_9 = f32x4_mul(self.twiddle1re, x9p14); let t_a5_10 = f32x4_mul(self.twiddle4re, x10p13); let t_a5_11 = f32x4_mul(self.twiddle9re, x11p12); let t_a6_1 = f32x4_mul(self.twiddle6re, x1p22); let t_a6_2 = f32x4_mul(self.twiddle11re, x2p21); let t_a6_3 = f32x4_mul(self.twiddle5re, x3p20); let t_a6_4 = f32x4_mul(self.twiddle1re, x4p19); let t_a6_5 = 
f32x4_mul(self.twiddle7re, x5p18); let t_a6_6 = f32x4_mul(self.twiddle10re, x6p17); let t_a6_7 = f32x4_mul(self.twiddle4re, x7p16); let t_a6_8 = f32x4_mul(self.twiddle2re, x8p15); let t_a6_9 = f32x4_mul(self.twiddle8re, x9p14); let t_a6_10 = f32x4_mul(self.twiddle9re, x10p13); let t_a6_11 = f32x4_mul(self.twiddle3re, x11p12); let t_a7_1 = f32x4_mul(self.twiddle7re, x1p22); let t_a7_2 = f32x4_mul(self.twiddle9re, x2p21); let t_a7_3 = f32x4_mul(self.twiddle2re, x3p20); let t_a7_4 = f32x4_mul(self.twiddle5re, x4p19); let t_a7_5 = f32x4_mul(self.twiddle11re, x5p18); let t_a7_6 = f32x4_mul(self.twiddle4re, x6p17); let t_a7_7 = f32x4_mul(self.twiddle3re, x7p16); let t_a7_8 = f32x4_mul(self.twiddle10re, x8p15); let t_a7_9 = f32x4_mul(self.twiddle6re, x9p14); let t_a7_10 = f32x4_mul(self.twiddle1re, x10p13); let t_a7_11 = f32x4_mul(self.twiddle8re, x11p12); let t_a8_1 = f32x4_mul(self.twiddle8re, x1p22); let t_a8_2 = f32x4_mul(self.twiddle7re, x2p21); let t_a8_3 = f32x4_mul(self.twiddle1re, x3p20); let t_a8_4 = f32x4_mul(self.twiddle9re, x4p19); let t_a8_5 = f32x4_mul(self.twiddle6re, x5p18); let t_a8_6 = f32x4_mul(self.twiddle2re, x6p17); let t_a8_7 = f32x4_mul(self.twiddle10re, x7p16); let t_a8_8 = f32x4_mul(self.twiddle5re, x8p15); let t_a8_9 = f32x4_mul(self.twiddle3re, x9p14); let t_a8_10 = f32x4_mul(self.twiddle11re, x10p13); let t_a8_11 = f32x4_mul(self.twiddle4re, x11p12); let t_a9_1 = f32x4_mul(self.twiddle9re, x1p22); let t_a9_2 = f32x4_mul(self.twiddle5re, x2p21); let t_a9_3 = f32x4_mul(self.twiddle4re, x3p20); let t_a9_4 = f32x4_mul(self.twiddle10re, x4p19); let t_a9_5 = f32x4_mul(self.twiddle1re, x5p18); let t_a9_6 = f32x4_mul(self.twiddle8re, x6p17); let t_a9_7 = f32x4_mul(self.twiddle6re, x7p16); let t_a9_8 = f32x4_mul(self.twiddle3re, x8p15); let t_a9_9 = f32x4_mul(self.twiddle11re, x9p14); let t_a9_10 = f32x4_mul(self.twiddle2re, x10p13); let t_a9_11 = f32x4_mul(self.twiddle7re, x11p12); let t_a10_1 = f32x4_mul(self.twiddle10re, x1p22); let t_a10_2 = f32x4_mul(self.twiddle3re, x2p21); let t_a10_3 = f32x4_mul(self.twiddle7re, x3p20); let t_a10_4 = f32x4_mul(self.twiddle6re, x4p19); let t_a10_5 = f32x4_mul(self.twiddle4re, x5p18); let t_a10_6 = f32x4_mul(self.twiddle9re, x6p17); let t_a10_7 = f32x4_mul(self.twiddle1re, x7p16); let t_a10_8 = f32x4_mul(self.twiddle11re, x8p15); let t_a10_9 = f32x4_mul(self.twiddle2re, x9p14); let t_a10_10 = f32x4_mul(self.twiddle8re, x10p13); let t_a10_11 = f32x4_mul(self.twiddle5re, x11p12); let t_a11_1 = f32x4_mul(self.twiddle11re, x1p22); let t_a11_2 = f32x4_mul(self.twiddle1re, x2p21); let t_a11_3 = f32x4_mul(self.twiddle10re, x3p20); let t_a11_4 = f32x4_mul(self.twiddle2re, x4p19); let t_a11_5 = f32x4_mul(self.twiddle9re, x5p18); let t_a11_6 = f32x4_mul(self.twiddle3re, x6p17); let t_a11_7 = f32x4_mul(self.twiddle8re, x7p16); let t_a11_8 = f32x4_mul(self.twiddle4re, x8p15); let t_a11_9 = f32x4_mul(self.twiddle7re, x9p14); let t_a11_10 = f32x4_mul(self.twiddle5re, x10p13); let t_a11_11 = f32x4_mul(self.twiddle6re, x11p12); let t_b1_1 = f32x4_mul(self.twiddle1im, x1m22); let t_b1_2 = f32x4_mul(self.twiddle2im, x2m21); let t_b1_3 = f32x4_mul(self.twiddle3im, x3m20); let t_b1_4 = f32x4_mul(self.twiddle4im, x4m19); let t_b1_5 = f32x4_mul(self.twiddle5im, x5m18); let t_b1_6 = f32x4_mul(self.twiddle6im, x6m17); let t_b1_7 = f32x4_mul(self.twiddle7im, x7m16); let t_b1_8 = f32x4_mul(self.twiddle8im, x8m15); let t_b1_9 = f32x4_mul(self.twiddle9im, x9m14); let t_b1_10 = f32x4_mul(self.twiddle10im, x10m13); let t_b1_11 = f32x4_mul(self.twiddle11im, x11m12); 
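        // Descriptive note on the structure of this kernel: the length-23 DFT is
        // evaluated through its index symmetry. parallel_fft2_interleaved_f32 folds
        // the inputs into pair sums x[k] + x[23 - k] (the xNpM values) and pair
        // differences x[k] - x[23 - k] (the xNmM values). The t_aR_k products
        // combine the pair sums with the real parts of the twiddles, and the t_bR_k
        // products combine the pair differences with the imaginary parts. Each
        // output pair y[R], y[23 - R] is then recovered below as t_aR combined with
        // a 90-degree rotation of t_bR, i.e. effectively t_aR +/- i * t_bR.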
let t_b2_1 = f32x4_mul(self.twiddle2im, x1m22); let t_b2_2 = f32x4_mul(self.twiddle4im, x2m21); let t_b2_3 = f32x4_mul(self.twiddle6im, x3m20); let t_b2_4 = f32x4_mul(self.twiddle8im, x4m19); let t_b2_5 = f32x4_mul(self.twiddle10im, x5m18); let t_b2_6 = f32x4_mul(self.twiddle11im, x6m17); let t_b2_7 = f32x4_mul(self.twiddle9im, x7m16); let t_b2_8 = f32x4_mul(self.twiddle7im, x8m15); let t_b2_9 = f32x4_mul(self.twiddle5im, x9m14); let t_b2_10 = f32x4_mul(self.twiddle3im, x10m13); let t_b2_11 = f32x4_mul(self.twiddle1im, x11m12); let t_b3_1 = f32x4_mul(self.twiddle3im, x1m22); let t_b3_2 = f32x4_mul(self.twiddle6im, x2m21); let t_b3_3 = f32x4_mul(self.twiddle9im, x3m20); let t_b3_4 = f32x4_mul(self.twiddle11im, x4m19); let t_b3_5 = f32x4_mul(self.twiddle8im, x5m18); let t_b3_6 = f32x4_mul(self.twiddle5im, x6m17); let t_b3_7 = f32x4_mul(self.twiddle2im, x7m16); let t_b3_8 = f32x4_mul(self.twiddle1im, x8m15); let t_b3_9 = f32x4_mul(self.twiddle4im, x9m14); let t_b3_10 = f32x4_mul(self.twiddle7im, x10m13); let t_b3_11 = f32x4_mul(self.twiddle10im, x11m12); let t_b4_1 = f32x4_mul(self.twiddle4im, x1m22); let t_b4_2 = f32x4_mul(self.twiddle8im, x2m21); let t_b4_3 = f32x4_mul(self.twiddle11im, x3m20); let t_b4_4 = f32x4_mul(self.twiddle7im, x4m19); let t_b4_5 = f32x4_mul(self.twiddle3im, x5m18); let t_b4_6 = f32x4_mul(self.twiddle1im, x6m17); let t_b4_7 = f32x4_mul(self.twiddle5im, x7m16); let t_b4_8 = f32x4_mul(self.twiddle9im, x8m15); let t_b4_9 = f32x4_mul(self.twiddle10im, x9m14); let t_b4_10 = f32x4_mul(self.twiddle6im, x10m13); let t_b4_11 = f32x4_mul(self.twiddle2im, x11m12); let t_b5_1 = f32x4_mul(self.twiddle5im, x1m22); let t_b5_2 = f32x4_mul(self.twiddle10im, x2m21); let t_b5_3 = f32x4_mul(self.twiddle8im, x3m20); let t_b5_4 = f32x4_mul(self.twiddle3im, x4m19); let t_b5_5 = f32x4_mul(self.twiddle2im, x5m18); let t_b5_6 = f32x4_mul(self.twiddle7im, x6m17); let t_b5_7 = f32x4_mul(self.twiddle11im, x7m16); let t_b5_8 = f32x4_mul(self.twiddle6im, x8m15); let t_b5_9 = f32x4_mul(self.twiddle1im, x9m14); let t_b5_10 = f32x4_mul(self.twiddle4im, x10m13); let t_b5_11 = f32x4_mul(self.twiddle9im, x11m12); let t_b6_1 = f32x4_mul(self.twiddle6im, x1m22); let t_b6_2 = f32x4_mul(self.twiddle11im, x2m21); let t_b6_3 = f32x4_mul(self.twiddle5im, x3m20); let t_b6_4 = f32x4_mul(self.twiddle1im, x4m19); let t_b6_5 = f32x4_mul(self.twiddle7im, x5m18); let t_b6_6 = f32x4_mul(self.twiddle10im, x6m17); let t_b6_7 = f32x4_mul(self.twiddle4im, x7m16); let t_b6_8 = f32x4_mul(self.twiddle2im, x8m15); let t_b6_9 = f32x4_mul(self.twiddle8im, x9m14); let t_b6_10 = f32x4_mul(self.twiddle9im, x10m13); let t_b6_11 = f32x4_mul(self.twiddle3im, x11m12); let t_b7_1 = f32x4_mul(self.twiddle7im, x1m22); let t_b7_2 = f32x4_mul(self.twiddle9im, x2m21); let t_b7_3 = f32x4_mul(self.twiddle2im, x3m20); let t_b7_4 = f32x4_mul(self.twiddle5im, x4m19); let t_b7_5 = f32x4_mul(self.twiddle11im, x5m18); let t_b7_6 = f32x4_mul(self.twiddle4im, x6m17); let t_b7_7 = f32x4_mul(self.twiddle3im, x7m16); let t_b7_8 = f32x4_mul(self.twiddle10im, x8m15); let t_b7_9 = f32x4_mul(self.twiddle6im, x9m14); let t_b7_10 = f32x4_mul(self.twiddle1im, x10m13); let t_b7_11 = f32x4_mul(self.twiddle8im, x11m12); let t_b8_1 = f32x4_mul(self.twiddle8im, x1m22); let t_b8_2 = f32x4_mul(self.twiddle7im, x2m21); let t_b8_3 = f32x4_mul(self.twiddle1im, x3m20); let t_b8_4 = f32x4_mul(self.twiddle9im, x4m19); let t_b8_5 = f32x4_mul(self.twiddle6im, x5m18); let t_b8_6 = f32x4_mul(self.twiddle2im, x6m17); let t_b8_7 = f32x4_mul(self.twiddle10im, x7m16); let t_b8_8 = 
f32x4_mul(self.twiddle5im, x8m15); let t_b8_9 = f32x4_mul(self.twiddle3im, x9m14); let t_b8_10 = f32x4_mul(self.twiddle11im, x10m13); let t_b8_11 = f32x4_mul(self.twiddle4im, x11m12); let t_b9_1 = f32x4_mul(self.twiddle9im, x1m22); let t_b9_2 = f32x4_mul(self.twiddle5im, x2m21); let t_b9_3 = f32x4_mul(self.twiddle4im, x3m20); let t_b9_4 = f32x4_mul(self.twiddle10im, x4m19); let t_b9_5 = f32x4_mul(self.twiddle1im, x5m18); let t_b9_6 = f32x4_mul(self.twiddle8im, x6m17); let t_b9_7 = f32x4_mul(self.twiddle6im, x7m16); let t_b9_8 = f32x4_mul(self.twiddle3im, x8m15); let t_b9_9 = f32x4_mul(self.twiddle11im, x9m14); let t_b9_10 = f32x4_mul(self.twiddle2im, x10m13); let t_b9_11 = f32x4_mul(self.twiddle7im, x11m12); let t_b10_1 = f32x4_mul(self.twiddle10im, x1m22); let t_b10_2 = f32x4_mul(self.twiddle3im, x2m21); let t_b10_3 = f32x4_mul(self.twiddle7im, x3m20); let t_b10_4 = f32x4_mul(self.twiddle6im, x4m19); let t_b10_5 = f32x4_mul(self.twiddle4im, x5m18); let t_b10_6 = f32x4_mul(self.twiddle9im, x6m17); let t_b10_7 = f32x4_mul(self.twiddle1im, x7m16); let t_b10_8 = f32x4_mul(self.twiddle11im, x8m15); let t_b10_9 = f32x4_mul(self.twiddle2im, x9m14); let t_b10_10 = f32x4_mul(self.twiddle8im, x10m13); let t_b10_11 = f32x4_mul(self.twiddle5im, x11m12); let t_b11_1 = f32x4_mul(self.twiddle11im, x1m22); let t_b11_2 = f32x4_mul(self.twiddle1im, x2m21); let t_b11_3 = f32x4_mul(self.twiddle10im, x3m20); let t_b11_4 = f32x4_mul(self.twiddle2im, x4m19); let t_b11_5 = f32x4_mul(self.twiddle9im, x5m18); let t_b11_6 = f32x4_mul(self.twiddle3im, x6m17); let t_b11_7 = f32x4_mul(self.twiddle8im, x7m16); let t_b11_8 = f32x4_mul(self.twiddle4im, x8m15); let t_b11_9 = f32x4_mul(self.twiddle7im, x9m14); let t_b11_10 = f32x4_mul(self.twiddle5im, x10m13); let t_b11_11 = f32x4_mul(self.twiddle6im, x11m12); let x0 = values[0]; let t_a1 = calc_f32!( x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 ); let t_a2 = calc_f32!( x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 ); let t_a3 = calc_f32!( x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 ); let t_a4 = calc_f32!( x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 ); let t_a5 = calc_f32!( x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 ); let t_a6 = calc_f32!( x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 ); let t_a7 = calc_f32!( x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 ); let t_a8 = calc_f32!( x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 ); let t_a9 = calc_f32!( x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 ); let t_a10 = calc_f32!( x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 ); let t_a11 = calc_f32!( x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 ); let t_b1 = calc_f32!( t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 ); let t_b2 = calc_f32!( t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 ); let t_b3 = calc_f32!( t_b3_1 + 
t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 + t_b3_8 + t_b3_9 + t_b3_10 + t_b3_11 ); let t_b4 = calc_f32!( t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 - t_b4_5 + t_b4_6 + t_b4_7 + t_b4_8 - t_b4_9 - t_b4_10 - t_b4_11 ); let t_b5 = calc_f32!( t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 + t_b5_5 + t_b5_6 - t_b5_7 - t_b5_8 - t_b5_9 + t_b5_10 + t_b5_11 ); let t_b6 = calc_f32!( t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 + t_b6_5 - t_b6_6 - t_b6_7 + t_b6_8 + t_b6_9 - t_b6_10 - t_b6_11 ); let t_b7 = calc_f32!( t_b7_1 - t_b7_2 - t_b7_3 + t_b7_4 - t_b7_5 - t_b7_6 + t_b7_7 + t_b7_8 - t_b7_9 + t_b7_10 + t_b7_11 ); let t_b8 = calc_f32!( t_b8_1 - t_b8_2 + t_b8_3 + t_b8_4 - t_b8_5 + t_b8_6 + t_b8_7 - t_b8_8 + t_b8_9 + t_b8_10 - t_b8_11 ); let t_b9 = calc_f32!( t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 - t_b9_5 + t_b9_6 - t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11 ); let t_b10 = calc_f32!( t_b10_1 - t_b10_2 + t_b10_3 - t_b10_4 + t_b10_5 - t_b10_6 + t_b10_7 + t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11 ); let t_b11 = calc_f32!( t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 + t_b11_5 - t_b11_6 + t_b11_7 - t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11 ); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let t_b9_rot = self.rotate.rotate_both(t_b9); let t_b10_rot = self.rotate.rotate_both(t_b10); let t_b11_rot = self.rotate.rotate_both(t_b11); let y0 = calc_f32!( x0 + x1p22 + x2p21 + x3p20 + x4p19 + x5p18 + x6p17 + x7p16 + x8p15 + x9p14 + x10p13 + x11p12 ); let [y1, y22] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y21] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y20] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y19] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y18] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y17] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); let [y7, y16] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot); let [y8, y15] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot); let [y9, y14] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot); let [y10, y13] = parallel_fft2_interleaved_f32(t_a10, t_b10_rot); let [y11, y12] = parallel_fft2_interleaved_f32(t_a11, t_b11_rot); [ y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, ] } } // ____ _____ __ _ _ _ _ _ // |___ \|___ / / /_ | || | | |__ (_) |_ // __) | |_ \ _____ | '_ \| || |_| '_ \| | __| // / __/ ___) | |_____| | (_) |__ _| |_) | | |_ // |_____|____/ \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly23 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, twiddle6re: v128, twiddle6im: v128, twiddle7re: v128, twiddle7im: v128, twiddle8re: v128, twiddle8im: v128, twiddle9re: v128, twiddle9im: v128, twiddle10re: v128, twiddle10im: v128, twiddle11re: v128, twiddle11im: v128, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly23, 23, |this: &WasmSimdF64Butterfly23<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly23, 23, |this: &WasmSimdF64Butterfly23<_>| this.direction ); impl WasmSimdF64Butterfly23 { 
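    // f64 variant of the size-23 butterfly. A v128 register holds a single
    // Complex<f64>, so this version processes one FFT at a time using solo_fft2_f64
    // and Rotate90F64::rotate, whereas the f32 variant above packs two Complex<f32>
    // values per register and uses the parallel helpers. In both cases the kernel
    // evaluates y[k] = sum over n of x[n] * exp(+/- 2*pi*i*n*k / 23), with the sign
    // of the exponent following `direction` and the twiddle factors splatted into
    // the twiddleNre / twiddleNim fields at construction time.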
#[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 23, direction); let tw2: Complex = twiddles::compute_twiddle(2, 23, direction); let tw3: Complex = twiddles::compute_twiddle(3, 23, direction); let tw4: Complex = twiddles::compute_twiddle(4, 23, direction); let tw5: Complex = twiddles::compute_twiddle(5, 23, direction); let tw6: Complex = twiddles::compute_twiddle(6, 23, direction); let tw7: Complex = twiddles::compute_twiddle(7, 23, direction); let tw8: Complex = twiddles::compute_twiddle(8, 23, direction); let tw9: Complex = twiddles::compute_twiddle(9, 23, direction); let tw10: Complex = twiddles::compute_twiddle(10, 23, direction); let tw11: Complex = twiddles::compute_twiddle(11, 23, direction); let twiddle1re = f64x2_splat(tw1.re); let twiddle1im = f64x2_splat(tw1.im); let twiddle2re = f64x2_splat(tw2.re); let twiddle2im = f64x2_splat(tw2.im); let twiddle3re = f64x2_splat(tw3.re); let twiddle3im = f64x2_splat(tw3.im); let twiddle4re = f64x2_splat(tw4.re); let twiddle4im = f64x2_splat(tw4.im); let twiddle5re = f64x2_splat(tw5.re); let twiddle5im = f64x2_splat(tw5.im); let twiddle6re = f64x2_splat(tw6.re); let twiddle6im = f64x2_splat(tw6.im); let twiddle7re = f64x2_splat(tw7.re); let twiddle7im = f64x2_splat(tw7.im); let twiddle8re = f64x2_splat(tw8.re); let twiddle8im = f64x2_splat(tw8.im); let twiddle9re = f64x2_splat(tw9.re); let twiddle9im = f64x2_splat(tw9.im); let twiddle10re = f64x2_splat(tw10.re); let twiddle10im = f64x2_splat(tw10.im); let twiddle11re = f64x2_splat(tw11.re); let twiddle11im = f64x2_splat(tw11.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 23]) -> [v128; 23] { let [x1p22, x1m22] = solo_fft2_f64(values[1], values[22]); let [x2p21, x2m21] = solo_fft2_f64(values[2], values[21]); let [x3p20, x3m20] = solo_fft2_f64(values[3], values[20]); let [x4p19, x4m19] = solo_fft2_f64(values[4], values[19]); let [x5p18, x5m18] = solo_fft2_f64(values[5], values[18]); let [x6p17, x6m17] = solo_fft2_f64(values[6], values[17]); let [x7p16, x7m16] = solo_fft2_f64(values[7], values[16]); let [x8p15, x8m15] = solo_fft2_f64(values[8], values[15]); let [x9p14, x9m14] = solo_fft2_f64(values[9], values[14]); let [x10p13, x10m13] = solo_fft2_f64(values[10], values[13]); let [x11p12, x11m12] = solo_fft2_f64(values[11], values[12]); let t_a1_1 = f64x2_mul(self.twiddle1re, x1p22); let t_a1_2 = f64x2_mul(self.twiddle2re, x2p21); let t_a1_3 = f64x2_mul(self.twiddle3re, x3p20); let t_a1_4 = f64x2_mul(self.twiddle4re, x4p19); let t_a1_5 = f64x2_mul(self.twiddle5re, x5p18); let t_a1_6 = f64x2_mul(self.twiddle6re, x6p17); let t_a1_7 = f64x2_mul(self.twiddle7re, x7p16); let t_a1_8 = f64x2_mul(self.twiddle8re, x8p15); let 
t_a1_9 = f64x2_mul(self.twiddle9re, x9p14); let t_a1_10 = f64x2_mul(self.twiddle10re, x10p13); let t_a1_11 = f64x2_mul(self.twiddle11re, x11p12); let t_a2_1 = f64x2_mul(self.twiddle2re, x1p22); let t_a2_2 = f64x2_mul(self.twiddle4re, x2p21); let t_a2_3 = f64x2_mul(self.twiddle6re, x3p20); let t_a2_4 = f64x2_mul(self.twiddle8re, x4p19); let t_a2_5 = f64x2_mul(self.twiddle10re, x5p18); let t_a2_6 = f64x2_mul(self.twiddle11re, x6p17); let t_a2_7 = f64x2_mul(self.twiddle9re, x7p16); let t_a2_8 = f64x2_mul(self.twiddle7re, x8p15); let t_a2_9 = f64x2_mul(self.twiddle5re, x9p14); let t_a2_10 = f64x2_mul(self.twiddle3re, x10p13); let t_a2_11 = f64x2_mul(self.twiddle1re, x11p12); let t_a3_1 = f64x2_mul(self.twiddle3re, x1p22); let t_a3_2 = f64x2_mul(self.twiddle6re, x2p21); let t_a3_3 = f64x2_mul(self.twiddle9re, x3p20); let t_a3_4 = f64x2_mul(self.twiddle11re, x4p19); let t_a3_5 = f64x2_mul(self.twiddle8re, x5p18); let t_a3_6 = f64x2_mul(self.twiddle5re, x6p17); let t_a3_7 = f64x2_mul(self.twiddle2re, x7p16); let t_a3_8 = f64x2_mul(self.twiddle1re, x8p15); let t_a3_9 = f64x2_mul(self.twiddle4re, x9p14); let t_a3_10 = f64x2_mul(self.twiddle7re, x10p13); let t_a3_11 = f64x2_mul(self.twiddle10re, x11p12); let t_a4_1 = f64x2_mul(self.twiddle4re, x1p22); let t_a4_2 = f64x2_mul(self.twiddle8re, x2p21); let t_a4_3 = f64x2_mul(self.twiddle11re, x3p20); let t_a4_4 = f64x2_mul(self.twiddle7re, x4p19); let t_a4_5 = f64x2_mul(self.twiddle3re, x5p18); let t_a4_6 = f64x2_mul(self.twiddle1re, x6p17); let t_a4_7 = f64x2_mul(self.twiddle5re, x7p16); let t_a4_8 = f64x2_mul(self.twiddle9re, x8p15); let t_a4_9 = f64x2_mul(self.twiddle10re, x9p14); let t_a4_10 = f64x2_mul(self.twiddle6re, x10p13); let t_a4_11 = f64x2_mul(self.twiddle2re, x11p12); let t_a5_1 = f64x2_mul(self.twiddle5re, x1p22); let t_a5_2 = f64x2_mul(self.twiddle10re, x2p21); let t_a5_3 = f64x2_mul(self.twiddle8re, x3p20); let t_a5_4 = f64x2_mul(self.twiddle3re, x4p19); let t_a5_5 = f64x2_mul(self.twiddle2re, x5p18); let t_a5_6 = f64x2_mul(self.twiddle7re, x6p17); let t_a5_7 = f64x2_mul(self.twiddle11re, x7p16); let t_a5_8 = f64x2_mul(self.twiddle6re, x8p15); let t_a5_9 = f64x2_mul(self.twiddle1re, x9p14); let t_a5_10 = f64x2_mul(self.twiddle4re, x10p13); let t_a5_11 = f64x2_mul(self.twiddle9re, x11p12); let t_a6_1 = f64x2_mul(self.twiddle6re, x1p22); let t_a6_2 = f64x2_mul(self.twiddle11re, x2p21); let t_a6_3 = f64x2_mul(self.twiddle5re, x3p20); let t_a6_4 = f64x2_mul(self.twiddle1re, x4p19); let t_a6_5 = f64x2_mul(self.twiddle7re, x5p18); let t_a6_6 = f64x2_mul(self.twiddle10re, x6p17); let t_a6_7 = f64x2_mul(self.twiddle4re, x7p16); let t_a6_8 = f64x2_mul(self.twiddle2re, x8p15); let t_a6_9 = f64x2_mul(self.twiddle8re, x9p14); let t_a6_10 = f64x2_mul(self.twiddle9re, x10p13); let t_a6_11 = f64x2_mul(self.twiddle3re, x11p12); let t_a7_1 = f64x2_mul(self.twiddle7re, x1p22); let t_a7_2 = f64x2_mul(self.twiddle9re, x2p21); let t_a7_3 = f64x2_mul(self.twiddle2re, x3p20); let t_a7_4 = f64x2_mul(self.twiddle5re, x4p19); let t_a7_5 = f64x2_mul(self.twiddle11re, x5p18); let t_a7_6 = f64x2_mul(self.twiddle4re, x6p17); let t_a7_7 = f64x2_mul(self.twiddle3re, x7p16); let t_a7_8 = f64x2_mul(self.twiddle10re, x8p15); let t_a7_9 = f64x2_mul(self.twiddle6re, x9p14); let t_a7_10 = f64x2_mul(self.twiddle1re, x10p13); let t_a7_11 = f64x2_mul(self.twiddle8re, x11p12); let t_a8_1 = f64x2_mul(self.twiddle8re, x1p22); let t_a8_2 = f64x2_mul(self.twiddle7re, x2p21); let t_a8_3 = f64x2_mul(self.twiddle1re, x3p20); let t_a8_4 = f64x2_mul(self.twiddle9re, x4p19); let t_a8_5 = 
f64x2_mul(self.twiddle6re, x5p18); let t_a8_6 = f64x2_mul(self.twiddle2re, x6p17); let t_a8_7 = f64x2_mul(self.twiddle10re, x7p16); let t_a8_8 = f64x2_mul(self.twiddle5re, x8p15); let t_a8_9 = f64x2_mul(self.twiddle3re, x9p14); let t_a8_10 = f64x2_mul(self.twiddle11re, x10p13); let t_a8_11 = f64x2_mul(self.twiddle4re, x11p12); let t_a9_1 = f64x2_mul(self.twiddle9re, x1p22); let t_a9_2 = f64x2_mul(self.twiddle5re, x2p21); let t_a9_3 = f64x2_mul(self.twiddle4re, x3p20); let t_a9_4 = f64x2_mul(self.twiddle10re, x4p19); let t_a9_5 = f64x2_mul(self.twiddle1re, x5p18); let t_a9_6 = f64x2_mul(self.twiddle8re, x6p17); let t_a9_7 = f64x2_mul(self.twiddle6re, x7p16); let t_a9_8 = f64x2_mul(self.twiddle3re, x8p15); let t_a9_9 = f64x2_mul(self.twiddle11re, x9p14); let t_a9_10 = f64x2_mul(self.twiddle2re, x10p13); let t_a9_11 = f64x2_mul(self.twiddle7re, x11p12); let t_a10_1 = f64x2_mul(self.twiddle10re, x1p22); let t_a10_2 = f64x2_mul(self.twiddle3re, x2p21); let t_a10_3 = f64x2_mul(self.twiddle7re, x3p20); let t_a10_4 = f64x2_mul(self.twiddle6re, x4p19); let t_a10_5 = f64x2_mul(self.twiddle4re, x5p18); let t_a10_6 = f64x2_mul(self.twiddle9re, x6p17); let t_a10_7 = f64x2_mul(self.twiddle1re, x7p16); let t_a10_8 = f64x2_mul(self.twiddle11re, x8p15); let t_a10_9 = f64x2_mul(self.twiddle2re, x9p14); let t_a10_10 = f64x2_mul(self.twiddle8re, x10p13); let t_a10_11 = f64x2_mul(self.twiddle5re, x11p12); let t_a11_1 = f64x2_mul(self.twiddle11re, x1p22); let t_a11_2 = f64x2_mul(self.twiddle1re, x2p21); let t_a11_3 = f64x2_mul(self.twiddle10re, x3p20); let t_a11_4 = f64x2_mul(self.twiddle2re, x4p19); let t_a11_5 = f64x2_mul(self.twiddle9re, x5p18); let t_a11_6 = f64x2_mul(self.twiddle3re, x6p17); let t_a11_7 = f64x2_mul(self.twiddle8re, x7p16); let t_a11_8 = f64x2_mul(self.twiddle4re, x8p15); let t_a11_9 = f64x2_mul(self.twiddle7re, x9p14); let t_a11_10 = f64x2_mul(self.twiddle5re, x10p13); let t_a11_11 = f64x2_mul(self.twiddle6re, x11p12); let t_b1_1 = f64x2_mul(self.twiddle1im, x1m22); let t_b1_2 = f64x2_mul(self.twiddle2im, x2m21); let t_b1_3 = f64x2_mul(self.twiddle3im, x3m20); let t_b1_4 = f64x2_mul(self.twiddle4im, x4m19); let t_b1_5 = f64x2_mul(self.twiddle5im, x5m18); let t_b1_6 = f64x2_mul(self.twiddle6im, x6m17); let t_b1_7 = f64x2_mul(self.twiddle7im, x7m16); let t_b1_8 = f64x2_mul(self.twiddle8im, x8m15); let t_b1_9 = f64x2_mul(self.twiddle9im, x9m14); let t_b1_10 = f64x2_mul(self.twiddle10im, x10m13); let t_b1_11 = f64x2_mul(self.twiddle11im, x11m12); let t_b2_1 = f64x2_mul(self.twiddle2im, x1m22); let t_b2_2 = f64x2_mul(self.twiddle4im, x2m21); let t_b2_3 = f64x2_mul(self.twiddle6im, x3m20); let t_b2_4 = f64x2_mul(self.twiddle8im, x4m19); let t_b2_5 = f64x2_mul(self.twiddle10im, x5m18); let t_b2_6 = f64x2_mul(self.twiddle11im, x6m17); let t_b2_7 = f64x2_mul(self.twiddle9im, x7m16); let t_b2_8 = f64x2_mul(self.twiddle7im, x8m15); let t_b2_9 = f64x2_mul(self.twiddle5im, x9m14); let t_b2_10 = f64x2_mul(self.twiddle3im, x10m13); let t_b2_11 = f64x2_mul(self.twiddle1im, x11m12); let t_b3_1 = f64x2_mul(self.twiddle3im, x1m22); let t_b3_2 = f64x2_mul(self.twiddle6im, x2m21); let t_b3_3 = f64x2_mul(self.twiddle9im, x3m20); let t_b3_4 = f64x2_mul(self.twiddle11im, x4m19); let t_b3_5 = f64x2_mul(self.twiddle8im, x5m18); let t_b3_6 = f64x2_mul(self.twiddle5im, x6m17); let t_b3_7 = f64x2_mul(self.twiddle2im, x7m16); let t_b3_8 = f64x2_mul(self.twiddle1im, x8m15); let t_b3_9 = f64x2_mul(self.twiddle4im, x9m14); let t_b3_10 = f64x2_mul(self.twiddle7im, x10m13); let t_b3_11 = f64x2_mul(self.twiddle10im, x11m12); 
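        // The twiddle index used for t_aR_k / t_bR_k is R * k reduced mod 23 and
        // folded into the range 1..=11, since twiddle m and twiddle 23 - m share
        // the same real part. That folding negates the corresponding sine term,
        // which is why the calc_f64! sums for t_b2 through t_b11 further down mix
        // + and - signs while every t_aR sum uses only +.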
let t_b4_1 = f64x2_mul(self.twiddle4im, x1m22); let t_b4_2 = f64x2_mul(self.twiddle8im, x2m21); let t_b4_3 = f64x2_mul(self.twiddle11im, x3m20); let t_b4_4 = f64x2_mul(self.twiddle7im, x4m19); let t_b4_5 = f64x2_mul(self.twiddle3im, x5m18); let t_b4_6 = f64x2_mul(self.twiddle1im, x6m17); let t_b4_7 = f64x2_mul(self.twiddle5im, x7m16); let t_b4_8 = f64x2_mul(self.twiddle9im, x8m15); let t_b4_9 = f64x2_mul(self.twiddle10im, x9m14); let t_b4_10 = f64x2_mul(self.twiddle6im, x10m13); let t_b4_11 = f64x2_mul(self.twiddle2im, x11m12); let t_b5_1 = f64x2_mul(self.twiddle5im, x1m22); let t_b5_2 = f64x2_mul(self.twiddle10im, x2m21); let t_b5_3 = f64x2_mul(self.twiddle8im, x3m20); let t_b5_4 = f64x2_mul(self.twiddle3im, x4m19); let t_b5_5 = f64x2_mul(self.twiddle2im, x5m18); let t_b5_6 = f64x2_mul(self.twiddle7im, x6m17); let t_b5_7 = f64x2_mul(self.twiddle11im, x7m16); let t_b5_8 = f64x2_mul(self.twiddle6im, x8m15); let t_b5_9 = f64x2_mul(self.twiddle1im, x9m14); let t_b5_10 = f64x2_mul(self.twiddle4im, x10m13); let t_b5_11 = f64x2_mul(self.twiddle9im, x11m12); let t_b6_1 = f64x2_mul(self.twiddle6im, x1m22); let t_b6_2 = f64x2_mul(self.twiddle11im, x2m21); let t_b6_3 = f64x2_mul(self.twiddle5im, x3m20); let t_b6_4 = f64x2_mul(self.twiddle1im, x4m19); let t_b6_5 = f64x2_mul(self.twiddle7im, x5m18); let t_b6_6 = f64x2_mul(self.twiddle10im, x6m17); let t_b6_7 = f64x2_mul(self.twiddle4im, x7m16); let t_b6_8 = f64x2_mul(self.twiddle2im, x8m15); let t_b6_9 = f64x2_mul(self.twiddle8im, x9m14); let t_b6_10 = f64x2_mul(self.twiddle9im, x10m13); let t_b6_11 = f64x2_mul(self.twiddle3im, x11m12); let t_b7_1 = f64x2_mul(self.twiddle7im, x1m22); let t_b7_2 = f64x2_mul(self.twiddle9im, x2m21); let t_b7_3 = f64x2_mul(self.twiddle2im, x3m20); let t_b7_4 = f64x2_mul(self.twiddle5im, x4m19); let t_b7_5 = f64x2_mul(self.twiddle11im, x5m18); let t_b7_6 = f64x2_mul(self.twiddle4im, x6m17); let t_b7_7 = f64x2_mul(self.twiddle3im, x7m16); let t_b7_8 = f64x2_mul(self.twiddle10im, x8m15); let t_b7_9 = f64x2_mul(self.twiddle6im, x9m14); let t_b7_10 = f64x2_mul(self.twiddle1im, x10m13); let t_b7_11 = f64x2_mul(self.twiddle8im, x11m12); let t_b8_1 = f64x2_mul(self.twiddle8im, x1m22); let t_b8_2 = f64x2_mul(self.twiddle7im, x2m21); let t_b8_3 = f64x2_mul(self.twiddle1im, x3m20); let t_b8_4 = f64x2_mul(self.twiddle9im, x4m19); let t_b8_5 = f64x2_mul(self.twiddle6im, x5m18); let t_b8_6 = f64x2_mul(self.twiddle2im, x6m17); let t_b8_7 = f64x2_mul(self.twiddle10im, x7m16); let t_b8_8 = f64x2_mul(self.twiddle5im, x8m15); let t_b8_9 = f64x2_mul(self.twiddle3im, x9m14); let t_b8_10 = f64x2_mul(self.twiddle11im, x10m13); let t_b8_11 = f64x2_mul(self.twiddle4im, x11m12); let t_b9_1 = f64x2_mul(self.twiddle9im, x1m22); let t_b9_2 = f64x2_mul(self.twiddle5im, x2m21); let t_b9_3 = f64x2_mul(self.twiddle4im, x3m20); let t_b9_4 = f64x2_mul(self.twiddle10im, x4m19); let t_b9_5 = f64x2_mul(self.twiddle1im, x5m18); let t_b9_6 = f64x2_mul(self.twiddle8im, x6m17); let t_b9_7 = f64x2_mul(self.twiddle6im, x7m16); let t_b9_8 = f64x2_mul(self.twiddle3im, x8m15); let t_b9_9 = f64x2_mul(self.twiddle11im, x9m14); let t_b9_10 = f64x2_mul(self.twiddle2im, x10m13); let t_b9_11 = f64x2_mul(self.twiddle7im, x11m12); let t_b10_1 = f64x2_mul(self.twiddle10im, x1m22); let t_b10_2 = f64x2_mul(self.twiddle3im, x2m21); let t_b10_3 = f64x2_mul(self.twiddle7im, x3m20); let t_b10_4 = f64x2_mul(self.twiddle6im, x4m19); let t_b10_5 = f64x2_mul(self.twiddle4im, x5m18); let t_b10_6 = f64x2_mul(self.twiddle9im, x6m17); let t_b10_7 = f64x2_mul(self.twiddle1im, x7m16); let 
t_b10_8 = f64x2_mul(self.twiddle11im, x8m15); let t_b10_9 = f64x2_mul(self.twiddle2im, x9m14); let t_b10_10 = f64x2_mul(self.twiddle8im, x10m13); let t_b10_11 = f64x2_mul(self.twiddle5im, x11m12); let t_b11_1 = f64x2_mul(self.twiddle11im, x1m22); let t_b11_2 = f64x2_mul(self.twiddle1im, x2m21); let t_b11_3 = f64x2_mul(self.twiddle10im, x3m20); let t_b11_4 = f64x2_mul(self.twiddle2im, x4m19); let t_b11_5 = f64x2_mul(self.twiddle9im, x5m18); let t_b11_6 = f64x2_mul(self.twiddle3im, x6m17); let t_b11_7 = f64x2_mul(self.twiddle8im, x7m16); let t_b11_8 = f64x2_mul(self.twiddle4im, x8m15); let t_b11_9 = f64x2_mul(self.twiddle7im, x9m14); let t_b11_10 = f64x2_mul(self.twiddle5im, x10m13); let t_b11_11 = f64x2_mul(self.twiddle6im, x11m12); let x0 = values[0]; let t_a1 = calc_f64!( x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 ); let t_a2 = calc_f64!( x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 ); let t_a3 = calc_f64!( x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 ); let t_a4 = calc_f64!( x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 ); let t_a5 = calc_f64!( x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 ); let t_a6 = calc_f64!( x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 ); let t_a7 = calc_f64!( x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 ); let t_a8 = calc_f64!( x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 ); let t_a9 = calc_f64!( x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 ); let t_a10 = calc_f64!( x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 ); let t_a11 = calc_f64!( x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 ); let t_b1 = calc_f64!( t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 ); let t_b2 = calc_f64!( t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 - t_b2_6 - t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 ); let t_b3 = calc_f64!( t_b3_1 + t_b3_2 + t_b3_3 - t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 + t_b3_8 + t_b3_9 + t_b3_10 + t_b3_11 ); let t_b4 = calc_f64!( t_b4_1 + t_b4_2 - t_b4_3 - t_b4_4 - t_b4_5 + t_b4_6 + t_b4_7 + t_b4_8 - t_b4_9 - t_b4_10 - t_b4_11 ); let t_b5 = calc_f64!( t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 + t_b5_5 + t_b5_6 - t_b5_7 - t_b5_8 - t_b5_9 + t_b5_10 + t_b5_11 ); let t_b6 = calc_f64!( t_b6_1 - t_b6_2 - t_b6_3 + t_b6_4 + t_b6_5 - t_b6_6 - t_b6_7 + t_b6_8 + t_b6_9 - t_b6_10 - t_b6_11 ); let t_b7 = calc_f64!( t_b7_1 - t_b7_2 - t_b7_3 + t_b7_4 - t_b7_5 - t_b7_6 + t_b7_7 + t_b7_8 - t_b7_9 + t_b7_10 + t_b7_11 ); let t_b8 = calc_f64!( t_b8_1 - t_b8_2 + t_b8_3 + t_b8_4 - t_b8_5 + t_b8_6 + t_b8_7 - t_b8_8 + t_b8_9 + t_b8_10 - t_b8_11 ); let t_b9 = calc_f64!( t_b9_1 - t_b9_2 + t_b9_3 - t_b9_4 - t_b9_5 + t_b9_6 - t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11 ); let t_b10 = calc_f64!( t_b10_1 - t_b10_2 + t_b10_3 - t_b10_4 + t_b10_5 - t_b10_6 + t_b10_7 + t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11 ); let t_b11 = calc_f64!( t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 + t_b11_5 - t_b11_6 + t_b11_7 - 
t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11 ); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let t_b10_rot = self.rotate.rotate(t_b10); let t_b11_rot = self.rotate.rotate(t_b11); let y0 = calc_f64!( x0 + x1p22 + x2p21 + x3p20 + x4p19 + x5p18 + x6p17 + x7p16 + x8p15 + x9p14 + x10p13 + x11p12 ); let [y1, y22] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y21] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y20] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y19] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y18] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y17] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y16] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y15] = solo_fft2_f64(t_a8, t_b8_rot); let [y9, y14] = solo_fft2_f64(t_a9, t_b9_rot); let [y10, y13] = solo_fft2_f64(t_a10, t_b10_rot); let [y11, y12] = solo_fft2_f64(t_a11, t_b11_rot); [ y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, ] } } // ____ ___ _________ _ _ _ // |___ \ / _ \ |___ /___ \| |__ (_) |_ // __) | (_) | _____ |_ \ __) | '_ \| | __| // / __/ \__, | |_____| ___) / __/| |_) | | |_ // |_____| /_/ |____/_____|_.__/|_|\__| // pub struct WasmSimdF32Butterfly29 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F32, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, twiddle6re: v128, twiddle6im: v128, twiddle7re: v128, twiddle7im: v128, twiddle8re: v128, twiddle8im: v128, twiddle9re: v128, twiddle9im: v128, twiddle10re: v128, twiddle10im: v128, twiddle11re: v128, twiddle11im: v128, twiddle12re: v128, twiddle12im: v128, twiddle13re: v128, twiddle13im: v128, twiddle14re: v128, twiddle14im: v128, } boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly29, 29, |this: &WasmSimdF32Butterfly29<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly29, 29, |this: &WasmSimdF32Butterfly29<_>| this.direction ); impl WasmSimdF32Butterfly29 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::(); let rotate = Rotate90F32::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 29, direction); let tw2: Complex = twiddles::compute_twiddle(2, 29, direction); let tw3: Complex = twiddles::compute_twiddle(3, 29, direction); let tw4: Complex = twiddles::compute_twiddle(4, 29, direction); let tw5: Complex = twiddles::compute_twiddle(5, 29, direction); let tw6: Complex = twiddles::compute_twiddle(6, 29, direction); let tw7: Complex = twiddles::compute_twiddle(7, 29, direction); let tw8: Complex = twiddles::compute_twiddle(8, 29, direction); let tw9: Complex = twiddles::compute_twiddle(9, 29, direction); let tw10: Complex = twiddles::compute_twiddle(10, 29, direction); let tw11: Complex = twiddles::compute_twiddle(11, 29, direction); let tw12: Complex = twiddles::compute_twiddle(12, 29, direction); let tw13: Complex = twiddles::compute_twiddle(13, 29, direction); let tw14: Complex = twiddles::compute_twiddle(14, 29, direction); let twiddle1re = f32x4_splat(tw1.re); let twiddle1im = f32x4_splat(tw1.im); let twiddle2re = f32x4_splat(tw2.re); let twiddle2im = f32x4_splat(tw2.im); let twiddle3re 
= f32x4_splat(tw3.re); let twiddle3im = f32x4_splat(tw3.im); let twiddle4re = f32x4_splat(tw4.re); let twiddle4im = f32x4_splat(tw4.im); let twiddle5re = f32x4_splat(tw5.re); let twiddle5im = f32x4_splat(tw5.im); let twiddle6re = f32x4_splat(tw6.re); let twiddle6im = f32x4_splat(tw6.im); let twiddle7re = f32x4_splat(tw7.re); let twiddle7im = f32x4_splat(tw7.im); let twiddle8re = f32x4_splat(tw8.re); let twiddle8im = f32x4_splat(tw8.im); let twiddle9re = f32x4_splat(tw9.re); let twiddle9im = f32x4_splat(tw9.im); let twiddle10re = f32x4_splat(tw10.re); let twiddle10im = f32x4_splat(tw10.im); let twiddle11re = f32x4_splat(tw11.re); let twiddle11im = f32x4_splat(tw11.im); let twiddle12re = f32x4_splat(tw12.re); let twiddle12im = f32x4_splat(tw12.im); let twiddle13re = f32x4_splat(tw13.re); let twiddle13im = f32x4_splat(tw13.im); let twiddle14re = f32x4_splat(tw14.re); let twiddle14im = f32x4_splat(tw14.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, twiddle12re, twiddle12im, twiddle13re, twiddle13im, twiddle14re, twiddle14im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[14]), extract_hi_lo_f32(input_packed[0], input_packed[15]), extract_lo_hi_f32(input_packed[1], input_packed[15]), extract_hi_lo_f32(input_packed[1], input_packed[16]), extract_lo_hi_f32(input_packed[2], input_packed[16]), extract_hi_lo_f32(input_packed[2], input_packed[17]), extract_lo_hi_f32(input_packed[3], input_packed[17]), extract_hi_lo_f32(input_packed[3], input_packed[18]), extract_lo_hi_f32(input_packed[4], input_packed[18]), extract_hi_lo_f32(input_packed[4], input_packed[19]), extract_lo_hi_f32(input_packed[5], input_packed[19]), extract_hi_lo_f32(input_packed[5], input_packed[20]), extract_lo_hi_f32(input_packed[6], input_packed[20]), extract_hi_lo_f32(input_packed[6], input_packed[21]), extract_lo_hi_f32(input_packed[7], input_packed[21]), extract_hi_lo_f32(input_packed[7], input_packed[22]), extract_lo_hi_f32(input_packed[8], input_packed[22]), extract_hi_lo_f32(input_packed[8], input_packed[23]), extract_lo_hi_f32(input_packed[9], input_packed[23]), extract_hi_lo_f32(input_packed[9], input_packed[24]), extract_lo_hi_f32(input_packed[10], input_packed[24]), extract_hi_lo_f32(input_packed[10], input_packed[25]), extract_lo_hi_f32(input_packed[11], input_packed[25]), extract_hi_lo_f32(input_packed[11], input_packed[26]), extract_lo_hi_f32(input_packed[12], input_packed[26]), extract_hi_lo_f32(input_packed[12], input_packed[27]), extract_lo_hi_f32(input_packed[13], 
input_packed[27]), extract_hi_lo_f32(input_packed[13], input_packed[28]), extract_lo_hi_f32(input_packed[14], input_packed[28]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_lo_f32(out[16], out[17]), extract_lo_lo_f32(out[18], out[19]), extract_lo_lo_f32(out[20], out[21]), extract_lo_lo_f32(out[22], out[23]), extract_lo_lo_f32(out[24], out[25]), extract_lo_lo_f32(out[26], out[27]), extract_lo_hi_f32(out[28], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), extract_hi_hi_f32(out[19], out[20]), extract_hi_hi_f32(out[21], out[22]), extract_hi_hi_f32(out[23], out[24]), extract_hi_hi_f32(out[25], out[26]), extract_hi_hi_f32(out[27], out[28]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 29]) -> [v128; 29] { let [x1p28, x1m28] = parallel_fft2_interleaved_f32(values[1], values[28]); let [x2p27, x2m27] = parallel_fft2_interleaved_f32(values[2], values[27]); let [x3p26, x3m26] = parallel_fft2_interleaved_f32(values[3], values[26]); let [x4p25, x4m25] = parallel_fft2_interleaved_f32(values[4], values[25]); let [x5p24, x5m24] = parallel_fft2_interleaved_f32(values[5], values[24]); let [x6p23, x6m23] = parallel_fft2_interleaved_f32(values[6], values[23]); let [x7p22, x7m22] = parallel_fft2_interleaved_f32(values[7], values[22]); let [x8p21, x8m21] = parallel_fft2_interleaved_f32(values[8], values[21]); let [x9p20, x9m20] = parallel_fft2_interleaved_f32(values[9], values[20]); let [x10p19, x10m19] = parallel_fft2_interleaved_f32(values[10], values[19]); let [x11p18, x11m18] = parallel_fft2_interleaved_f32(values[11], values[18]); let [x12p17, x12m17] = parallel_fft2_interleaved_f32(values[12], values[17]); let [x13p16, x13m16] = parallel_fft2_interleaved_f32(values[13], values[16]); let [x14p15, x14m15] = parallel_fft2_interleaved_f32(values[14], values[15]); let t_a1_1 = f32x4_mul(self.twiddle1re, x1p28); let t_a1_2 = f32x4_mul(self.twiddle2re, x2p27); let t_a1_3 = f32x4_mul(self.twiddle3re, x3p26); let t_a1_4 = f32x4_mul(self.twiddle4re, x4p25); let t_a1_5 = f32x4_mul(self.twiddle5re, x5p24); let t_a1_6 = f32x4_mul(self.twiddle6re, x6p23); let t_a1_7 = f32x4_mul(self.twiddle7re, x7p22); let t_a1_8 = f32x4_mul(self.twiddle8re, x8p21); let t_a1_9 = f32x4_mul(self.twiddle9re, x9p20); let t_a1_10 = f32x4_mul(self.twiddle10re, x10p19); let t_a1_11 = f32x4_mul(self.twiddle11re, x11p18); let t_a1_12 = f32x4_mul(self.twiddle12re, x12p17); let t_a1_13 = f32x4_mul(self.twiddle13re, x13p16); let t_a1_14 = f32x4_mul(self.twiddle14re, x14p15); let t_a2_1 = f32x4_mul(self.twiddle2re, x1p28); let t_a2_2 = f32x4_mul(self.twiddle4re, x2p27); let t_a2_3 = f32x4_mul(self.twiddle6re, x3p26); let t_a2_4 = f32x4_mul(self.twiddle8re, x4p25); let t_a2_5 = f32x4_mul(self.twiddle10re, x5p24); let t_a2_6 = 
f32x4_mul(self.twiddle12re, x6p23); let t_a2_7 = f32x4_mul(self.twiddle14re, x7p22); let t_a2_8 = f32x4_mul(self.twiddle13re, x8p21); let t_a2_9 = f32x4_mul(self.twiddle11re, x9p20); let t_a2_10 = f32x4_mul(self.twiddle9re, x10p19); let t_a2_11 = f32x4_mul(self.twiddle7re, x11p18); let t_a2_12 = f32x4_mul(self.twiddle5re, x12p17); let t_a2_13 = f32x4_mul(self.twiddle3re, x13p16); let t_a2_14 = f32x4_mul(self.twiddle1re, x14p15); let t_a3_1 = f32x4_mul(self.twiddle3re, x1p28); let t_a3_2 = f32x4_mul(self.twiddle6re, x2p27); let t_a3_3 = f32x4_mul(self.twiddle9re, x3p26); let t_a3_4 = f32x4_mul(self.twiddle12re, x4p25); let t_a3_5 = f32x4_mul(self.twiddle14re, x5p24); let t_a3_6 = f32x4_mul(self.twiddle11re, x6p23); let t_a3_7 = f32x4_mul(self.twiddle8re, x7p22); let t_a3_8 = f32x4_mul(self.twiddle5re, x8p21); let t_a3_9 = f32x4_mul(self.twiddle2re, x9p20); let t_a3_10 = f32x4_mul(self.twiddle1re, x10p19); let t_a3_11 = f32x4_mul(self.twiddle4re, x11p18); let t_a3_12 = f32x4_mul(self.twiddle7re, x12p17); let t_a3_13 = f32x4_mul(self.twiddle10re, x13p16); let t_a3_14 = f32x4_mul(self.twiddle13re, x14p15); let t_a4_1 = f32x4_mul(self.twiddle4re, x1p28); let t_a4_2 = f32x4_mul(self.twiddle8re, x2p27); let t_a4_3 = f32x4_mul(self.twiddle12re, x3p26); let t_a4_4 = f32x4_mul(self.twiddle13re, x4p25); let t_a4_5 = f32x4_mul(self.twiddle9re, x5p24); let t_a4_6 = f32x4_mul(self.twiddle5re, x6p23); let t_a4_7 = f32x4_mul(self.twiddle1re, x7p22); let t_a4_8 = f32x4_mul(self.twiddle3re, x8p21); let t_a4_9 = f32x4_mul(self.twiddle7re, x9p20); let t_a4_10 = f32x4_mul(self.twiddle11re, x10p19); let t_a4_11 = f32x4_mul(self.twiddle14re, x11p18); let t_a4_12 = f32x4_mul(self.twiddle10re, x12p17); let t_a4_13 = f32x4_mul(self.twiddle6re, x13p16); let t_a4_14 = f32x4_mul(self.twiddle2re, x14p15); let t_a5_1 = f32x4_mul(self.twiddle5re, x1p28); let t_a5_2 = f32x4_mul(self.twiddle10re, x2p27); let t_a5_3 = f32x4_mul(self.twiddle14re, x3p26); let t_a5_4 = f32x4_mul(self.twiddle9re, x4p25); let t_a5_5 = f32x4_mul(self.twiddle4re, x5p24); let t_a5_6 = f32x4_mul(self.twiddle1re, x6p23); let t_a5_7 = f32x4_mul(self.twiddle6re, x7p22); let t_a5_8 = f32x4_mul(self.twiddle11re, x8p21); let t_a5_9 = f32x4_mul(self.twiddle13re, x9p20); let t_a5_10 = f32x4_mul(self.twiddle8re, x10p19); let t_a5_11 = f32x4_mul(self.twiddle3re, x11p18); let t_a5_12 = f32x4_mul(self.twiddle2re, x12p17); let t_a5_13 = f32x4_mul(self.twiddle7re, x13p16); let t_a5_14 = f32x4_mul(self.twiddle12re, x14p15); let t_a6_1 = f32x4_mul(self.twiddle6re, x1p28); let t_a6_2 = f32x4_mul(self.twiddle12re, x2p27); let t_a6_3 = f32x4_mul(self.twiddle11re, x3p26); let t_a6_4 = f32x4_mul(self.twiddle5re, x4p25); let t_a6_5 = f32x4_mul(self.twiddle1re, x5p24); let t_a6_6 = f32x4_mul(self.twiddle7re, x6p23); let t_a6_7 = f32x4_mul(self.twiddle13re, x7p22); let t_a6_8 = f32x4_mul(self.twiddle10re, x8p21); let t_a6_9 = f32x4_mul(self.twiddle4re, x9p20); let t_a6_10 = f32x4_mul(self.twiddle2re, x10p19); let t_a6_11 = f32x4_mul(self.twiddle8re, x11p18); let t_a6_12 = f32x4_mul(self.twiddle14re, x12p17); let t_a6_13 = f32x4_mul(self.twiddle9re, x13p16); let t_a6_14 = f32x4_mul(self.twiddle3re, x14p15); let t_a7_1 = f32x4_mul(self.twiddle7re, x1p28); let t_a7_2 = f32x4_mul(self.twiddle14re, x2p27); let t_a7_3 = f32x4_mul(self.twiddle8re, x3p26); let t_a7_4 = f32x4_mul(self.twiddle1re, x4p25); let t_a7_5 = f32x4_mul(self.twiddle6re, x5p24); let t_a7_6 = f32x4_mul(self.twiddle13re, x6p23); let t_a7_7 = f32x4_mul(self.twiddle9re, x7p22); let t_a7_8 = 
f32x4_mul(self.twiddle2re, x8p21); let t_a7_9 = f32x4_mul(self.twiddle5re, x9p20); let t_a7_10 = f32x4_mul(self.twiddle12re, x10p19); let t_a7_11 = f32x4_mul(self.twiddle10re, x11p18); let t_a7_12 = f32x4_mul(self.twiddle3re, x12p17); let t_a7_13 = f32x4_mul(self.twiddle4re, x13p16); let t_a7_14 = f32x4_mul(self.twiddle11re, x14p15); let t_a8_1 = f32x4_mul(self.twiddle8re, x1p28); let t_a8_2 = f32x4_mul(self.twiddle13re, x2p27); let t_a8_3 = f32x4_mul(self.twiddle5re, x3p26); let t_a8_4 = f32x4_mul(self.twiddle3re, x4p25); let t_a8_5 = f32x4_mul(self.twiddle11re, x5p24); let t_a8_6 = f32x4_mul(self.twiddle10re, x6p23); let t_a8_7 = f32x4_mul(self.twiddle2re, x7p22); let t_a8_8 = f32x4_mul(self.twiddle6re, x8p21); let t_a8_9 = f32x4_mul(self.twiddle14re, x9p20); let t_a8_10 = f32x4_mul(self.twiddle7re, x10p19); let t_a8_11 = f32x4_mul(self.twiddle1re, x11p18); let t_a8_12 = f32x4_mul(self.twiddle9re, x12p17); let t_a8_13 = f32x4_mul(self.twiddle12re, x13p16); let t_a8_14 = f32x4_mul(self.twiddle4re, x14p15); let t_a9_1 = f32x4_mul(self.twiddle9re, x1p28); let t_a9_2 = f32x4_mul(self.twiddle11re, x2p27); let t_a9_3 = f32x4_mul(self.twiddle2re, x3p26); let t_a9_4 = f32x4_mul(self.twiddle7re, x4p25); let t_a9_5 = f32x4_mul(self.twiddle13re, x5p24); let t_a9_6 = f32x4_mul(self.twiddle4re, x6p23); let t_a9_7 = f32x4_mul(self.twiddle5re, x7p22); let t_a9_8 = f32x4_mul(self.twiddle14re, x8p21); let t_a9_9 = f32x4_mul(self.twiddle6re, x9p20); let t_a9_10 = f32x4_mul(self.twiddle3re, x10p19); let t_a9_11 = f32x4_mul(self.twiddle12re, x11p18); let t_a9_12 = f32x4_mul(self.twiddle8re, x12p17); let t_a9_13 = f32x4_mul(self.twiddle1re, x13p16); let t_a9_14 = f32x4_mul(self.twiddle10re, x14p15); let t_a10_1 = f32x4_mul(self.twiddle10re, x1p28); let t_a10_2 = f32x4_mul(self.twiddle9re, x2p27); let t_a10_3 = f32x4_mul(self.twiddle1re, x3p26); let t_a10_4 = f32x4_mul(self.twiddle11re, x4p25); let t_a10_5 = f32x4_mul(self.twiddle8re, x5p24); let t_a10_6 = f32x4_mul(self.twiddle2re, x6p23); let t_a10_7 = f32x4_mul(self.twiddle12re, x7p22); let t_a10_8 = f32x4_mul(self.twiddle7re, x8p21); let t_a10_9 = f32x4_mul(self.twiddle3re, x9p20); let t_a10_10 = f32x4_mul(self.twiddle13re, x10p19); let t_a10_11 = f32x4_mul(self.twiddle6re, x11p18); let t_a10_12 = f32x4_mul(self.twiddle4re, x12p17); let t_a10_13 = f32x4_mul(self.twiddle14re, x13p16); let t_a10_14 = f32x4_mul(self.twiddle5re, x14p15); let t_a11_1 = f32x4_mul(self.twiddle11re, x1p28); let t_a11_2 = f32x4_mul(self.twiddle7re, x2p27); let t_a11_3 = f32x4_mul(self.twiddle4re, x3p26); let t_a11_4 = f32x4_mul(self.twiddle14re, x4p25); let t_a11_5 = f32x4_mul(self.twiddle3re, x5p24); let t_a11_6 = f32x4_mul(self.twiddle8re, x6p23); let t_a11_7 = f32x4_mul(self.twiddle10re, x7p22); let t_a11_8 = f32x4_mul(self.twiddle1re, x8p21); let t_a11_9 = f32x4_mul(self.twiddle12re, x9p20); let t_a11_10 = f32x4_mul(self.twiddle6re, x10p19); let t_a11_11 = f32x4_mul(self.twiddle5re, x11p18); let t_a11_12 = f32x4_mul(self.twiddle13re, x12p17); let t_a11_13 = f32x4_mul(self.twiddle2re, x13p16); let t_a11_14 = f32x4_mul(self.twiddle9re, x14p15); let t_a12_1 = f32x4_mul(self.twiddle12re, x1p28); let t_a12_2 = f32x4_mul(self.twiddle5re, x2p27); let t_a12_3 = f32x4_mul(self.twiddle7re, x3p26); let t_a12_4 = f32x4_mul(self.twiddle10re, x4p25); let t_a12_5 = f32x4_mul(self.twiddle2re, x5p24); let t_a12_6 = f32x4_mul(self.twiddle14re, x6p23); let t_a12_7 = f32x4_mul(self.twiddle3re, x7p22); let t_a12_8 = f32x4_mul(self.twiddle9re, x8p21); let t_a12_9 = f32x4_mul(self.twiddle8re, 
x9p20); let t_a12_10 = f32x4_mul(self.twiddle4re, x10p19); let t_a12_11 = f32x4_mul(self.twiddle13re, x11p18); let t_a12_12 = f32x4_mul(self.twiddle1re, x12p17); let t_a12_13 = f32x4_mul(self.twiddle11re, x13p16); let t_a12_14 = f32x4_mul(self.twiddle6re, x14p15); let t_a13_1 = f32x4_mul(self.twiddle13re, x1p28); let t_a13_2 = f32x4_mul(self.twiddle3re, x2p27); let t_a13_3 = f32x4_mul(self.twiddle10re, x3p26); let t_a13_4 = f32x4_mul(self.twiddle6re, x4p25); let t_a13_5 = f32x4_mul(self.twiddle7re, x5p24); let t_a13_6 = f32x4_mul(self.twiddle9re, x6p23); let t_a13_7 = f32x4_mul(self.twiddle4re, x7p22); let t_a13_8 = f32x4_mul(self.twiddle12re, x8p21); let t_a13_9 = f32x4_mul(self.twiddle1re, x9p20); let t_a13_10 = f32x4_mul(self.twiddle14re, x10p19); let t_a13_11 = f32x4_mul(self.twiddle2re, x11p18); let t_a13_12 = f32x4_mul(self.twiddle11re, x12p17); let t_a13_13 = f32x4_mul(self.twiddle5re, x13p16); let t_a13_14 = f32x4_mul(self.twiddle8re, x14p15); let t_a14_1 = f32x4_mul(self.twiddle14re, x1p28); let t_a14_2 = f32x4_mul(self.twiddle1re, x2p27); let t_a14_3 = f32x4_mul(self.twiddle13re, x3p26); let t_a14_4 = f32x4_mul(self.twiddle2re, x4p25); let t_a14_5 = f32x4_mul(self.twiddle12re, x5p24); let t_a14_6 = f32x4_mul(self.twiddle3re, x6p23); let t_a14_7 = f32x4_mul(self.twiddle11re, x7p22); let t_a14_8 = f32x4_mul(self.twiddle4re, x8p21); let t_a14_9 = f32x4_mul(self.twiddle10re, x9p20); let t_a14_10 = f32x4_mul(self.twiddle5re, x10p19); let t_a14_11 = f32x4_mul(self.twiddle9re, x11p18); let t_a14_12 = f32x4_mul(self.twiddle6re, x12p17); let t_a14_13 = f32x4_mul(self.twiddle8re, x13p16); let t_a14_14 = f32x4_mul(self.twiddle7re, x14p15); let t_b1_1 = f32x4_mul(self.twiddle1im, x1m28); let t_b1_2 = f32x4_mul(self.twiddle2im, x2m27); let t_b1_3 = f32x4_mul(self.twiddle3im, x3m26); let t_b1_4 = f32x4_mul(self.twiddle4im, x4m25); let t_b1_5 = f32x4_mul(self.twiddle5im, x5m24); let t_b1_6 = f32x4_mul(self.twiddle6im, x6m23); let t_b1_7 = f32x4_mul(self.twiddle7im, x7m22); let t_b1_8 = f32x4_mul(self.twiddle8im, x8m21); let t_b1_9 = f32x4_mul(self.twiddle9im, x9m20); let t_b1_10 = f32x4_mul(self.twiddle10im, x10m19); let t_b1_11 = f32x4_mul(self.twiddle11im, x11m18); let t_b1_12 = f32x4_mul(self.twiddle12im, x12m17); let t_b1_13 = f32x4_mul(self.twiddle13im, x13m16); let t_b1_14 = f32x4_mul(self.twiddle14im, x14m15); let t_b2_1 = f32x4_mul(self.twiddle2im, x1m28); let t_b2_2 = f32x4_mul(self.twiddle4im, x2m27); let t_b2_3 = f32x4_mul(self.twiddle6im, x3m26); let t_b2_4 = f32x4_mul(self.twiddle8im, x4m25); let t_b2_5 = f32x4_mul(self.twiddle10im, x5m24); let t_b2_6 = f32x4_mul(self.twiddle12im, x6m23); let t_b2_7 = f32x4_mul(self.twiddle14im, x7m22); let t_b2_8 = f32x4_mul(self.twiddle13im, x8m21); let t_b2_9 = f32x4_mul(self.twiddle11im, x9m20); let t_b2_10 = f32x4_mul(self.twiddle9im, x10m19); let t_b2_11 = f32x4_mul(self.twiddle7im, x11m18); let t_b2_12 = f32x4_mul(self.twiddle5im, x12m17); let t_b2_13 = f32x4_mul(self.twiddle3im, x13m16); let t_b2_14 = f32x4_mul(self.twiddle1im, x14m15); let t_b3_1 = f32x4_mul(self.twiddle3im, x1m28); let t_b3_2 = f32x4_mul(self.twiddle6im, x2m27); let t_b3_3 = f32x4_mul(self.twiddle9im, x3m26); let t_b3_4 = f32x4_mul(self.twiddle12im, x4m25); let t_b3_5 = f32x4_mul(self.twiddle14im, x5m24); let t_b3_6 = f32x4_mul(self.twiddle11im, x6m23); let t_b3_7 = f32x4_mul(self.twiddle8im, x7m22); let t_b3_8 = f32x4_mul(self.twiddle5im, x8m21); let t_b3_9 = f32x4_mul(self.twiddle2im, x9m20); let t_b3_10 = f32x4_mul(self.twiddle1im, x10m19); let t_b3_11 = 
f32x4_mul(self.twiddle4im, x11m18); let t_b3_12 = f32x4_mul(self.twiddle7im, x12m17); let t_b3_13 = f32x4_mul(self.twiddle10im, x13m16); let t_b3_14 = f32x4_mul(self.twiddle13im, x14m15); let t_b4_1 = f32x4_mul(self.twiddle4im, x1m28); let t_b4_2 = f32x4_mul(self.twiddle8im, x2m27); let t_b4_3 = f32x4_mul(self.twiddle12im, x3m26); let t_b4_4 = f32x4_mul(self.twiddle13im, x4m25); let t_b4_5 = f32x4_mul(self.twiddle9im, x5m24); let t_b4_6 = f32x4_mul(self.twiddle5im, x6m23); let t_b4_7 = f32x4_mul(self.twiddle1im, x7m22); let t_b4_8 = f32x4_mul(self.twiddle3im, x8m21); let t_b4_9 = f32x4_mul(self.twiddle7im, x9m20); let t_b4_10 = f32x4_mul(self.twiddle11im, x10m19); let t_b4_11 = f32x4_mul(self.twiddle14im, x11m18); let t_b4_12 = f32x4_mul(self.twiddle10im, x12m17); let t_b4_13 = f32x4_mul(self.twiddle6im, x13m16); let t_b4_14 = f32x4_mul(self.twiddle2im, x14m15); let t_b5_1 = f32x4_mul(self.twiddle5im, x1m28); let t_b5_2 = f32x4_mul(self.twiddle10im, x2m27); let t_b5_3 = f32x4_mul(self.twiddle14im, x3m26); let t_b5_4 = f32x4_mul(self.twiddle9im, x4m25); let t_b5_5 = f32x4_mul(self.twiddle4im, x5m24); let t_b5_6 = f32x4_mul(self.twiddle1im, x6m23); let t_b5_7 = f32x4_mul(self.twiddle6im, x7m22); let t_b5_8 = f32x4_mul(self.twiddle11im, x8m21); let t_b5_9 = f32x4_mul(self.twiddle13im, x9m20); let t_b5_10 = f32x4_mul(self.twiddle8im, x10m19); let t_b5_11 = f32x4_mul(self.twiddle3im, x11m18); let t_b5_12 = f32x4_mul(self.twiddle2im, x12m17); let t_b5_13 = f32x4_mul(self.twiddle7im, x13m16); let t_b5_14 = f32x4_mul(self.twiddle12im, x14m15); let t_b6_1 = f32x4_mul(self.twiddle6im, x1m28); let t_b6_2 = f32x4_mul(self.twiddle12im, x2m27); let t_b6_3 = f32x4_mul(self.twiddle11im, x3m26); let t_b6_4 = f32x4_mul(self.twiddle5im, x4m25); let t_b6_5 = f32x4_mul(self.twiddle1im, x5m24); let t_b6_6 = f32x4_mul(self.twiddle7im, x6m23); let t_b6_7 = f32x4_mul(self.twiddle13im, x7m22); let t_b6_8 = f32x4_mul(self.twiddle10im, x8m21); let t_b6_9 = f32x4_mul(self.twiddle4im, x9m20); let t_b6_10 = f32x4_mul(self.twiddle2im, x10m19); let t_b6_11 = f32x4_mul(self.twiddle8im, x11m18); let t_b6_12 = f32x4_mul(self.twiddle14im, x12m17); let t_b6_13 = f32x4_mul(self.twiddle9im, x13m16); let t_b6_14 = f32x4_mul(self.twiddle3im, x14m15); let t_b7_1 = f32x4_mul(self.twiddle7im, x1m28); let t_b7_2 = f32x4_mul(self.twiddle14im, x2m27); let t_b7_3 = f32x4_mul(self.twiddle8im, x3m26); let t_b7_4 = f32x4_mul(self.twiddle1im, x4m25); let t_b7_5 = f32x4_mul(self.twiddle6im, x5m24); let t_b7_6 = f32x4_mul(self.twiddle13im, x6m23); let t_b7_7 = f32x4_mul(self.twiddle9im, x7m22); let t_b7_8 = f32x4_mul(self.twiddle2im, x8m21); let t_b7_9 = f32x4_mul(self.twiddle5im, x9m20); let t_b7_10 = f32x4_mul(self.twiddle12im, x10m19); let t_b7_11 = f32x4_mul(self.twiddle10im, x11m18); let t_b7_12 = f32x4_mul(self.twiddle3im, x12m17); let t_b7_13 = f32x4_mul(self.twiddle4im, x13m16); let t_b7_14 = f32x4_mul(self.twiddle11im, x14m15); let t_b8_1 = f32x4_mul(self.twiddle8im, x1m28); let t_b8_2 = f32x4_mul(self.twiddle13im, x2m27); let t_b8_3 = f32x4_mul(self.twiddle5im, x3m26); let t_b8_4 = f32x4_mul(self.twiddle3im, x4m25); let t_b8_5 = f32x4_mul(self.twiddle11im, x5m24); let t_b8_6 = f32x4_mul(self.twiddle10im, x6m23); let t_b8_7 = f32x4_mul(self.twiddle2im, x7m22); let t_b8_8 = f32x4_mul(self.twiddle6im, x8m21); let t_b8_9 = f32x4_mul(self.twiddle14im, x9m20); let t_b8_10 = f32x4_mul(self.twiddle7im, x10m19); let t_b8_11 = f32x4_mul(self.twiddle1im, x11m18); let t_b8_12 = f32x4_mul(self.twiddle9im, x12m17); let t_b8_13 = 
f32x4_mul(self.twiddle12im, x13m16); let t_b8_14 = f32x4_mul(self.twiddle4im, x14m15); let t_b9_1 = f32x4_mul(self.twiddle9im, x1m28); let t_b9_2 = f32x4_mul(self.twiddle11im, x2m27); let t_b9_3 = f32x4_mul(self.twiddle2im, x3m26); let t_b9_4 = f32x4_mul(self.twiddle7im, x4m25); let t_b9_5 = f32x4_mul(self.twiddle13im, x5m24); let t_b9_6 = f32x4_mul(self.twiddle4im, x6m23); let t_b9_7 = f32x4_mul(self.twiddle5im, x7m22); let t_b9_8 = f32x4_mul(self.twiddle14im, x8m21); let t_b9_9 = f32x4_mul(self.twiddle6im, x9m20); let t_b9_10 = f32x4_mul(self.twiddle3im, x10m19); let t_b9_11 = f32x4_mul(self.twiddle12im, x11m18); let t_b9_12 = f32x4_mul(self.twiddle8im, x12m17); let t_b9_13 = f32x4_mul(self.twiddle1im, x13m16); let t_b9_14 = f32x4_mul(self.twiddle10im, x14m15); let t_b10_1 = f32x4_mul(self.twiddle10im, x1m28); let t_b10_2 = f32x4_mul(self.twiddle9im, x2m27); let t_b10_3 = f32x4_mul(self.twiddle1im, x3m26); let t_b10_4 = f32x4_mul(self.twiddle11im, x4m25); let t_b10_5 = f32x4_mul(self.twiddle8im, x5m24); let t_b10_6 = f32x4_mul(self.twiddle2im, x6m23); let t_b10_7 = f32x4_mul(self.twiddle12im, x7m22); let t_b10_8 = f32x4_mul(self.twiddle7im, x8m21); let t_b10_9 = f32x4_mul(self.twiddle3im, x9m20); let t_b10_10 = f32x4_mul(self.twiddle13im, x10m19); let t_b10_11 = f32x4_mul(self.twiddle6im, x11m18); let t_b10_12 = f32x4_mul(self.twiddle4im, x12m17); let t_b10_13 = f32x4_mul(self.twiddle14im, x13m16); let t_b10_14 = f32x4_mul(self.twiddle5im, x14m15); let t_b11_1 = f32x4_mul(self.twiddle11im, x1m28); let t_b11_2 = f32x4_mul(self.twiddle7im, x2m27); let t_b11_3 = f32x4_mul(self.twiddle4im, x3m26); let t_b11_4 = f32x4_mul(self.twiddle14im, x4m25); let t_b11_5 = f32x4_mul(self.twiddle3im, x5m24); let t_b11_6 = f32x4_mul(self.twiddle8im, x6m23); let t_b11_7 = f32x4_mul(self.twiddle10im, x7m22); let t_b11_8 = f32x4_mul(self.twiddle1im, x8m21); let t_b11_9 = f32x4_mul(self.twiddle12im, x9m20); let t_b11_10 = f32x4_mul(self.twiddle6im, x10m19); let t_b11_11 = f32x4_mul(self.twiddle5im, x11m18); let t_b11_12 = f32x4_mul(self.twiddle13im, x12m17); let t_b11_13 = f32x4_mul(self.twiddle2im, x13m16); let t_b11_14 = f32x4_mul(self.twiddle9im, x14m15); let t_b12_1 = f32x4_mul(self.twiddle12im, x1m28); let t_b12_2 = f32x4_mul(self.twiddle5im, x2m27); let t_b12_3 = f32x4_mul(self.twiddle7im, x3m26); let t_b12_4 = f32x4_mul(self.twiddle10im, x4m25); let t_b12_5 = f32x4_mul(self.twiddle2im, x5m24); let t_b12_6 = f32x4_mul(self.twiddle14im, x6m23); let t_b12_7 = f32x4_mul(self.twiddle3im, x7m22); let t_b12_8 = f32x4_mul(self.twiddle9im, x8m21); let t_b12_9 = f32x4_mul(self.twiddle8im, x9m20); let t_b12_10 = f32x4_mul(self.twiddle4im, x10m19); let t_b12_11 = f32x4_mul(self.twiddle13im, x11m18); let t_b12_12 = f32x4_mul(self.twiddle1im, x12m17); let t_b12_13 = f32x4_mul(self.twiddle11im, x13m16); let t_b12_14 = f32x4_mul(self.twiddle6im, x14m15); let t_b13_1 = f32x4_mul(self.twiddle13im, x1m28); let t_b13_2 = f32x4_mul(self.twiddle3im, x2m27); let t_b13_3 = f32x4_mul(self.twiddle10im, x3m26); let t_b13_4 = f32x4_mul(self.twiddle6im, x4m25); let t_b13_5 = f32x4_mul(self.twiddle7im, x5m24); let t_b13_6 = f32x4_mul(self.twiddle9im, x6m23); let t_b13_7 = f32x4_mul(self.twiddle4im, x7m22); let t_b13_8 = f32x4_mul(self.twiddle12im, x8m21); let t_b13_9 = f32x4_mul(self.twiddle1im, x9m20); let t_b13_10 = f32x4_mul(self.twiddle14im, x10m19); let t_b13_11 = f32x4_mul(self.twiddle2im, x11m18); let t_b13_12 = f32x4_mul(self.twiddle11im, x12m17); let t_b13_13 = f32x4_mul(self.twiddle5im, x13m16); let t_b13_14 = 
f32x4_mul(self.twiddle8im, x14m15); let t_b14_1 = f32x4_mul(self.twiddle14im, x1m28); let t_b14_2 = f32x4_mul(self.twiddle1im, x2m27); let t_b14_3 = f32x4_mul(self.twiddle13im, x3m26); let t_b14_4 = f32x4_mul(self.twiddle2im, x4m25); let t_b14_5 = f32x4_mul(self.twiddle12im, x5m24); let t_b14_6 = f32x4_mul(self.twiddle3im, x6m23); let t_b14_7 = f32x4_mul(self.twiddle11im, x7m22); let t_b14_8 = f32x4_mul(self.twiddle4im, x8m21); let t_b14_9 = f32x4_mul(self.twiddle10im, x9m20); let t_b14_10 = f32x4_mul(self.twiddle5im, x10m19); let t_b14_11 = f32x4_mul(self.twiddle9im, x11m18); let t_b14_12 = f32x4_mul(self.twiddle6im, x12m17); let t_b14_13 = f32x4_mul(self.twiddle8im, x13m16); let t_b14_14 = f32x4_mul(self.twiddle7im, x14m15); let x0 = values[0]; let t_a1 = calc_f32!( x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14 ); let t_a2 = calc_f32!( x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14 ); let t_a3 = calc_f32!( x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14 ); let t_a4 = calc_f32!( x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14 ); let t_a5 = calc_f32!( x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14 ); let t_a6 = calc_f32!( x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14 ); let t_a7 = calc_f32!( x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14 ); let t_a8 = calc_f32!( x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14 ); let t_a9 = calc_f32!( x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14 ); let t_a10 = calc_f32!( x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14 ); let t_a11 = calc_f32!( x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14 ); let t_a12 = calc_f32!( x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14 ); let t_a13 = calc_f32!( x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14 ); let t_a14 = calc_f32!( x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14 ); let t_b1 = calc_f32!( t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14 ); let t_b2 = calc_f32!( t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14 ); let t_b3 = calc_f32!( t_b3_1 + t_b3_2 + t_b3_3 + t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 + t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14 ); let t_b4 = calc_f32!( t_b4_1 + 
t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + t_b4_9 + t_b4_10 - t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14 ); let t_b5 = calc_f32!( t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6 + t_b5_7 + t_b5_8 - t_b5_9 - t_b5_10 - t_b5_11 + t_b5_12 + t_b5_13 + t_b5_14 ); let t_b6 = calc_f32!( t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 + t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 + t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14 ); let t_b7 = calc_f32!( t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 - t_b7_11 - t_b7_12 + t_b7_13 + t_b7_14 ); let t_b8 = calc_f32!( t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 + t_b8_11 + t_b8_12 - t_b8_13 - t_b8_14 ); let t_b9 = calc_f32!( t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 - t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 + t_b9_10 + t_b9_11 - t_b9_12 + t_b9_13 + t_b9_14 ); let t_b10 = calc_f32!( t_b10_1 - t_b10_2 + t_b10_3 + t_b10_4 - t_b10_5 + t_b10_6 + t_b10_7 - t_b10_8 + t_b10_9 + t_b10_10 - t_b10_11 + t_b10_12 + t_b10_13 - t_b10_14 ); let t_b11 = calc_f32!( t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 - t_b11_5 + t_b11_6 - t_b11_7 + t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11 - t_b11_12 - t_b11_13 + t_b11_14 ); let t_b12 = calc_f32!( t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 + t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 - t_b12_9 + t_b12_10 - t_b12_11 - t_b12_12 + t_b12_13 - t_b12_14 ); let t_b13 = calc_f32!( t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 + t_b13_7 - t_b13_8 + t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 - t_b13_13 + t_b13_14 ); let t_b14 = calc_f32!( t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 + t_b14_11 - t_b14_12 + t_b14_13 - t_b14_14 ); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let t_b9_rot = self.rotate.rotate_both(t_b9); let t_b10_rot = self.rotate.rotate_both(t_b10); let t_b11_rot = self.rotate.rotate_both(t_b11); let t_b12_rot = self.rotate.rotate_both(t_b12); let t_b13_rot = self.rotate.rotate_both(t_b13); let t_b14_rot = self.rotate.rotate_both(t_b14); let y0 = calc_f32!( x0 + x1p28 + x2p27 + x3p26 + x4p25 + x5p24 + x6p23 + x7p22 + x8p21 + x9p20 + x10p19 + x11p18 + x12p17 + x13p16 + x14p15 ); let [y1, y28] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y27] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y26] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y25] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y24] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y23] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); let [y7, y22] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot); let [y8, y21] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot); let [y9, y20] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot); let [y10, y19] = parallel_fft2_interleaved_f32(t_a10, t_b10_rot); let [y11, y18] = parallel_fft2_interleaved_f32(t_a11, t_b11_rot); let [y12, y17] = parallel_fft2_interleaved_f32(t_a12, t_b12_rot); let [y13, y16] = parallel_fft2_interleaved_f32(t_a13, t_b13_rot); let [y14, y15] = parallel_fft2_interleaved_f32(t_a14, t_b14_rot); [ y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, y23, y24, y25, 
y26, y27, y28, ] } }

//
// 29 - 64bit
//

pub struct WasmSimdF64Butterfly29<T> { direction: FftDirection, _phantom: std::marker::PhantomData<T>, rotate: Rotate90F64, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, twiddle6re: v128, twiddle6im: v128, twiddle7re: v128, twiddle7im: v128, twiddle8re: v128, twiddle8im: v128, twiddle9re: v128, twiddle9im: v128, twiddle10re: v128, twiddle10im: v128, twiddle11re: v128, twiddle11im: v128, twiddle12re: v128, twiddle12im: v128, twiddle13re: v128, twiddle13im: v128, twiddle14re: v128, twiddle14im: v128, }

boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly29, 29, |this: &WasmSimdF64Butterfly29<_>| this.direction );
boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly29, 29, |this: &WasmSimdF64Butterfly29<_>| this.direction );

impl<T: FftNum> WasmSimdF64Butterfly29<T> { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::<T>(); let rotate = Rotate90F64::new(true); let tw1: Complex<f64> = twiddles::compute_twiddle(1, 29, direction); let tw2: Complex<f64> = twiddles::compute_twiddle(2, 29, direction); let tw3: Complex<f64> = twiddles::compute_twiddle(3, 29, direction); let tw4: Complex<f64> = twiddles::compute_twiddle(4, 29, direction); let tw5: Complex<f64> = twiddles::compute_twiddle(5, 29, direction); let tw6: Complex<f64> = twiddles::compute_twiddle(6, 29, direction); let tw7: Complex<f64> = twiddles::compute_twiddle(7, 29, direction); let tw8: Complex<f64> = twiddles::compute_twiddle(8, 29, direction); let tw9: Complex<f64> = twiddles::compute_twiddle(9, 29, direction); let tw10: Complex<f64> = twiddles::compute_twiddle(10, 29, direction); let tw11: Complex<f64> = twiddles::compute_twiddle(11, 29, direction); let tw12: Complex<f64> = twiddles::compute_twiddle(12, 29, direction); let tw13: Complex<f64> = twiddles::compute_twiddle(13, 29, direction); let tw14: Complex<f64> = twiddles::compute_twiddle(14, 29, direction); let twiddle1re = f64x2_splat(tw1.re); let twiddle1im = f64x2_splat(tw1.im); let twiddle2re = f64x2_splat(tw2.re); let twiddle2im = f64x2_splat(tw2.im); let twiddle3re = f64x2_splat(tw3.re); let twiddle3im = f64x2_splat(tw3.im); let twiddle4re = f64x2_splat(tw4.re); let twiddle4im = f64x2_splat(tw4.im); let twiddle5re = f64x2_splat(tw5.re); let twiddle5im = f64x2_splat(tw5.im); let twiddle6re = f64x2_splat(tw6.re); let twiddle6im = f64x2_splat(tw6.im); let twiddle7re = f64x2_splat(tw7.re); let twiddle7im = f64x2_splat(tw7.im); let twiddle8re = f64x2_splat(tw8.re); let twiddle8im = f64x2_splat(tw8.im); let twiddle9re = f64x2_splat(tw9.re); let twiddle9im = f64x2_splat(tw9.im); let twiddle10re = f64x2_splat(tw10.re); let twiddle10im = f64x2_splat(tw10.im); let twiddle11re = f64x2_splat(tw11.re); let twiddle11im = f64x2_splat(tw11.im); let twiddle12re = f64x2_splat(tw12.re); let twiddle12im = f64x2_splat(tw12.im); let twiddle13re = f64x2_splat(tw13.re); let twiddle13im = f64x2_splat(tw13.im); let twiddle14re = f64x2_splat(tw14.re); let twiddle14im = f64x2_splat(tw14.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, twiddle12re, twiddle12im, twiddle13re, twiddle13im, twiddle14re, twiddle14im, } }
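    // As with the other prime-size WASM SIMD butterflies in this module,
    // `perform_fft_contiguous` loads the 29 Complex<f64> inputs, calls
    // `perform_fft_direct` on the v128 registers, and stores the result.
    // `perform_fft_direct` pairs x[k] with x[29-k] via solo_fft2_f64, scales the
    // pair sums by the twiddle real parts (the t_a rows) and the pair differences
    // by the twiddle imaginary parts (the t_b rows), rotates each t_b row by 90
    // degrees, and recombines every (t_a, t_b) pair with a final size-2 FFT.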
#[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut<f64>) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 29]) -> [v128; 29] { let [x1p28, x1m28] = solo_fft2_f64(values[1], values[28]); let [x2p27, x2m27] = solo_fft2_f64(values[2], values[27]); let [x3p26, x3m26] = solo_fft2_f64(values[3], values[26]); let [x4p25, x4m25] = solo_fft2_f64(values[4], values[25]); let [x5p24, x5m24] = solo_fft2_f64(values[5], values[24]); let [x6p23, x6m23] = solo_fft2_f64(values[6], values[23]); let [x7p22, x7m22] = solo_fft2_f64(values[7], values[22]); let [x8p21, x8m21] = solo_fft2_f64(values[8], values[21]); let [x9p20, x9m20] = solo_fft2_f64(values[9], values[20]); let [x10p19, x10m19] = solo_fft2_f64(values[10], values[19]); let [x11p18, x11m18] = solo_fft2_f64(values[11], values[18]); let [x12p17, x12m17] = solo_fft2_f64(values[12], values[17]); let [x13p16, x13m16] = solo_fft2_f64(values[13], values[16]); let [x14p15, x14m15] = solo_fft2_f64(values[14], values[15]); let t_a1_1 = f64x2_mul(self.twiddle1re, x1p28); let t_a1_2 = f64x2_mul(self.twiddle2re, x2p27); let t_a1_3 = f64x2_mul(self.twiddle3re, x3p26); let t_a1_4 = f64x2_mul(self.twiddle4re, x4p25); let t_a1_5 = f64x2_mul(self.twiddle5re, x5p24); let t_a1_6 = f64x2_mul(self.twiddle6re, x6p23); let t_a1_7 = f64x2_mul(self.twiddle7re, x7p22); let t_a1_8 = f64x2_mul(self.twiddle8re, x8p21); let t_a1_9 = f64x2_mul(self.twiddle9re, x9p20); let t_a1_10 = f64x2_mul(self.twiddle10re, x10p19); let t_a1_11 = f64x2_mul(self.twiddle11re, x11p18); let t_a1_12 = f64x2_mul(self.twiddle12re, x12p17); let t_a1_13 = f64x2_mul(self.twiddle13re, x13p16); let t_a1_14 = f64x2_mul(self.twiddle14re, x14p15); let t_a2_1 = f64x2_mul(self.twiddle2re, x1p28); let t_a2_2 = f64x2_mul(self.twiddle4re, x2p27); let t_a2_3 = f64x2_mul(self.twiddle6re, x3p26); let t_a2_4 = f64x2_mul(self.twiddle8re, x4p25); let t_a2_5 = f64x2_mul(self.twiddle10re, x5p24); let t_a2_6 = f64x2_mul(self.twiddle12re, x6p23); let t_a2_7 = f64x2_mul(self.twiddle14re, x7p22); let t_a2_8 = f64x2_mul(self.twiddle13re, x8p21); let t_a2_9 = f64x2_mul(self.twiddle11re, x9p20); let t_a2_10 = f64x2_mul(self.twiddle9re, x10p19); let t_a2_11 = f64x2_mul(self.twiddle7re, x11p18); let t_a2_12 = f64x2_mul(self.twiddle5re, x12p17); let t_a2_13 = f64x2_mul(self.twiddle3re, x13p16); let t_a2_14 = f64x2_mul(self.twiddle1re, x14p15); let t_a3_1 = f64x2_mul(self.twiddle3re, x1p28); let t_a3_2 = f64x2_mul(self.twiddle6re, x2p27); let t_a3_3 = f64x2_mul(self.twiddle9re, x3p26); let t_a3_4 = f64x2_mul(self.twiddle12re, x4p25); let t_a3_5 = f64x2_mul(self.twiddle14re, x5p24); let t_a3_6 = f64x2_mul(self.twiddle11re, x6p23); let t_a3_7 = f64x2_mul(self.twiddle8re, x7p22); let t_a3_8 = f64x2_mul(self.twiddle5re, x8p21); let t_a3_9 = f64x2_mul(self.twiddle2re, x9p20); let t_a3_10 = f64x2_mul(self.twiddle1re, x10p19); let t_a3_11 = f64x2_mul(self.twiddle4re, x11p18); let t_a3_12 = f64x2_mul(self.twiddle7re, x12p17); let t_a3_13 = f64x2_mul(self.twiddle10re, x13p16); let t_a3_14 = f64x2_mul(self.twiddle13re,
x14p15); let t_a4_1 = f64x2_mul(self.twiddle4re, x1p28); let t_a4_2 = f64x2_mul(self.twiddle8re, x2p27); let t_a4_3 = f64x2_mul(self.twiddle12re, x3p26); let t_a4_4 = f64x2_mul(self.twiddle13re, x4p25); let t_a4_5 = f64x2_mul(self.twiddle9re, x5p24); let t_a4_6 = f64x2_mul(self.twiddle5re, x6p23); let t_a4_7 = f64x2_mul(self.twiddle1re, x7p22); let t_a4_8 = f64x2_mul(self.twiddle3re, x8p21); let t_a4_9 = f64x2_mul(self.twiddle7re, x9p20); let t_a4_10 = f64x2_mul(self.twiddle11re, x10p19); let t_a4_11 = f64x2_mul(self.twiddle14re, x11p18); let t_a4_12 = f64x2_mul(self.twiddle10re, x12p17); let t_a4_13 = f64x2_mul(self.twiddle6re, x13p16); let t_a4_14 = f64x2_mul(self.twiddle2re, x14p15); let t_a5_1 = f64x2_mul(self.twiddle5re, x1p28); let t_a5_2 = f64x2_mul(self.twiddle10re, x2p27); let t_a5_3 = f64x2_mul(self.twiddle14re, x3p26); let t_a5_4 = f64x2_mul(self.twiddle9re, x4p25); let t_a5_5 = f64x2_mul(self.twiddle4re, x5p24); let t_a5_6 = f64x2_mul(self.twiddle1re, x6p23); let t_a5_7 = f64x2_mul(self.twiddle6re, x7p22); let t_a5_8 = f64x2_mul(self.twiddle11re, x8p21); let t_a5_9 = f64x2_mul(self.twiddle13re, x9p20); let t_a5_10 = f64x2_mul(self.twiddle8re, x10p19); let t_a5_11 = f64x2_mul(self.twiddle3re, x11p18); let t_a5_12 = f64x2_mul(self.twiddle2re, x12p17); let t_a5_13 = f64x2_mul(self.twiddle7re, x13p16); let t_a5_14 = f64x2_mul(self.twiddle12re, x14p15); let t_a6_1 = f64x2_mul(self.twiddle6re, x1p28); let t_a6_2 = f64x2_mul(self.twiddle12re, x2p27); let t_a6_3 = f64x2_mul(self.twiddle11re, x3p26); let t_a6_4 = f64x2_mul(self.twiddle5re, x4p25); let t_a6_5 = f64x2_mul(self.twiddle1re, x5p24); let t_a6_6 = f64x2_mul(self.twiddle7re, x6p23); let t_a6_7 = f64x2_mul(self.twiddle13re, x7p22); let t_a6_8 = f64x2_mul(self.twiddle10re, x8p21); let t_a6_9 = f64x2_mul(self.twiddle4re, x9p20); let t_a6_10 = f64x2_mul(self.twiddle2re, x10p19); let t_a6_11 = f64x2_mul(self.twiddle8re, x11p18); let t_a6_12 = f64x2_mul(self.twiddle14re, x12p17); let t_a6_13 = f64x2_mul(self.twiddle9re, x13p16); let t_a6_14 = f64x2_mul(self.twiddle3re, x14p15); let t_a7_1 = f64x2_mul(self.twiddle7re, x1p28); let t_a7_2 = f64x2_mul(self.twiddle14re, x2p27); let t_a7_3 = f64x2_mul(self.twiddle8re, x3p26); let t_a7_4 = f64x2_mul(self.twiddle1re, x4p25); let t_a7_5 = f64x2_mul(self.twiddle6re, x5p24); let t_a7_6 = f64x2_mul(self.twiddle13re, x6p23); let t_a7_7 = f64x2_mul(self.twiddle9re, x7p22); let t_a7_8 = f64x2_mul(self.twiddle2re, x8p21); let t_a7_9 = f64x2_mul(self.twiddle5re, x9p20); let t_a7_10 = f64x2_mul(self.twiddle12re, x10p19); let t_a7_11 = f64x2_mul(self.twiddle10re, x11p18); let t_a7_12 = f64x2_mul(self.twiddle3re, x12p17); let t_a7_13 = f64x2_mul(self.twiddle4re, x13p16); let t_a7_14 = f64x2_mul(self.twiddle11re, x14p15); let t_a8_1 = f64x2_mul(self.twiddle8re, x1p28); let t_a8_2 = f64x2_mul(self.twiddle13re, x2p27); let t_a8_3 = f64x2_mul(self.twiddle5re, x3p26); let t_a8_4 = f64x2_mul(self.twiddle3re, x4p25); let t_a8_5 = f64x2_mul(self.twiddle11re, x5p24); let t_a8_6 = f64x2_mul(self.twiddle10re, x6p23); let t_a8_7 = f64x2_mul(self.twiddle2re, x7p22); let t_a8_8 = f64x2_mul(self.twiddle6re, x8p21); let t_a8_9 = f64x2_mul(self.twiddle14re, x9p20); let t_a8_10 = f64x2_mul(self.twiddle7re, x10p19); let t_a8_11 = f64x2_mul(self.twiddle1re, x11p18); let t_a8_12 = f64x2_mul(self.twiddle9re, x12p17); let t_a8_13 = f64x2_mul(self.twiddle12re, x13p16); let t_a8_14 = f64x2_mul(self.twiddle4re, x14p15); let t_a9_1 = f64x2_mul(self.twiddle9re, x1p28); let t_a9_2 = f64x2_mul(self.twiddle11re, x2p27); let t_a9_3 = 
f64x2_mul(self.twiddle2re, x3p26); let t_a9_4 = f64x2_mul(self.twiddle7re, x4p25); let t_a9_5 = f64x2_mul(self.twiddle13re, x5p24); let t_a9_6 = f64x2_mul(self.twiddle4re, x6p23); let t_a9_7 = f64x2_mul(self.twiddle5re, x7p22); let t_a9_8 = f64x2_mul(self.twiddle14re, x8p21); let t_a9_9 = f64x2_mul(self.twiddle6re, x9p20); let t_a9_10 = f64x2_mul(self.twiddle3re, x10p19); let t_a9_11 = f64x2_mul(self.twiddle12re, x11p18); let t_a9_12 = f64x2_mul(self.twiddle8re, x12p17); let t_a9_13 = f64x2_mul(self.twiddle1re, x13p16); let t_a9_14 = f64x2_mul(self.twiddle10re, x14p15); let t_a10_1 = f64x2_mul(self.twiddle10re, x1p28); let t_a10_2 = f64x2_mul(self.twiddle9re, x2p27); let t_a10_3 = f64x2_mul(self.twiddle1re, x3p26); let t_a10_4 = f64x2_mul(self.twiddle11re, x4p25); let t_a10_5 = f64x2_mul(self.twiddle8re, x5p24); let t_a10_6 = f64x2_mul(self.twiddle2re, x6p23); let t_a10_7 = f64x2_mul(self.twiddle12re, x7p22); let t_a10_8 = f64x2_mul(self.twiddle7re, x8p21); let t_a10_9 = f64x2_mul(self.twiddle3re, x9p20); let t_a10_10 = f64x2_mul(self.twiddle13re, x10p19); let t_a10_11 = f64x2_mul(self.twiddle6re, x11p18); let t_a10_12 = f64x2_mul(self.twiddle4re, x12p17); let t_a10_13 = f64x2_mul(self.twiddle14re, x13p16); let t_a10_14 = f64x2_mul(self.twiddle5re, x14p15); let t_a11_1 = f64x2_mul(self.twiddle11re, x1p28); let t_a11_2 = f64x2_mul(self.twiddle7re, x2p27); let t_a11_3 = f64x2_mul(self.twiddle4re, x3p26); let t_a11_4 = f64x2_mul(self.twiddle14re, x4p25); let t_a11_5 = f64x2_mul(self.twiddle3re, x5p24); let t_a11_6 = f64x2_mul(self.twiddle8re, x6p23); let t_a11_7 = f64x2_mul(self.twiddle10re, x7p22); let t_a11_8 = f64x2_mul(self.twiddle1re, x8p21); let t_a11_9 = f64x2_mul(self.twiddle12re, x9p20); let t_a11_10 = f64x2_mul(self.twiddle6re, x10p19); let t_a11_11 = f64x2_mul(self.twiddle5re, x11p18); let t_a11_12 = f64x2_mul(self.twiddle13re, x12p17); let t_a11_13 = f64x2_mul(self.twiddle2re, x13p16); let t_a11_14 = f64x2_mul(self.twiddle9re, x14p15); let t_a12_1 = f64x2_mul(self.twiddle12re, x1p28); let t_a12_2 = f64x2_mul(self.twiddle5re, x2p27); let t_a12_3 = f64x2_mul(self.twiddle7re, x3p26); let t_a12_4 = f64x2_mul(self.twiddle10re, x4p25); let t_a12_5 = f64x2_mul(self.twiddle2re, x5p24); let t_a12_6 = f64x2_mul(self.twiddle14re, x6p23); let t_a12_7 = f64x2_mul(self.twiddle3re, x7p22); let t_a12_8 = f64x2_mul(self.twiddle9re, x8p21); let t_a12_9 = f64x2_mul(self.twiddle8re, x9p20); let t_a12_10 = f64x2_mul(self.twiddle4re, x10p19); let t_a12_11 = f64x2_mul(self.twiddle13re, x11p18); let t_a12_12 = f64x2_mul(self.twiddle1re, x12p17); let t_a12_13 = f64x2_mul(self.twiddle11re, x13p16); let t_a12_14 = f64x2_mul(self.twiddle6re, x14p15); let t_a13_1 = f64x2_mul(self.twiddle13re, x1p28); let t_a13_2 = f64x2_mul(self.twiddle3re, x2p27); let t_a13_3 = f64x2_mul(self.twiddle10re, x3p26); let t_a13_4 = f64x2_mul(self.twiddle6re, x4p25); let t_a13_5 = f64x2_mul(self.twiddle7re, x5p24); let t_a13_6 = f64x2_mul(self.twiddle9re, x6p23); let t_a13_7 = f64x2_mul(self.twiddle4re, x7p22); let t_a13_8 = f64x2_mul(self.twiddle12re, x8p21); let t_a13_9 = f64x2_mul(self.twiddle1re, x9p20); let t_a13_10 = f64x2_mul(self.twiddle14re, x10p19); let t_a13_11 = f64x2_mul(self.twiddle2re, x11p18); let t_a13_12 = f64x2_mul(self.twiddle11re, x12p17); let t_a13_13 = f64x2_mul(self.twiddle5re, x13p16); let t_a13_14 = f64x2_mul(self.twiddle8re, x14p15); let t_a14_1 = f64x2_mul(self.twiddle14re, x1p28); let t_a14_2 = f64x2_mul(self.twiddle1re, x2p27); let t_a14_3 = f64x2_mul(self.twiddle13re, x3p26); let t_a14_4 = 
f64x2_mul(self.twiddle2re, x4p25); let t_a14_5 = f64x2_mul(self.twiddle12re, x5p24); let t_a14_6 = f64x2_mul(self.twiddle3re, x6p23); let t_a14_7 = f64x2_mul(self.twiddle11re, x7p22); let t_a14_8 = f64x2_mul(self.twiddle4re, x8p21); let t_a14_9 = f64x2_mul(self.twiddle10re, x9p20); let t_a14_10 = f64x2_mul(self.twiddle5re, x10p19); let t_a14_11 = f64x2_mul(self.twiddle9re, x11p18); let t_a14_12 = f64x2_mul(self.twiddle6re, x12p17); let t_a14_13 = f64x2_mul(self.twiddle8re, x13p16); let t_a14_14 = f64x2_mul(self.twiddle7re, x14p15); let t_b1_1 = f64x2_mul(self.twiddle1im, x1m28); let t_b1_2 = f64x2_mul(self.twiddle2im, x2m27); let t_b1_3 = f64x2_mul(self.twiddle3im, x3m26); let t_b1_4 = f64x2_mul(self.twiddle4im, x4m25); let t_b1_5 = f64x2_mul(self.twiddle5im, x5m24); let t_b1_6 = f64x2_mul(self.twiddle6im, x6m23); let t_b1_7 = f64x2_mul(self.twiddle7im, x7m22); let t_b1_8 = f64x2_mul(self.twiddle8im, x8m21); let t_b1_9 = f64x2_mul(self.twiddle9im, x9m20); let t_b1_10 = f64x2_mul(self.twiddle10im, x10m19); let t_b1_11 = f64x2_mul(self.twiddle11im, x11m18); let t_b1_12 = f64x2_mul(self.twiddle12im, x12m17); let t_b1_13 = f64x2_mul(self.twiddle13im, x13m16); let t_b1_14 = f64x2_mul(self.twiddle14im, x14m15); let t_b2_1 = f64x2_mul(self.twiddle2im, x1m28); let t_b2_2 = f64x2_mul(self.twiddle4im, x2m27); let t_b2_3 = f64x2_mul(self.twiddle6im, x3m26); let t_b2_4 = f64x2_mul(self.twiddle8im, x4m25); let t_b2_5 = f64x2_mul(self.twiddle10im, x5m24); let t_b2_6 = f64x2_mul(self.twiddle12im, x6m23); let t_b2_7 = f64x2_mul(self.twiddle14im, x7m22); let t_b2_8 = f64x2_mul(self.twiddle13im, x8m21); let t_b2_9 = f64x2_mul(self.twiddle11im, x9m20); let t_b2_10 = f64x2_mul(self.twiddle9im, x10m19); let t_b2_11 = f64x2_mul(self.twiddle7im, x11m18); let t_b2_12 = f64x2_mul(self.twiddle5im, x12m17); let t_b2_13 = f64x2_mul(self.twiddle3im, x13m16); let t_b2_14 = f64x2_mul(self.twiddle1im, x14m15); let t_b3_1 = f64x2_mul(self.twiddle3im, x1m28); let t_b3_2 = f64x2_mul(self.twiddle6im, x2m27); let t_b3_3 = f64x2_mul(self.twiddle9im, x3m26); let t_b3_4 = f64x2_mul(self.twiddle12im, x4m25); let t_b3_5 = f64x2_mul(self.twiddle14im, x5m24); let t_b3_6 = f64x2_mul(self.twiddle11im, x6m23); let t_b3_7 = f64x2_mul(self.twiddle8im, x7m22); let t_b3_8 = f64x2_mul(self.twiddle5im, x8m21); let t_b3_9 = f64x2_mul(self.twiddle2im, x9m20); let t_b3_10 = f64x2_mul(self.twiddle1im, x10m19); let t_b3_11 = f64x2_mul(self.twiddle4im, x11m18); let t_b3_12 = f64x2_mul(self.twiddle7im, x12m17); let t_b3_13 = f64x2_mul(self.twiddle10im, x13m16); let t_b3_14 = f64x2_mul(self.twiddle13im, x14m15); let t_b4_1 = f64x2_mul(self.twiddle4im, x1m28); let t_b4_2 = f64x2_mul(self.twiddle8im, x2m27); let t_b4_3 = f64x2_mul(self.twiddle12im, x3m26); let t_b4_4 = f64x2_mul(self.twiddle13im, x4m25); let t_b4_5 = f64x2_mul(self.twiddle9im, x5m24); let t_b4_6 = f64x2_mul(self.twiddle5im, x6m23); let t_b4_7 = f64x2_mul(self.twiddle1im, x7m22); let t_b4_8 = f64x2_mul(self.twiddle3im, x8m21); let t_b4_9 = f64x2_mul(self.twiddle7im, x9m20); let t_b4_10 = f64x2_mul(self.twiddle11im, x10m19); let t_b4_11 = f64x2_mul(self.twiddle14im, x11m18); let t_b4_12 = f64x2_mul(self.twiddle10im, x12m17); let t_b4_13 = f64x2_mul(self.twiddle6im, x13m16); let t_b4_14 = f64x2_mul(self.twiddle2im, x14m15); let t_b5_1 = f64x2_mul(self.twiddle5im, x1m28); let t_b5_2 = f64x2_mul(self.twiddle10im, x2m27); let t_b5_3 = f64x2_mul(self.twiddle14im, x3m26); let t_b5_4 = f64x2_mul(self.twiddle9im, x4m25); let t_b5_5 = f64x2_mul(self.twiddle4im, x5m24); let t_b5_6 = 
f64x2_mul(self.twiddle1im, x6m23); let t_b5_7 = f64x2_mul(self.twiddle6im, x7m22); let t_b5_8 = f64x2_mul(self.twiddle11im, x8m21); let t_b5_9 = f64x2_mul(self.twiddle13im, x9m20); let t_b5_10 = f64x2_mul(self.twiddle8im, x10m19); let t_b5_11 = f64x2_mul(self.twiddle3im, x11m18); let t_b5_12 = f64x2_mul(self.twiddle2im, x12m17); let t_b5_13 = f64x2_mul(self.twiddle7im, x13m16); let t_b5_14 = f64x2_mul(self.twiddle12im, x14m15); let t_b6_1 = f64x2_mul(self.twiddle6im, x1m28); let t_b6_2 = f64x2_mul(self.twiddle12im, x2m27); let t_b6_3 = f64x2_mul(self.twiddle11im, x3m26); let t_b6_4 = f64x2_mul(self.twiddle5im, x4m25); let t_b6_5 = f64x2_mul(self.twiddle1im, x5m24); let t_b6_6 = f64x2_mul(self.twiddle7im, x6m23); let t_b6_7 = f64x2_mul(self.twiddle13im, x7m22); let t_b6_8 = f64x2_mul(self.twiddle10im, x8m21); let t_b6_9 = f64x2_mul(self.twiddle4im, x9m20); let t_b6_10 = f64x2_mul(self.twiddle2im, x10m19); let t_b6_11 = f64x2_mul(self.twiddle8im, x11m18); let t_b6_12 = f64x2_mul(self.twiddle14im, x12m17); let t_b6_13 = f64x2_mul(self.twiddle9im, x13m16); let t_b6_14 = f64x2_mul(self.twiddle3im, x14m15); let t_b7_1 = f64x2_mul(self.twiddle7im, x1m28); let t_b7_2 = f64x2_mul(self.twiddle14im, x2m27); let t_b7_3 = f64x2_mul(self.twiddle8im, x3m26); let t_b7_4 = f64x2_mul(self.twiddle1im, x4m25); let t_b7_5 = f64x2_mul(self.twiddle6im, x5m24); let t_b7_6 = f64x2_mul(self.twiddle13im, x6m23); let t_b7_7 = f64x2_mul(self.twiddle9im, x7m22); let t_b7_8 = f64x2_mul(self.twiddle2im, x8m21); let t_b7_9 = f64x2_mul(self.twiddle5im, x9m20); let t_b7_10 = f64x2_mul(self.twiddle12im, x10m19); let t_b7_11 = f64x2_mul(self.twiddle10im, x11m18); let t_b7_12 = f64x2_mul(self.twiddle3im, x12m17); let t_b7_13 = f64x2_mul(self.twiddle4im, x13m16); let t_b7_14 = f64x2_mul(self.twiddle11im, x14m15); let t_b8_1 = f64x2_mul(self.twiddle8im, x1m28); let t_b8_2 = f64x2_mul(self.twiddle13im, x2m27); let t_b8_3 = f64x2_mul(self.twiddle5im, x3m26); let t_b8_4 = f64x2_mul(self.twiddle3im, x4m25); let t_b8_5 = f64x2_mul(self.twiddle11im, x5m24); let t_b8_6 = f64x2_mul(self.twiddle10im, x6m23); let t_b8_7 = f64x2_mul(self.twiddle2im, x7m22); let t_b8_8 = f64x2_mul(self.twiddle6im, x8m21); let t_b8_9 = f64x2_mul(self.twiddle14im, x9m20); let t_b8_10 = f64x2_mul(self.twiddle7im, x10m19); let t_b8_11 = f64x2_mul(self.twiddle1im, x11m18); let t_b8_12 = f64x2_mul(self.twiddle9im, x12m17); let t_b8_13 = f64x2_mul(self.twiddle12im, x13m16); let t_b8_14 = f64x2_mul(self.twiddle4im, x14m15); let t_b9_1 = f64x2_mul(self.twiddle9im, x1m28); let t_b9_2 = f64x2_mul(self.twiddle11im, x2m27); let t_b9_3 = f64x2_mul(self.twiddle2im, x3m26); let t_b9_4 = f64x2_mul(self.twiddle7im, x4m25); let t_b9_5 = f64x2_mul(self.twiddle13im, x5m24); let t_b9_6 = f64x2_mul(self.twiddle4im, x6m23); let t_b9_7 = f64x2_mul(self.twiddle5im, x7m22); let t_b9_8 = f64x2_mul(self.twiddle14im, x8m21); let t_b9_9 = f64x2_mul(self.twiddle6im, x9m20); let t_b9_10 = f64x2_mul(self.twiddle3im, x10m19); let t_b9_11 = f64x2_mul(self.twiddle12im, x11m18); let t_b9_12 = f64x2_mul(self.twiddle8im, x12m17); let t_b9_13 = f64x2_mul(self.twiddle1im, x13m16); let t_b9_14 = f64x2_mul(self.twiddle10im, x14m15); let t_b10_1 = f64x2_mul(self.twiddle10im, x1m28); let t_b10_2 = f64x2_mul(self.twiddle9im, x2m27); let t_b10_3 = f64x2_mul(self.twiddle1im, x3m26); let t_b10_4 = f64x2_mul(self.twiddle11im, x4m25); let t_b10_5 = f64x2_mul(self.twiddle8im, x5m24); let t_b10_6 = f64x2_mul(self.twiddle2im, x6m23); let t_b10_7 = f64x2_mul(self.twiddle12im, x7m22); let t_b10_8 = 
f64x2_mul(self.twiddle7im, x8m21); let t_b10_9 = f64x2_mul(self.twiddle3im, x9m20); let t_b10_10 = f64x2_mul(self.twiddle13im, x10m19); let t_b10_11 = f64x2_mul(self.twiddle6im, x11m18); let t_b10_12 = f64x2_mul(self.twiddle4im, x12m17); let t_b10_13 = f64x2_mul(self.twiddle14im, x13m16); let t_b10_14 = f64x2_mul(self.twiddle5im, x14m15); let t_b11_1 = f64x2_mul(self.twiddle11im, x1m28); let t_b11_2 = f64x2_mul(self.twiddle7im, x2m27); let t_b11_3 = f64x2_mul(self.twiddle4im, x3m26); let t_b11_4 = f64x2_mul(self.twiddle14im, x4m25); let t_b11_5 = f64x2_mul(self.twiddle3im, x5m24); let t_b11_6 = f64x2_mul(self.twiddle8im, x6m23); let t_b11_7 = f64x2_mul(self.twiddle10im, x7m22); let t_b11_8 = f64x2_mul(self.twiddle1im, x8m21); let t_b11_9 = f64x2_mul(self.twiddle12im, x9m20); let t_b11_10 = f64x2_mul(self.twiddle6im, x10m19); let t_b11_11 = f64x2_mul(self.twiddle5im, x11m18); let t_b11_12 = f64x2_mul(self.twiddle13im, x12m17); let t_b11_13 = f64x2_mul(self.twiddle2im, x13m16); let t_b11_14 = f64x2_mul(self.twiddle9im, x14m15); let t_b12_1 = f64x2_mul(self.twiddle12im, x1m28); let t_b12_2 = f64x2_mul(self.twiddle5im, x2m27); let t_b12_3 = f64x2_mul(self.twiddle7im, x3m26); let t_b12_4 = f64x2_mul(self.twiddle10im, x4m25); let t_b12_5 = f64x2_mul(self.twiddle2im, x5m24); let t_b12_6 = f64x2_mul(self.twiddle14im, x6m23); let t_b12_7 = f64x2_mul(self.twiddle3im, x7m22); let t_b12_8 = f64x2_mul(self.twiddle9im, x8m21); let t_b12_9 = f64x2_mul(self.twiddle8im, x9m20); let t_b12_10 = f64x2_mul(self.twiddle4im, x10m19); let t_b12_11 = f64x2_mul(self.twiddle13im, x11m18); let t_b12_12 = f64x2_mul(self.twiddle1im, x12m17); let t_b12_13 = f64x2_mul(self.twiddle11im, x13m16); let t_b12_14 = f64x2_mul(self.twiddle6im, x14m15); let t_b13_1 = f64x2_mul(self.twiddle13im, x1m28); let t_b13_2 = f64x2_mul(self.twiddle3im, x2m27); let t_b13_3 = f64x2_mul(self.twiddle10im, x3m26); let t_b13_4 = f64x2_mul(self.twiddle6im, x4m25); let t_b13_5 = f64x2_mul(self.twiddle7im, x5m24); let t_b13_6 = f64x2_mul(self.twiddle9im, x6m23); let t_b13_7 = f64x2_mul(self.twiddle4im, x7m22); let t_b13_8 = f64x2_mul(self.twiddle12im, x8m21); let t_b13_9 = f64x2_mul(self.twiddle1im, x9m20); let t_b13_10 = f64x2_mul(self.twiddle14im, x10m19); let t_b13_11 = f64x2_mul(self.twiddle2im, x11m18); let t_b13_12 = f64x2_mul(self.twiddle11im, x12m17); let t_b13_13 = f64x2_mul(self.twiddle5im, x13m16); let t_b13_14 = f64x2_mul(self.twiddle8im, x14m15); let t_b14_1 = f64x2_mul(self.twiddle14im, x1m28); let t_b14_2 = f64x2_mul(self.twiddle1im, x2m27); let t_b14_3 = f64x2_mul(self.twiddle13im, x3m26); let t_b14_4 = f64x2_mul(self.twiddle2im, x4m25); let t_b14_5 = f64x2_mul(self.twiddle12im, x5m24); let t_b14_6 = f64x2_mul(self.twiddle3im, x6m23); let t_b14_7 = f64x2_mul(self.twiddle11im, x7m22); let t_b14_8 = f64x2_mul(self.twiddle4im, x8m21); let t_b14_9 = f64x2_mul(self.twiddle10im, x9m20); let t_b14_10 = f64x2_mul(self.twiddle5im, x10m19); let t_b14_11 = f64x2_mul(self.twiddle9im, x11m18); let t_b14_12 = f64x2_mul(self.twiddle6im, x12m17); let t_b14_13 = f64x2_mul(self.twiddle8im, x13m16); let t_b14_14 = f64x2_mul(self.twiddle7im, x14m15); let x0 = values[0]; let t_a1 = calc_f64!( x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14 ); let t_a2 = calc_f64!( x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14 ); let t_a3 = calc_f64!( x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + 
t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14 ); let t_a4 = calc_f64!( x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14 ); let t_a5 = calc_f64!( x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14 ); let t_a6 = calc_f64!( x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14 ); let t_a7 = calc_f64!( x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14 ); let t_a8 = calc_f64!( x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14 ); let t_a9 = calc_f64!( x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14 ); let t_a10 = calc_f64!( x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14 ); let t_a11 = calc_f64!( x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14 ); let t_a12 = calc_f64!( x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14 ); let t_a13 = calc_f64!( x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14 ); let t_a14 = calc_f64!( x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14 ); let t_b1 = calc_f64!( t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14 ); let t_b2 = calc_f64!( t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14 ); let t_b3 = calc_f64!( t_b3_1 + t_b3_2 + t_b3_3 + t_b3_4 - t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 + t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14 ); let t_b4 = calc_f64!( t_b4_1 + t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + t_b4_9 + t_b4_10 - t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14 ); let t_b5 = calc_f64!( t_b5_1 + t_b5_2 - t_b5_3 - t_b5_4 - t_b5_5 + t_b5_6 + t_b5_7 + t_b5_8 - t_b5_9 - t_b5_10 - t_b5_11 + t_b5_12 + t_b5_13 + t_b5_14 ); let t_b6 = calc_f64!( t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 + t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 + t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14 ); let t_b7 = calc_f64!( t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 - t_b7_11 - t_b7_12 + t_b7_13 + t_b7_14 ); let t_b8 = calc_f64!( t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 + t_b8_11 + t_b8_12 - t_b8_13 - t_b8_14 ); let t_b9 = calc_f64!( t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 - t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 + t_b9_10 + t_b9_11 - t_b9_12 + t_b9_13 + t_b9_14 ); let t_b10 = calc_f64!( t_b10_1 - t_b10_2 + t_b10_3 + t_b10_4 - t_b10_5 + t_b10_6 + t_b10_7 - t_b10_8 + t_b10_9 + t_b10_10 - t_b10_11 + t_b10_12 + t_b10_13 - t_b10_14 ); let t_b11 = calc_f64!( t_b11_1 - t_b11_2 + t_b11_3 - t_b11_4 - t_b11_5 + t_b11_6 - 
t_b11_7 + t_b11_8 + t_b11_9 - t_b11_10 + t_b11_11 - t_b11_12 - t_b11_13 + t_b11_14 ); let t_b12 = calc_f64!( t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 + t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 - t_b12_9 + t_b12_10 - t_b12_11 - t_b12_12 + t_b12_13 - t_b12_14 ); let t_b13 = calc_f64!( t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 + t_b13_7 - t_b13_8 + t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 - t_b13_13 + t_b13_14 ); let t_b14 = calc_f64!( t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 + t_b14_11 - t_b14_12 + t_b14_13 - t_b14_14 ); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let t_b10_rot = self.rotate.rotate(t_b10); let t_b11_rot = self.rotate.rotate(t_b11); let t_b12_rot = self.rotate.rotate(t_b12); let t_b13_rot = self.rotate.rotate(t_b13); let t_b14_rot = self.rotate.rotate(t_b14); let y0 = calc_f64!( x0 + x1p28 + x2p27 + x3p26 + x4p25 + x5p24 + x6p23 + x7p22 + x8p21 + x9p20 + x10p19 + x11p18 + x12p17 + x13p16 + x14p15 ); let [y1, y28] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y27] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y26] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y25] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y24] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y23] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y22] = solo_fft2_f64(t_a7, t_b7_rot); let [y8, y21] = solo_fft2_f64(t_a8, t_b8_rot); let [y9, y20] = solo_fft2_f64(t_a9, t_b9_rot); let [y10, y19] = solo_fft2_f64(t_a10, t_b10_rot); let [y11, y18] = solo_fft2_f64(t_a11, t_b11_rot); let [y12, y17] = solo_fft2_f64(t_a12, t_b12_rot); let [y13, y16] = solo_fft2_f64(t_a13, t_b13_rot); let [y14, y15] = solo_fft2_f64(t_a14, t_b14_rot); [ y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, y23, y24, y25, y26, y27, y28, ] } }

//
// 31 - 32bit
//

pub struct WasmSimdF32Butterfly31<T> { direction: FftDirection, _phantom: std::marker::PhantomData<T>, rotate: Rotate90F32, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, twiddle6re: v128, twiddle6im: v128, twiddle7re: v128, twiddle7im: v128, twiddle8re: v128, twiddle8im: v128, twiddle9re: v128, twiddle9im: v128, twiddle10re: v128, twiddle10im: v128, twiddle11re: v128, twiddle11im: v128, twiddle12re: v128, twiddle12im: v128, twiddle13re: v128, twiddle13im: v128, twiddle14re: v128, twiddle14im: v128, twiddle15re: v128, twiddle15im: v128, }

boilerplate_fft_wasm_simd_f32_butterfly!( WasmSimdF32Butterfly31, 31, |this: &WasmSimdF32Butterfly31<_>| this.direction );
boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF32Butterfly31, 31, |this: &WasmSimdF32Butterfly31<_>| this.direction );
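// Like the 29-point butterflies above, the constructor below precomputes the
// twiddle factors for k = 1..=15 of the size-31 DFT (conjugated or not
// depending on `direction`) and splats their real and imaginary parts across
// all four f32 lanes, so the parallel code path can run two interleaved
// size-31 FFTs at once. In normal use these butterflies are reached through
// the planner (e.g. `FftPlanner` with the `wasm_simd` feature enabled) rather
// than constructed directly.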
impl<T: FftNum> WasmSimdF32Butterfly31<T> { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f32::<T>(); let rotate = Rotate90F32::new(true); let tw1: Complex<f32> = twiddles::compute_twiddle(1, 31, direction); let tw2: Complex<f32> = twiddles::compute_twiddle(2, 31, direction); let tw3: Complex<f32> = twiddles::compute_twiddle(3, 31, direction); let tw4: Complex<f32> = twiddles::compute_twiddle(4, 31, direction); let tw5: Complex<f32> = twiddles::compute_twiddle(5, 31, direction); let tw6: Complex<f32> = twiddles::compute_twiddle(6, 31, direction); let tw7: Complex<f32> = twiddles::compute_twiddle(7, 31, direction); let tw8: Complex<f32> = twiddles::compute_twiddle(8, 31, direction); let tw9: Complex<f32> = twiddles::compute_twiddle(9, 31, direction); let tw10: Complex<f32> = twiddles::compute_twiddle(10, 31, direction); let tw11: Complex<f32> = twiddles::compute_twiddle(11, 31, direction); let tw12: Complex<f32> = twiddles::compute_twiddle(12, 31, direction); let tw13: Complex<f32> = twiddles::compute_twiddle(13, 31, direction); let tw14: Complex<f32> = twiddles::compute_twiddle(14, 31, direction); let tw15: Complex<f32> = twiddles::compute_twiddle(15, 31, direction); let twiddle1re = f32x4_splat(tw1.re); let twiddle1im = f32x4_splat(tw1.im); let twiddle2re = f32x4_splat(tw2.re); let twiddle2im = f32x4_splat(tw2.im); let twiddle3re = f32x4_splat(tw3.re); let twiddle3im = f32x4_splat(tw3.im); let twiddle4re = f32x4_splat(tw4.re); let twiddle4im = f32x4_splat(tw4.im); let twiddle5re = f32x4_splat(tw5.re); let twiddle5im = f32x4_splat(tw5.im); let twiddle6re = f32x4_splat(tw6.re); let twiddle6im = f32x4_splat(tw6.im); let twiddle7re = f32x4_splat(tw7.re); let twiddle7im = f32x4_splat(tw7.im); let twiddle8re = f32x4_splat(tw8.re); let twiddle8im = f32x4_splat(tw8.im); let twiddle9re = f32x4_splat(tw9.re); let twiddle9im = f32x4_splat(tw9.im); let twiddle10re = f32x4_splat(tw10.re); let twiddle10im = f32x4_splat(tw10.im); let twiddle11re = f32x4_splat(tw11.re); let twiddle11im = f32x4_splat(tw11.im); let twiddle12re = f32x4_splat(tw12.re); let twiddle12im = f32x4_splat(tw12.im); let twiddle13re = f32x4_splat(tw13.re); let twiddle13im = f32x4_splat(tw13.im); let twiddle14re = f32x4_splat(tw14.re); let twiddle14im = f32x4_splat(tw14.im); let twiddle15re = f32x4_splat(tw15.re); let twiddle15im = f32x4_splat(tw15.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, twiddle12re, twiddle12im, twiddle13re, twiddle13im, twiddle14re, twiddle14im, twiddle15re, twiddle15im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut<f32>) { let values = read_partial1_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}); let out = self.perform_parallel_fft_direct(values); write_partial_lo_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_contiguous( &self, mut buffer: impl WasmSimdArrayMut<f32>, ) { let input_packed = read_complex_to_array!(buffer, {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60}); let values = [ extract_lo_hi_f32(input_packed[0], input_packed[15]), extract_hi_lo_f32(input_packed[0], input_packed[16]), extract_lo_hi_f32(input_packed[1], input_packed[16]), extract_hi_lo_f32(input_packed[1], input_packed[17]), extract_lo_hi_f32(input_packed[2], input_packed[17]), extract_hi_lo_f32(input_packed[2], input_packed[18]),
extract_lo_hi_f32(input_packed[3], input_packed[18]), extract_hi_lo_f32(input_packed[3], input_packed[19]), extract_lo_hi_f32(input_packed[4], input_packed[19]), extract_hi_lo_f32(input_packed[4], input_packed[20]), extract_lo_hi_f32(input_packed[5], input_packed[20]), extract_hi_lo_f32(input_packed[5], input_packed[21]), extract_lo_hi_f32(input_packed[6], input_packed[21]), extract_hi_lo_f32(input_packed[6], input_packed[22]), extract_lo_hi_f32(input_packed[7], input_packed[22]), extract_hi_lo_f32(input_packed[7], input_packed[23]), extract_lo_hi_f32(input_packed[8], input_packed[23]), extract_hi_lo_f32(input_packed[8], input_packed[24]), extract_lo_hi_f32(input_packed[9], input_packed[24]), extract_hi_lo_f32(input_packed[9], input_packed[25]), extract_lo_hi_f32(input_packed[10], input_packed[25]), extract_hi_lo_f32(input_packed[10], input_packed[26]), extract_lo_hi_f32(input_packed[11], input_packed[26]), extract_hi_lo_f32(input_packed[11], input_packed[27]), extract_lo_hi_f32(input_packed[12], input_packed[27]), extract_hi_lo_f32(input_packed[12], input_packed[28]), extract_lo_hi_f32(input_packed[13], input_packed[28]), extract_hi_lo_f32(input_packed[13], input_packed[29]), extract_lo_hi_f32(input_packed[14], input_packed[29]), extract_hi_lo_f32(input_packed[14], input_packed[30]), extract_lo_hi_f32(input_packed[15], input_packed[30]), ]; let out = self.perform_parallel_fft_direct(values); let out_packed = [ extract_lo_lo_f32(out[0], out[1]), extract_lo_lo_f32(out[2], out[3]), extract_lo_lo_f32(out[4], out[5]), extract_lo_lo_f32(out[6], out[7]), extract_lo_lo_f32(out[8], out[9]), extract_lo_lo_f32(out[10], out[11]), extract_lo_lo_f32(out[12], out[13]), extract_lo_lo_f32(out[14], out[15]), extract_lo_lo_f32(out[16], out[17]), extract_lo_lo_f32(out[18], out[19]), extract_lo_lo_f32(out[20], out[21]), extract_lo_lo_f32(out[22], out[23]), extract_lo_lo_f32(out[24], out[25]), extract_lo_lo_f32(out[26], out[27]), extract_lo_lo_f32(out[28], out[29]), extract_lo_hi_f32(out[30], out[0]), extract_hi_hi_f32(out[1], out[2]), extract_hi_hi_f32(out[3], out[4]), extract_hi_hi_f32(out[5], out[6]), extract_hi_hi_f32(out[7], out[8]), extract_hi_hi_f32(out[9], out[10]), extract_hi_hi_f32(out[11], out[12]), extract_hi_hi_f32(out[13], out[14]), extract_hi_hi_f32(out[15], out[16]), extract_hi_hi_f32(out[17], out[18]), extract_hi_hi_f32(out[19], out[20]), extract_hi_hi_f32(out[21], out[22]), extract_hi_hi_f32(out[23], out[24]), extract_hi_hi_f32(out[25], out[26]), extract_hi_hi_f32(out[27], out[28]), extract_hi_hi_f32(out[29], out[30]), ]; write_complex_to_array_strided!(out_packed, buffer, 2, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}); } #[inline(always)] pub(crate) unsafe fn perform_parallel_fft_direct(&self, values: [v128; 31]) -> [v128; 31] { let [x1p30, x1m30] = parallel_fft2_interleaved_f32(values[1], values[30]); let [x2p29, x2m29] = parallel_fft2_interleaved_f32(values[2], values[29]); let [x3p28, x3m28] = parallel_fft2_interleaved_f32(values[3], values[28]); let [x4p27, x4m27] = parallel_fft2_interleaved_f32(values[4], values[27]); let [x5p26, x5m26] = parallel_fft2_interleaved_f32(values[5], values[26]); let [x6p25, x6m25] = parallel_fft2_interleaved_f32(values[6], values[25]); let [x7p24, x7m24] = parallel_fft2_interleaved_f32(values[7], values[24]); let [x8p23, x8m23] = parallel_fft2_interleaved_f32(values[8], values[23]); let [x9p22, x9m22] = parallel_fft2_interleaved_f32(values[9], values[22]); let [x10p21, x10m21] = 
parallel_fft2_interleaved_f32(values[10], values[21]); let [x11p20, x11m20] = parallel_fft2_interleaved_f32(values[11], values[20]); let [x12p19, x12m19] = parallel_fft2_interleaved_f32(values[12], values[19]); let [x13p18, x13m18] = parallel_fft2_interleaved_f32(values[13], values[18]); let [x14p17, x14m17] = parallel_fft2_interleaved_f32(values[14], values[17]); let [x15p16, x15m16] = parallel_fft2_interleaved_f32(values[15], values[16]); let t_a1_1 = f32x4_mul(self.twiddle1re, x1p30); let t_a1_2 = f32x4_mul(self.twiddle2re, x2p29); let t_a1_3 = f32x4_mul(self.twiddle3re, x3p28); let t_a1_4 = f32x4_mul(self.twiddle4re, x4p27); let t_a1_5 = f32x4_mul(self.twiddle5re, x5p26); let t_a1_6 = f32x4_mul(self.twiddle6re, x6p25); let t_a1_7 = f32x4_mul(self.twiddle7re, x7p24); let t_a1_8 = f32x4_mul(self.twiddle8re, x8p23); let t_a1_9 = f32x4_mul(self.twiddle9re, x9p22); let t_a1_10 = f32x4_mul(self.twiddle10re, x10p21); let t_a1_11 = f32x4_mul(self.twiddle11re, x11p20); let t_a1_12 = f32x4_mul(self.twiddle12re, x12p19); let t_a1_13 = f32x4_mul(self.twiddle13re, x13p18); let t_a1_14 = f32x4_mul(self.twiddle14re, x14p17); let t_a1_15 = f32x4_mul(self.twiddle15re, x15p16); let t_a2_1 = f32x4_mul(self.twiddle2re, x1p30); let t_a2_2 = f32x4_mul(self.twiddle4re, x2p29); let t_a2_3 = f32x4_mul(self.twiddle6re, x3p28); let t_a2_4 = f32x4_mul(self.twiddle8re, x4p27); let t_a2_5 = f32x4_mul(self.twiddle10re, x5p26); let t_a2_6 = f32x4_mul(self.twiddle12re, x6p25); let t_a2_7 = f32x4_mul(self.twiddle14re, x7p24); let t_a2_8 = f32x4_mul(self.twiddle15re, x8p23); let t_a2_9 = f32x4_mul(self.twiddle13re, x9p22); let t_a2_10 = f32x4_mul(self.twiddle11re, x10p21); let t_a2_11 = f32x4_mul(self.twiddle9re, x11p20); let t_a2_12 = f32x4_mul(self.twiddle7re, x12p19); let t_a2_13 = f32x4_mul(self.twiddle5re, x13p18); let t_a2_14 = f32x4_mul(self.twiddle3re, x14p17); let t_a2_15 = f32x4_mul(self.twiddle1re, x15p16); let t_a3_1 = f32x4_mul(self.twiddle3re, x1p30); let t_a3_2 = f32x4_mul(self.twiddle6re, x2p29); let t_a3_3 = f32x4_mul(self.twiddle9re, x3p28); let t_a3_4 = f32x4_mul(self.twiddle12re, x4p27); let t_a3_5 = f32x4_mul(self.twiddle15re, x5p26); let t_a3_6 = f32x4_mul(self.twiddle13re, x6p25); let t_a3_7 = f32x4_mul(self.twiddle10re, x7p24); let t_a3_8 = f32x4_mul(self.twiddle7re, x8p23); let t_a3_9 = f32x4_mul(self.twiddle4re, x9p22); let t_a3_10 = f32x4_mul(self.twiddle1re, x10p21); let t_a3_11 = f32x4_mul(self.twiddle2re, x11p20); let t_a3_12 = f32x4_mul(self.twiddle5re, x12p19); let t_a3_13 = f32x4_mul(self.twiddle8re, x13p18); let t_a3_14 = f32x4_mul(self.twiddle11re, x14p17); let t_a3_15 = f32x4_mul(self.twiddle14re, x15p16); let t_a4_1 = f32x4_mul(self.twiddle4re, x1p30); let t_a4_2 = f32x4_mul(self.twiddle8re, x2p29); let t_a4_3 = f32x4_mul(self.twiddle12re, x3p28); let t_a4_4 = f32x4_mul(self.twiddle15re, x4p27); let t_a4_5 = f32x4_mul(self.twiddle11re, x5p26); let t_a4_6 = f32x4_mul(self.twiddle7re, x6p25); let t_a4_7 = f32x4_mul(self.twiddle3re, x7p24); let t_a4_8 = f32x4_mul(self.twiddle1re, x8p23); let t_a4_9 = f32x4_mul(self.twiddle5re, x9p22); let t_a4_10 = f32x4_mul(self.twiddle9re, x10p21); let t_a4_11 = f32x4_mul(self.twiddle13re, x11p20); let t_a4_12 = f32x4_mul(self.twiddle14re, x12p19); let t_a4_13 = f32x4_mul(self.twiddle10re, x13p18); let t_a4_14 = f32x4_mul(self.twiddle6re, x14p17); let t_a4_15 = f32x4_mul(self.twiddle2re, x15p16); let t_a5_1 = f32x4_mul(self.twiddle5re, x1p30); let t_a5_2 = f32x4_mul(self.twiddle10re, x2p29); let t_a5_3 = f32x4_mul(self.twiddle15re, x3p28); let t_a5_4 
= f32x4_mul(self.twiddle11re, x4p27); let t_a5_5 = f32x4_mul(self.twiddle6re, x5p26); let t_a5_6 = f32x4_mul(self.twiddle1re, x6p25); let t_a5_7 = f32x4_mul(self.twiddle4re, x7p24); let t_a5_8 = f32x4_mul(self.twiddle9re, x8p23); let t_a5_9 = f32x4_mul(self.twiddle14re, x9p22); let t_a5_10 = f32x4_mul(self.twiddle12re, x10p21); let t_a5_11 = f32x4_mul(self.twiddle7re, x11p20); let t_a5_12 = f32x4_mul(self.twiddle2re, x12p19); let t_a5_13 = f32x4_mul(self.twiddle3re, x13p18); let t_a5_14 = f32x4_mul(self.twiddle8re, x14p17); let t_a5_15 = f32x4_mul(self.twiddle13re, x15p16); let t_a6_1 = f32x4_mul(self.twiddle6re, x1p30); let t_a6_2 = f32x4_mul(self.twiddle12re, x2p29); let t_a6_3 = f32x4_mul(self.twiddle13re, x3p28); let t_a6_4 = f32x4_mul(self.twiddle7re, x4p27); let t_a6_5 = f32x4_mul(self.twiddle1re, x5p26); let t_a6_6 = f32x4_mul(self.twiddle5re, x6p25); let t_a6_7 = f32x4_mul(self.twiddle11re, x7p24); let t_a6_8 = f32x4_mul(self.twiddle14re, x8p23); let t_a6_9 = f32x4_mul(self.twiddle8re, x9p22); let t_a6_10 = f32x4_mul(self.twiddle2re, x10p21); let t_a6_11 = f32x4_mul(self.twiddle4re, x11p20); let t_a6_12 = f32x4_mul(self.twiddle10re, x12p19); let t_a6_13 = f32x4_mul(self.twiddle15re, x13p18); let t_a6_14 = f32x4_mul(self.twiddle9re, x14p17); let t_a6_15 = f32x4_mul(self.twiddle3re, x15p16); let t_a7_1 = f32x4_mul(self.twiddle7re, x1p30); let t_a7_2 = f32x4_mul(self.twiddle14re, x2p29); let t_a7_3 = f32x4_mul(self.twiddle10re, x3p28); let t_a7_4 = f32x4_mul(self.twiddle3re, x4p27); let t_a7_5 = f32x4_mul(self.twiddle4re, x5p26); let t_a7_6 = f32x4_mul(self.twiddle11re, x6p25); let t_a7_7 = f32x4_mul(self.twiddle13re, x7p24); let t_a7_8 = f32x4_mul(self.twiddle6re, x8p23); let t_a7_9 = f32x4_mul(self.twiddle1re, x9p22); let t_a7_10 = f32x4_mul(self.twiddle8re, x10p21); let t_a7_11 = f32x4_mul(self.twiddle15re, x11p20); let t_a7_12 = f32x4_mul(self.twiddle9re, x12p19); let t_a7_13 = f32x4_mul(self.twiddle2re, x13p18); let t_a7_14 = f32x4_mul(self.twiddle5re, x14p17); let t_a7_15 = f32x4_mul(self.twiddle12re, x15p16); let t_a8_1 = f32x4_mul(self.twiddle8re, x1p30); let t_a8_2 = f32x4_mul(self.twiddle15re, x2p29); let t_a8_3 = f32x4_mul(self.twiddle7re, x3p28); let t_a8_4 = f32x4_mul(self.twiddle1re, x4p27); let t_a8_5 = f32x4_mul(self.twiddle9re, x5p26); let t_a8_6 = f32x4_mul(self.twiddle14re, x6p25); let t_a8_7 = f32x4_mul(self.twiddle6re, x7p24); let t_a8_8 = f32x4_mul(self.twiddle2re, x8p23); let t_a8_9 = f32x4_mul(self.twiddle10re, x9p22); let t_a8_10 = f32x4_mul(self.twiddle13re, x10p21); let t_a8_11 = f32x4_mul(self.twiddle5re, x11p20); let t_a8_12 = f32x4_mul(self.twiddle3re, x12p19); let t_a8_13 = f32x4_mul(self.twiddle11re, x13p18); let t_a8_14 = f32x4_mul(self.twiddle12re, x14p17); let t_a8_15 = f32x4_mul(self.twiddle4re, x15p16); let t_a9_1 = f32x4_mul(self.twiddle9re, x1p30); let t_a9_2 = f32x4_mul(self.twiddle13re, x2p29); let t_a9_3 = f32x4_mul(self.twiddle4re, x3p28); let t_a9_4 = f32x4_mul(self.twiddle5re, x4p27); let t_a9_5 = f32x4_mul(self.twiddle14re, x5p26); let t_a9_6 = f32x4_mul(self.twiddle8re, x6p25); let t_a9_7 = f32x4_mul(self.twiddle1re, x7p24); let t_a9_8 = f32x4_mul(self.twiddle10re, x8p23); let t_a9_9 = f32x4_mul(self.twiddle12re, x9p22); let t_a9_10 = f32x4_mul(self.twiddle3re, x10p21); let t_a9_11 = f32x4_mul(self.twiddle6re, x11p20); let t_a9_12 = f32x4_mul(self.twiddle15re, x12p19); let t_a9_13 = f32x4_mul(self.twiddle7re, x13p18); let t_a9_14 = f32x4_mul(self.twiddle2re, x14p17); let t_a9_15 = f32x4_mul(self.twiddle11re, x15p16); let t_a10_1 = 
f32x4_mul(self.twiddle10re, x1p30); let t_a10_2 = f32x4_mul(self.twiddle11re, x2p29); let t_a10_3 = f32x4_mul(self.twiddle1re, x3p28); let t_a10_4 = f32x4_mul(self.twiddle9re, x4p27); let t_a10_5 = f32x4_mul(self.twiddle12re, x5p26); let t_a10_6 = f32x4_mul(self.twiddle2re, x6p25); let t_a10_7 = f32x4_mul(self.twiddle8re, x7p24); let t_a10_8 = f32x4_mul(self.twiddle13re, x8p23); let t_a10_9 = f32x4_mul(self.twiddle3re, x9p22); let t_a10_10 = f32x4_mul(self.twiddle7re, x10p21); let t_a10_11 = f32x4_mul(self.twiddle14re, x11p20); let t_a10_12 = f32x4_mul(self.twiddle4re, x12p19); let t_a10_13 = f32x4_mul(self.twiddle6re, x13p18); let t_a10_14 = f32x4_mul(self.twiddle15re, x14p17); let t_a10_15 = f32x4_mul(self.twiddle5re, x15p16); let t_a11_1 = f32x4_mul(self.twiddle11re, x1p30); let t_a11_2 = f32x4_mul(self.twiddle9re, x2p29); let t_a11_3 = f32x4_mul(self.twiddle2re, x3p28); let t_a11_4 = f32x4_mul(self.twiddle13re, x4p27); let t_a11_5 = f32x4_mul(self.twiddle7re, x5p26); let t_a11_6 = f32x4_mul(self.twiddle4re, x6p25); let t_a11_7 = f32x4_mul(self.twiddle15re, x7p24); let t_a11_8 = f32x4_mul(self.twiddle5re, x8p23); let t_a11_9 = f32x4_mul(self.twiddle6re, x9p22); let t_a11_10 = f32x4_mul(self.twiddle14re, x10p21); let t_a11_11 = f32x4_mul(self.twiddle3re, x11p20); let t_a11_12 = f32x4_mul(self.twiddle8re, x12p19); let t_a11_13 = f32x4_mul(self.twiddle12re, x13p18); let t_a11_14 = f32x4_mul(self.twiddle1re, x14p17); let t_a11_15 = f32x4_mul(self.twiddle10re, x15p16); let t_a12_1 = f32x4_mul(self.twiddle12re, x1p30); let t_a12_2 = f32x4_mul(self.twiddle7re, x2p29); let t_a12_3 = f32x4_mul(self.twiddle5re, x3p28); let t_a12_4 = f32x4_mul(self.twiddle14re, x4p27); let t_a12_5 = f32x4_mul(self.twiddle2re, x5p26); let t_a12_6 = f32x4_mul(self.twiddle10re, x6p25); let t_a12_7 = f32x4_mul(self.twiddle9re, x7p24); let t_a12_8 = f32x4_mul(self.twiddle3re, x8p23); let t_a12_9 = f32x4_mul(self.twiddle15re, x9p22); let t_a12_10 = f32x4_mul(self.twiddle4re, x10p21); let t_a12_11 = f32x4_mul(self.twiddle8re, x11p20); let t_a12_12 = f32x4_mul(self.twiddle11re, x12p19); let t_a12_13 = f32x4_mul(self.twiddle1re, x13p18); let t_a12_14 = f32x4_mul(self.twiddle13re, x14p17); let t_a12_15 = f32x4_mul(self.twiddle6re, x15p16); let t_a13_1 = f32x4_mul(self.twiddle13re, x1p30); let t_a13_2 = f32x4_mul(self.twiddle5re, x2p29); let t_a13_3 = f32x4_mul(self.twiddle8re, x3p28); let t_a13_4 = f32x4_mul(self.twiddle10re, x4p27); let t_a13_5 = f32x4_mul(self.twiddle3re, x5p26); let t_a13_6 = f32x4_mul(self.twiddle15re, x6p25); let t_a13_7 = f32x4_mul(self.twiddle2re, x7p24); let t_a13_8 = f32x4_mul(self.twiddle11re, x8p23); let t_a13_9 = f32x4_mul(self.twiddle7re, x9p22); let t_a13_10 = f32x4_mul(self.twiddle6re, x10p21); let t_a13_11 = f32x4_mul(self.twiddle12re, x11p20); let t_a13_12 = f32x4_mul(self.twiddle1re, x12p19); let t_a13_13 = f32x4_mul(self.twiddle14re, x13p18); let t_a13_14 = f32x4_mul(self.twiddle4re, x14p17); let t_a13_15 = f32x4_mul(self.twiddle9re, x15p16); let t_a14_1 = f32x4_mul(self.twiddle14re, x1p30); let t_a14_2 = f32x4_mul(self.twiddle3re, x2p29); let t_a14_3 = f32x4_mul(self.twiddle11re, x3p28); let t_a14_4 = f32x4_mul(self.twiddle6re, x4p27); let t_a14_5 = f32x4_mul(self.twiddle8re, x5p26); let t_a14_6 = f32x4_mul(self.twiddle9re, x6p25); let t_a14_7 = f32x4_mul(self.twiddle5re, x7p24); let t_a14_8 = f32x4_mul(self.twiddle12re, x8p23); let t_a14_9 = f32x4_mul(self.twiddle2re, x9p22); let t_a14_10 = f32x4_mul(self.twiddle15re, x10p21); let t_a14_11 = f32x4_mul(self.twiddle1re, x11p20); let 
t_a14_12 = f32x4_mul(self.twiddle13re, x12p19); let t_a14_13 = f32x4_mul(self.twiddle4re, x13p18); let t_a14_14 = f32x4_mul(self.twiddle10re, x14p17); let t_a14_15 = f32x4_mul(self.twiddle7re, x15p16); let t_a15_1 = f32x4_mul(self.twiddle15re, x1p30); let t_a15_2 = f32x4_mul(self.twiddle1re, x2p29); let t_a15_3 = f32x4_mul(self.twiddle14re, x3p28); let t_a15_4 = f32x4_mul(self.twiddle2re, x4p27); let t_a15_5 = f32x4_mul(self.twiddle13re, x5p26); let t_a15_6 = f32x4_mul(self.twiddle3re, x6p25); let t_a15_7 = f32x4_mul(self.twiddle12re, x7p24); let t_a15_8 = f32x4_mul(self.twiddle4re, x8p23); let t_a15_9 = f32x4_mul(self.twiddle11re, x9p22); let t_a15_10 = f32x4_mul(self.twiddle5re, x10p21); let t_a15_11 = f32x4_mul(self.twiddle10re, x11p20); let t_a15_12 = f32x4_mul(self.twiddle6re, x12p19); let t_a15_13 = f32x4_mul(self.twiddle9re, x13p18); let t_a15_14 = f32x4_mul(self.twiddle7re, x14p17); let t_a15_15 = f32x4_mul(self.twiddle8re, x15p16); let t_b1_1 = f32x4_mul(self.twiddle1im, x1m30); let t_b1_2 = f32x4_mul(self.twiddle2im, x2m29); let t_b1_3 = f32x4_mul(self.twiddle3im, x3m28); let t_b1_4 = f32x4_mul(self.twiddle4im, x4m27); let t_b1_5 = f32x4_mul(self.twiddle5im, x5m26); let t_b1_6 = f32x4_mul(self.twiddle6im, x6m25); let t_b1_7 = f32x4_mul(self.twiddle7im, x7m24); let t_b1_8 = f32x4_mul(self.twiddle8im, x8m23); let t_b1_9 = f32x4_mul(self.twiddle9im, x9m22); let t_b1_10 = f32x4_mul(self.twiddle10im, x10m21); let t_b1_11 = f32x4_mul(self.twiddle11im, x11m20); let t_b1_12 = f32x4_mul(self.twiddle12im, x12m19); let t_b1_13 = f32x4_mul(self.twiddle13im, x13m18); let t_b1_14 = f32x4_mul(self.twiddle14im, x14m17); let t_b1_15 = f32x4_mul(self.twiddle15im, x15m16); let t_b2_1 = f32x4_mul(self.twiddle2im, x1m30); let t_b2_2 = f32x4_mul(self.twiddle4im, x2m29); let t_b2_3 = f32x4_mul(self.twiddle6im, x3m28); let t_b2_4 = f32x4_mul(self.twiddle8im, x4m27); let t_b2_5 = f32x4_mul(self.twiddle10im, x5m26); let t_b2_6 = f32x4_mul(self.twiddle12im, x6m25); let t_b2_7 = f32x4_mul(self.twiddle14im, x7m24); let t_b2_8 = f32x4_mul(self.twiddle15im, x8m23); let t_b2_9 = f32x4_mul(self.twiddle13im, x9m22); let t_b2_10 = f32x4_mul(self.twiddle11im, x10m21); let t_b2_11 = f32x4_mul(self.twiddle9im, x11m20); let t_b2_12 = f32x4_mul(self.twiddle7im, x12m19); let t_b2_13 = f32x4_mul(self.twiddle5im, x13m18); let t_b2_14 = f32x4_mul(self.twiddle3im, x14m17); let t_b2_15 = f32x4_mul(self.twiddle1im, x15m16); let t_b3_1 = f32x4_mul(self.twiddle3im, x1m30); let t_b3_2 = f32x4_mul(self.twiddle6im, x2m29); let t_b3_3 = f32x4_mul(self.twiddle9im, x3m28); let t_b3_4 = f32x4_mul(self.twiddle12im, x4m27); let t_b3_5 = f32x4_mul(self.twiddle15im, x5m26); let t_b3_6 = f32x4_mul(self.twiddle13im, x6m25); let t_b3_7 = f32x4_mul(self.twiddle10im, x7m24); let t_b3_8 = f32x4_mul(self.twiddle7im, x8m23); let t_b3_9 = f32x4_mul(self.twiddle4im, x9m22); let t_b3_10 = f32x4_mul(self.twiddle1im, x10m21); let t_b3_11 = f32x4_mul(self.twiddle2im, x11m20); let t_b3_12 = f32x4_mul(self.twiddle5im, x12m19); let t_b3_13 = f32x4_mul(self.twiddle8im, x13m18); let t_b3_14 = f32x4_mul(self.twiddle11im, x14m17); let t_b3_15 = f32x4_mul(self.twiddle14im, x15m16); let t_b4_1 = f32x4_mul(self.twiddle4im, x1m30); let t_b4_2 = f32x4_mul(self.twiddle8im, x2m29); let t_b4_3 = f32x4_mul(self.twiddle12im, x3m28); let t_b4_4 = f32x4_mul(self.twiddle15im, x4m27); let t_b4_5 = f32x4_mul(self.twiddle11im, x5m26); let t_b4_6 = f32x4_mul(self.twiddle7im, x6m25); let t_b4_7 = f32x4_mul(self.twiddle3im, x7m24); let t_b4_8 = f32x4_mul(self.twiddle1im, 
x8m23); let t_b4_9 = f32x4_mul(self.twiddle5im, x9m22); let t_b4_10 = f32x4_mul(self.twiddle9im, x10m21); let t_b4_11 = f32x4_mul(self.twiddle13im, x11m20); let t_b4_12 = f32x4_mul(self.twiddle14im, x12m19); let t_b4_13 = f32x4_mul(self.twiddle10im, x13m18); let t_b4_14 = f32x4_mul(self.twiddle6im, x14m17); let t_b4_15 = f32x4_mul(self.twiddle2im, x15m16); let t_b5_1 = f32x4_mul(self.twiddle5im, x1m30); let t_b5_2 = f32x4_mul(self.twiddle10im, x2m29); let t_b5_3 = f32x4_mul(self.twiddle15im, x3m28); let t_b5_4 = f32x4_mul(self.twiddle11im, x4m27); let t_b5_5 = f32x4_mul(self.twiddle6im, x5m26); let t_b5_6 = f32x4_mul(self.twiddle1im, x6m25); let t_b5_7 = f32x4_mul(self.twiddle4im, x7m24); let t_b5_8 = f32x4_mul(self.twiddle9im, x8m23); let t_b5_9 = f32x4_mul(self.twiddle14im, x9m22); let t_b5_10 = f32x4_mul(self.twiddle12im, x10m21); let t_b5_11 = f32x4_mul(self.twiddle7im, x11m20); let t_b5_12 = f32x4_mul(self.twiddle2im, x12m19); let t_b5_13 = f32x4_mul(self.twiddle3im, x13m18); let t_b5_14 = f32x4_mul(self.twiddle8im, x14m17); let t_b5_15 = f32x4_mul(self.twiddle13im, x15m16); let t_b6_1 = f32x4_mul(self.twiddle6im, x1m30); let t_b6_2 = f32x4_mul(self.twiddle12im, x2m29); let t_b6_3 = f32x4_mul(self.twiddle13im, x3m28); let t_b6_4 = f32x4_mul(self.twiddle7im, x4m27); let t_b6_5 = f32x4_mul(self.twiddle1im, x5m26); let t_b6_6 = f32x4_mul(self.twiddle5im, x6m25); let t_b6_7 = f32x4_mul(self.twiddle11im, x7m24); let t_b6_8 = f32x4_mul(self.twiddle14im, x8m23); let t_b6_9 = f32x4_mul(self.twiddle8im, x9m22); let t_b6_10 = f32x4_mul(self.twiddle2im, x10m21); let t_b6_11 = f32x4_mul(self.twiddle4im, x11m20); let t_b6_12 = f32x4_mul(self.twiddle10im, x12m19); let t_b6_13 = f32x4_mul(self.twiddle15im, x13m18); let t_b6_14 = f32x4_mul(self.twiddle9im, x14m17); let t_b6_15 = f32x4_mul(self.twiddle3im, x15m16); let t_b7_1 = f32x4_mul(self.twiddle7im, x1m30); let t_b7_2 = f32x4_mul(self.twiddle14im, x2m29); let t_b7_3 = f32x4_mul(self.twiddle10im, x3m28); let t_b7_4 = f32x4_mul(self.twiddle3im, x4m27); let t_b7_5 = f32x4_mul(self.twiddle4im, x5m26); let t_b7_6 = f32x4_mul(self.twiddle11im, x6m25); let t_b7_7 = f32x4_mul(self.twiddle13im, x7m24); let t_b7_8 = f32x4_mul(self.twiddle6im, x8m23); let t_b7_9 = f32x4_mul(self.twiddle1im, x9m22); let t_b7_10 = f32x4_mul(self.twiddle8im, x10m21); let t_b7_11 = f32x4_mul(self.twiddle15im, x11m20); let t_b7_12 = f32x4_mul(self.twiddle9im, x12m19); let t_b7_13 = f32x4_mul(self.twiddle2im, x13m18); let t_b7_14 = f32x4_mul(self.twiddle5im, x14m17); let t_b7_15 = f32x4_mul(self.twiddle12im, x15m16); let t_b8_1 = f32x4_mul(self.twiddle8im, x1m30); let t_b8_2 = f32x4_mul(self.twiddle15im, x2m29); let t_b8_3 = f32x4_mul(self.twiddle7im, x3m28); let t_b8_4 = f32x4_mul(self.twiddle1im, x4m27); let t_b8_5 = f32x4_mul(self.twiddle9im, x5m26); let t_b8_6 = f32x4_mul(self.twiddle14im, x6m25); let t_b8_7 = f32x4_mul(self.twiddle6im, x7m24); let t_b8_8 = f32x4_mul(self.twiddle2im, x8m23); let t_b8_9 = f32x4_mul(self.twiddle10im, x9m22); let t_b8_10 = f32x4_mul(self.twiddle13im, x10m21); let t_b8_11 = f32x4_mul(self.twiddle5im, x11m20); let t_b8_12 = f32x4_mul(self.twiddle3im, x12m19); let t_b8_13 = f32x4_mul(self.twiddle11im, x13m18); let t_b8_14 = f32x4_mul(self.twiddle12im, x14m17); let t_b8_15 = f32x4_mul(self.twiddle4im, x15m16); let t_b9_1 = f32x4_mul(self.twiddle9im, x1m30); let t_b9_2 = f32x4_mul(self.twiddle13im, x2m29); let t_b9_3 = f32x4_mul(self.twiddle4im, x3m28); let t_b9_4 = f32x4_mul(self.twiddle5im, x4m27); let t_b9_5 = f32x4_mul(self.twiddle14im, x5m26); 
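        // A note on the naming scheme in this generated size-31 butterfly: the pairwise
        // butterflies above split every input pair (x[j], x[31-j]) into a sum `xjp(31-j)` and a
        // difference `xjm(31-j)`. Each `t_aK_J` term below is a cosine (`.re`) twiddle times one
        // of those sums, and each `t_bK_J` term is a sine (`.im`) twiddle times one of the
        // differences. Roughly, with w the primitive 31st root of unity selected by the FFT
        // direction, the sums assembled further down compute, for k in 1..=15:
        //   a_k = x[0] + sum over j of Re(w^(j*k)) * (x[j] + x[31-j])
        //   b_k =        sum over j of Im(w^(j*k)) * (x[j] - x[31-j])
        // and y[k], y[31-k] are then recovered from a_k and a 90-degree rotation of b_k.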
let t_b9_6 = f32x4_mul(self.twiddle8im, x6m25); let t_b9_7 = f32x4_mul(self.twiddle1im, x7m24); let t_b9_8 = f32x4_mul(self.twiddle10im, x8m23); let t_b9_9 = f32x4_mul(self.twiddle12im, x9m22); let t_b9_10 = f32x4_mul(self.twiddle3im, x10m21); let t_b9_11 = f32x4_mul(self.twiddle6im, x11m20); let t_b9_12 = f32x4_mul(self.twiddle15im, x12m19); let t_b9_13 = f32x4_mul(self.twiddle7im, x13m18); let t_b9_14 = f32x4_mul(self.twiddle2im, x14m17); let t_b9_15 = f32x4_mul(self.twiddle11im, x15m16); let t_b10_1 = f32x4_mul(self.twiddle10im, x1m30); let t_b10_2 = f32x4_mul(self.twiddle11im, x2m29); let t_b10_3 = f32x4_mul(self.twiddle1im, x3m28); let t_b10_4 = f32x4_mul(self.twiddle9im, x4m27); let t_b10_5 = f32x4_mul(self.twiddle12im, x5m26); let t_b10_6 = f32x4_mul(self.twiddle2im, x6m25); let t_b10_7 = f32x4_mul(self.twiddle8im, x7m24); let t_b10_8 = f32x4_mul(self.twiddle13im, x8m23); let t_b10_9 = f32x4_mul(self.twiddle3im, x9m22); let t_b10_10 = f32x4_mul(self.twiddle7im, x10m21); let t_b10_11 = f32x4_mul(self.twiddle14im, x11m20); let t_b10_12 = f32x4_mul(self.twiddle4im, x12m19); let t_b10_13 = f32x4_mul(self.twiddle6im, x13m18); let t_b10_14 = f32x4_mul(self.twiddle15im, x14m17); let t_b10_15 = f32x4_mul(self.twiddle5im, x15m16); let t_b11_1 = f32x4_mul(self.twiddle11im, x1m30); let t_b11_2 = f32x4_mul(self.twiddle9im, x2m29); let t_b11_3 = f32x4_mul(self.twiddle2im, x3m28); let t_b11_4 = f32x4_mul(self.twiddle13im, x4m27); let t_b11_5 = f32x4_mul(self.twiddle7im, x5m26); let t_b11_6 = f32x4_mul(self.twiddle4im, x6m25); let t_b11_7 = f32x4_mul(self.twiddle15im, x7m24); let t_b11_8 = f32x4_mul(self.twiddle5im, x8m23); let t_b11_9 = f32x4_mul(self.twiddle6im, x9m22); let t_b11_10 = f32x4_mul(self.twiddle14im, x10m21); let t_b11_11 = f32x4_mul(self.twiddle3im, x11m20); let t_b11_12 = f32x4_mul(self.twiddle8im, x12m19); let t_b11_13 = f32x4_mul(self.twiddle12im, x13m18); let t_b11_14 = f32x4_mul(self.twiddle1im, x14m17); let t_b11_15 = f32x4_mul(self.twiddle10im, x15m16); let t_b12_1 = f32x4_mul(self.twiddle12im, x1m30); let t_b12_2 = f32x4_mul(self.twiddle7im, x2m29); let t_b12_3 = f32x4_mul(self.twiddle5im, x3m28); let t_b12_4 = f32x4_mul(self.twiddle14im, x4m27); let t_b12_5 = f32x4_mul(self.twiddle2im, x5m26); let t_b12_6 = f32x4_mul(self.twiddle10im, x6m25); let t_b12_7 = f32x4_mul(self.twiddle9im, x7m24); let t_b12_8 = f32x4_mul(self.twiddle3im, x8m23); let t_b12_9 = f32x4_mul(self.twiddle15im, x9m22); let t_b12_10 = f32x4_mul(self.twiddle4im, x10m21); let t_b12_11 = f32x4_mul(self.twiddle8im, x11m20); let t_b12_12 = f32x4_mul(self.twiddle11im, x12m19); let t_b12_13 = f32x4_mul(self.twiddle1im, x13m18); let t_b12_14 = f32x4_mul(self.twiddle13im, x14m17); let t_b12_15 = f32x4_mul(self.twiddle6im, x15m16); let t_b13_1 = f32x4_mul(self.twiddle13im, x1m30); let t_b13_2 = f32x4_mul(self.twiddle5im, x2m29); let t_b13_3 = f32x4_mul(self.twiddle8im, x3m28); let t_b13_4 = f32x4_mul(self.twiddle10im, x4m27); let t_b13_5 = f32x4_mul(self.twiddle3im, x5m26); let t_b13_6 = f32x4_mul(self.twiddle15im, x6m25); let t_b13_7 = f32x4_mul(self.twiddle2im, x7m24); let t_b13_8 = f32x4_mul(self.twiddle11im, x8m23); let t_b13_9 = f32x4_mul(self.twiddle7im, x9m22); let t_b13_10 = f32x4_mul(self.twiddle6im, x10m21); let t_b13_11 = f32x4_mul(self.twiddle12im, x11m20); let t_b13_12 = f32x4_mul(self.twiddle1im, x12m19); let t_b13_13 = f32x4_mul(self.twiddle14im, x13m18); let t_b13_14 = f32x4_mul(self.twiddle4im, x14m17); let t_b13_15 = f32x4_mul(self.twiddle9im, x15m16); let t_b14_1 = f32x4_mul(self.twiddle14im, 
x1m30); let t_b14_2 = f32x4_mul(self.twiddle3im, x2m29); let t_b14_3 = f32x4_mul(self.twiddle11im, x3m28); let t_b14_4 = f32x4_mul(self.twiddle6im, x4m27); let t_b14_5 = f32x4_mul(self.twiddle8im, x5m26); let t_b14_6 = f32x4_mul(self.twiddle9im, x6m25); let t_b14_7 = f32x4_mul(self.twiddle5im, x7m24); let t_b14_8 = f32x4_mul(self.twiddle12im, x8m23); let t_b14_9 = f32x4_mul(self.twiddle2im, x9m22); let t_b14_10 = f32x4_mul(self.twiddle15im, x10m21); let t_b14_11 = f32x4_mul(self.twiddle1im, x11m20); let t_b14_12 = f32x4_mul(self.twiddle13im, x12m19); let t_b14_13 = f32x4_mul(self.twiddle4im, x13m18); let t_b14_14 = f32x4_mul(self.twiddle10im, x14m17); let t_b14_15 = f32x4_mul(self.twiddle7im, x15m16); let t_b15_1 = f32x4_mul(self.twiddle15im, x1m30); let t_b15_2 = f32x4_mul(self.twiddle1im, x2m29); let t_b15_3 = f32x4_mul(self.twiddle14im, x3m28); let t_b15_4 = f32x4_mul(self.twiddle2im, x4m27); let t_b15_5 = f32x4_mul(self.twiddle13im, x5m26); let t_b15_6 = f32x4_mul(self.twiddle3im, x6m25); let t_b15_7 = f32x4_mul(self.twiddle12im, x7m24); let t_b15_8 = f32x4_mul(self.twiddle4im, x8m23); let t_b15_9 = f32x4_mul(self.twiddle11im, x9m22); let t_b15_10 = f32x4_mul(self.twiddle5im, x10m21); let t_b15_11 = f32x4_mul(self.twiddle10im, x11m20); let t_b15_12 = f32x4_mul(self.twiddle6im, x12m19); let t_b15_13 = f32x4_mul(self.twiddle9im, x13m18); let t_b15_14 = f32x4_mul(self.twiddle7im, x14m17); let t_b15_15 = f32x4_mul(self.twiddle8im, x15m16); let x0 = values[0]; let t_a1 = calc_f32!( x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14 + t_a1_15 ); let t_a2 = calc_f32!( x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14 + t_a2_15 ); let t_a3 = calc_f32!( x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14 + t_a3_15 ); let t_a4 = calc_f32!( x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14 + t_a4_15 ); let t_a5 = calc_f32!( x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14 + t_a5_15 ); let t_a6 = calc_f32!( x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14 + t_a6_15 ); let t_a7 = calc_f32!( x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14 + t_a7_15 ); let t_a8 = calc_f32!( x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14 + t_a8_15 ); let t_a9 = calc_f32!( x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14 + t_a9_15 ); let t_a10 = calc_f32!( x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14 + t_a10_15 ); let t_a11 = calc_f32!( x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14 + t_a11_15 ); let t_a12 = calc_f32!( x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14 + 
t_a12_15 ); let t_a13 = calc_f32!( x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14 + t_a13_15 ); let t_a14 = calc_f32!( x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14 + t_a14_15 ); let t_a15 = calc_f32!( x0 + t_a15_1 + t_a15_2 + t_a15_3 + t_a15_4 + t_a15_5 + t_a15_6 + t_a15_7 + t_a15_8 + t_a15_9 + t_a15_10 + t_a15_11 + t_a15_12 + t_a15_13 + t_a15_14 + t_a15_15 ); let t_b1 = calc_f32!( t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14 + t_b1_15 ); let t_b2 = calc_f32!( t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14 - t_b2_15 ); let t_b3 = calc_f32!( t_b3_1 + t_b3_2 + t_b3_3 + t_b3_4 + t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 - t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14 + t_b3_15 ); let t_b4 = calc_f32!( t_b4_1 + t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + t_b4_9 + t_b4_10 + t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14 - t_b4_15 ); let t_b5 = calc_f32!( t_b5_1 + t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8 + t_b5_9 - t_b5_10 - t_b5_11 - t_b5_12 + t_b5_13 + t_b5_14 + t_b5_15 ); let t_b6 = calc_f32!( t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 - t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14 - t_b6_15 ); let t_b7 = calc_f32!( t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 + t_b7_11 - t_b7_12 - t_b7_13 + t_b7_14 + t_b7_15 ); let t_b8 = calc_f32!( t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 - t_b8_11 + t_b8_12 + t_b8_13 - t_b8_14 - t_b8_15 ); let t_b9 = calc_f32!( t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11 + t_b9_12 - t_b9_13 + t_b9_14 + t_b9_15 ); let t_b10 = calc_f32!( t_b10_1 - t_b10_2 - t_b10_3 + t_b10_4 - t_b10_5 - t_b10_6 + t_b10_7 - t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11 - t_b10_12 + t_b10_13 - t_b10_14 - t_b10_15 ); let t_b11 = calc_f32!( t_b11_1 - t_b11_2 + t_b11_3 + t_b11_4 - t_b11_5 + t_b11_6 + t_b11_7 - t_b11_8 + t_b11_9 - t_b11_10 - t_b11_11 + t_b11_12 - t_b11_13 - t_b11_14 + t_b11_15 ); let t_b12 = calc_f32!( t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 - t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 + t_b12_9 - t_b12_10 + t_b12_11 - t_b12_12 + t_b12_13 + t_b12_14 - t_b12_15 ); let t_b13 = calc_f32!( t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 - t_b13_7 + t_b13_8 - t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 + t_b13_13 - t_b13_14 + t_b13_15 ); let t_b14 = calc_f32!( t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 - t_b14_11 + t_b14_12 - t_b14_13 + t_b14_14 - t_b14_15 ); let t_b15 = calc_f32!( t_b15_1 - t_b15_2 + t_b15_3 - t_b15_4 + t_b15_5 - t_b15_6 + t_b15_7 - t_b15_8 + t_b15_9 - t_b15_10 + t_b15_11 - t_b15_12 + t_b15_13 - t_b15_14 + t_b15_15 ); let t_b1_rot = self.rotate.rotate_both(t_b1); let t_b2_rot = self.rotate.rotate_both(t_b2); let t_b3_rot = self.rotate.rotate_both(t_b3); let t_b4_rot = self.rotate.rotate_both(t_b4); let t_b5_rot = self.rotate.rotate_both(t_b5); let t_b6_rot = self.rotate.rotate_both(t_b6); let t_b7_rot = self.rotate.rotate_both(t_b7); let t_b8_rot = self.rotate.rotate_both(t_b8); let t_b9_rot = self.rotate.rotate_both(t_b9); let t_b10_rot = 
self.rotate.rotate_both(t_b10); let t_b11_rot = self.rotate.rotate_both(t_b11); let t_b12_rot = self.rotate.rotate_both(t_b12); let t_b13_rot = self.rotate.rotate_both(t_b13); let t_b14_rot = self.rotate.rotate_both(t_b14); let t_b15_rot = self.rotate.rotate_both(t_b15); let y0 = calc_f32!( x0 + x1p30 + x2p29 + x3p28 + x4p27 + x5p26 + x6p25 + x7p24 + x8p23 + x9p22 + x10p21 + x11p20 + x12p19 + x13p18 + x14p17 + x15p16 ); let [y1, y30] = parallel_fft2_interleaved_f32(t_a1, t_b1_rot); let [y2, y29] = parallel_fft2_interleaved_f32(t_a2, t_b2_rot); let [y3, y28] = parallel_fft2_interleaved_f32(t_a3, t_b3_rot); let [y4, y27] = parallel_fft2_interleaved_f32(t_a4, t_b4_rot); let [y5, y26] = parallel_fft2_interleaved_f32(t_a5, t_b5_rot); let [y6, y25] = parallel_fft2_interleaved_f32(t_a6, t_b6_rot); let [y7, y24] = parallel_fft2_interleaved_f32(t_a7, t_b7_rot); let [y8, y23] = parallel_fft2_interleaved_f32(t_a8, t_b8_rot); let [y9, y22] = parallel_fft2_interleaved_f32(t_a9, t_b9_rot); let [y10, y21] = parallel_fft2_interleaved_f32(t_a10, t_b10_rot); let [y11, y20] = parallel_fft2_interleaved_f32(t_a11, t_b11_rot); let [y12, y19] = parallel_fft2_interleaved_f32(t_a12, t_b12_rot); let [y13, y18] = parallel_fft2_interleaved_f32(t_a13, t_b13_rot); let [y14, y17] = parallel_fft2_interleaved_f32(t_a14, t_b14_rot); let [y15, y16] = parallel_fft2_interleaved_f32(t_a15, t_b15_rot); [ y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18, y19, y20, y21, y22, y23, y24, y25, y26, y27, y28, y29, y30, ] } } // _____ _ __ _ _ _ _ _ // |___ // | / /_ | || | | |__ (_) |_ // |_ \| | _____ | '_ \| || |_| '_ \| | __| // ___) | | |_____| | (_) |__ _| |_) | | |_ // |____/|_| \___/ |_| |_.__/|_|\__| // pub struct WasmSimdF64Butterfly31 { direction: FftDirection, _phantom: std::marker::PhantomData, rotate: Rotate90F64, twiddle1re: v128, twiddle1im: v128, twiddle2re: v128, twiddle2im: v128, twiddle3re: v128, twiddle3im: v128, twiddle4re: v128, twiddle4im: v128, twiddle5re: v128, twiddle5im: v128, twiddle6re: v128, twiddle6im: v128, twiddle7re: v128, twiddle7im: v128, twiddle8re: v128, twiddle8im: v128, twiddle9re: v128, twiddle9im: v128, twiddle10re: v128, twiddle10im: v128, twiddle11re: v128, twiddle11im: v128, twiddle12re: v128, twiddle12im: v128, twiddle13re: v128, twiddle13im: v128, twiddle14re: v128, twiddle14im: v128, twiddle15re: v128, twiddle15im: v128, } boilerplate_fft_wasm_simd_f64_butterfly!( WasmSimdF64Butterfly31, 31, |this: &WasmSimdF64Butterfly31<_>| this.direction ); boilerplate_fft_wasm_simd_common_butterfly!( WasmSimdF64Butterfly31, 31, |this: &WasmSimdF64Butterfly31<_>| this.direction ); impl WasmSimdF64Butterfly31 { #[inline(always)] pub fn new(direction: FftDirection) -> Self { assert_f64::(); let rotate = Rotate90F64::new(true); let tw1: Complex = twiddles::compute_twiddle(1, 31, direction); let tw2: Complex = twiddles::compute_twiddle(2, 31, direction); let tw3: Complex = twiddles::compute_twiddle(3, 31, direction); let tw4: Complex = twiddles::compute_twiddle(4, 31, direction); let tw5: Complex = twiddles::compute_twiddle(5, 31, direction); let tw6: Complex = twiddles::compute_twiddle(6, 31, direction); let tw7: Complex = twiddles::compute_twiddle(7, 31, direction); let tw8: Complex = twiddles::compute_twiddle(8, 31, direction); let tw9: Complex = twiddles::compute_twiddle(9, 31, direction); let tw10: Complex = twiddles::compute_twiddle(10, 31, direction); let tw11: Complex = twiddles::compute_twiddle(11, 31, direction); let tw12: Complex = 
twiddles::compute_twiddle(12, 31, direction); let tw13: Complex = twiddles::compute_twiddle(13, 31, direction); let tw14: Complex = twiddles::compute_twiddle(14, 31, direction); let tw15: Complex = twiddles::compute_twiddle(15, 31, direction); let twiddle1re = f64x2_splat(tw1.re); let twiddle1im = f64x2_splat(tw1.im); let twiddle2re = f64x2_splat(tw2.re); let twiddle2im = f64x2_splat(tw2.im); let twiddle3re = f64x2_splat(tw3.re); let twiddle3im = f64x2_splat(tw3.im); let twiddle4re = f64x2_splat(tw4.re); let twiddle4im = f64x2_splat(tw4.im); let twiddle5re = f64x2_splat(tw5.re); let twiddle5im = f64x2_splat(tw5.im); let twiddle6re = f64x2_splat(tw6.re); let twiddle6im = f64x2_splat(tw6.im); let twiddle7re = f64x2_splat(tw7.re); let twiddle7im = f64x2_splat(tw7.im); let twiddle8re = f64x2_splat(tw8.re); let twiddle8im = f64x2_splat(tw8.im); let twiddle9re = f64x2_splat(tw9.re); let twiddle9im = f64x2_splat(tw9.im); let twiddle10re = f64x2_splat(tw10.re); let twiddle10im = f64x2_splat(tw10.im); let twiddle11re = f64x2_splat(tw11.re); let twiddle11im = f64x2_splat(tw11.im); let twiddle12re = f64x2_splat(tw12.re); let twiddle12im = f64x2_splat(tw12.im); let twiddle13re = f64x2_splat(tw13.re); let twiddle13im = f64x2_splat(tw13.im); let twiddle14re = f64x2_splat(tw14.re); let twiddle14im = f64x2_splat(tw14.im); let twiddle15re = f64x2_splat(tw15.re); let twiddle15im = f64x2_splat(tw15.im); Self { direction, _phantom: std::marker::PhantomData, rotate, twiddle1re, twiddle1im, twiddle2re, twiddle2im, twiddle3re, twiddle3im, twiddle4re, twiddle4im, twiddle5re, twiddle5im, twiddle6re, twiddle6im, twiddle7re, twiddle7im, twiddle8re, twiddle8im, twiddle9re, twiddle9im, twiddle10re, twiddle10im, twiddle11re, twiddle11im, twiddle12re, twiddle12im, twiddle13re, twiddle13im, twiddle14re, twiddle14im, twiddle15re, twiddle15im, } } #[inline(always)] pub(crate) unsafe fn perform_fft_contiguous(&self, mut buffer: impl WasmSimdArrayMut) { let values = read_complex_to_array!(buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}); let out = self.perform_fft_direct(values); write_complex_to_array!(out, buffer, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30}); } #[inline(always)] pub(crate) unsafe fn perform_fft_direct(&self, values: [v128; 31]) -> [v128; 31] { let [x1p30, x1m30] = solo_fft2_f64(values[1], values[30]); let [x2p29, x2m29] = solo_fft2_f64(values[2], values[29]); let [x3p28, x3m28] = solo_fft2_f64(values[3], values[28]); let [x4p27, x4m27] = solo_fft2_f64(values[4], values[27]); let [x5p26, x5m26] = solo_fft2_f64(values[5], values[26]); let [x6p25, x6m25] = solo_fft2_f64(values[6], values[25]); let [x7p24, x7m24] = solo_fft2_f64(values[7], values[24]); let [x8p23, x8m23] = solo_fft2_f64(values[8], values[23]); let [x9p22, x9m22] = solo_fft2_f64(values[9], values[22]); let [x10p21, x10m21] = solo_fft2_f64(values[10], values[21]); let [x11p20, x11m20] = solo_fft2_f64(values[11], values[20]); let [x12p19, x12m19] = solo_fft2_f64(values[12], values[19]); let [x13p18, x13m18] = solo_fft2_f64(values[13], values[18]); let [x14p17, x14m17] = solo_fft2_f64(values[14], values[17]); let [x15p16, x15m16] = solo_fft2_f64(values[15], values[16]); let t_a1_1 = f64x2_mul(self.twiddle1re, x1p30); let t_a1_2 = f64x2_mul(self.twiddle2re, x2p29); let t_a1_3 = f64x2_mul(self.twiddle3re, x3p28); let t_a1_4 = f64x2_mul(self.twiddle4re, x4p27); let t_a1_5 = f64x2_mul(self.twiddle5re, 
x5p26); let t_a1_6 = f64x2_mul(self.twiddle6re, x6p25); let t_a1_7 = f64x2_mul(self.twiddle7re, x7p24); let t_a1_8 = f64x2_mul(self.twiddle8re, x8p23); let t_a1_9 = f64x2_mul(self.twiddle9re, x9p22); let t_a1_10 = f64x2_mul(self.twiddle10re, x10p21); let t_a1_11 = f64x2_mul(self.twiddle11re, x11p20); let t_a1_12 = f64x2_mul(self.twiddle12re, x12p19); let t_a1_13 = f64x2_mul(self.twiddle13re, x13p18); let t_a1_14 = f64x2_mul(self.twiddle14re, x14p17); let t_a1_15 = f64x2_mul(self.twiddle15re, x15p16); let t_a2_1 = f64x2_mul(self.twiddle2re, x1p30); let t_a2_2 = f64x2_mul(self.twiddle4re, x2p29); let t_a2_3 = f64x2_mul(self.twiddle6re, x3p28); let t_a2_4 = f64x2_mul(self.twiddle8re, x4p27); let t_a2_5 = f64x2_mul(self.twiddle10re, x5p26); let t_a2_6 = f64x2_mul(self.twiddle12re, x6p25); let t_a2_7 = f64x2_mul(self.twiddle14re, x7p24); let t_a2_8 = f64x2_mul(self.twiddle15re, x8p23); let t_a2_9 = f64x2_mul(self.twiddle13re, x9p22); let t_a2_10 = f64x2_mul(self.twiddle11re, x10p21); let t_a2_11 = f64x2_mul(self.twiddle9re, x11p20); let t_a2_12 = f64x2_mul(self.twiddle7re, x12p19); let t_a2_13 = f64x2_mul(self.twiddle5re, x13p18); let t_a2_14 = f64x2_mul(self.twiddle3re, x14p17); let t_a2_15 = f64x2_mul(self.twiddle1re, x15p16); let t_a3_1 = f64x2_mul(self.twiddle3re, x1p30); let t_a3_2 = f64x2_mul(self.twiddle6re, x2p29); let t_a3_3 = f64x2_mul(self.twiddle9re, x3p28); let t_a3_4 = f64x2_mul(self.twiddle12re, x4p27); let t_a3_5 = f64x2_mul(self.twiddle15re, x5p26); let t_a3_6 = f64x2_mul(self.twiddle13re, x6p25); let t_a3_7 = f64x2_mul(self.twiddle10re, x7p24); let t_a3_8 = f64x2_mul(self.twiddle7re, x8p23); let t_a3_9 = f64x2_mul(self.twiddle4re, x9p22); let t_a3_10 = f64x2_mul(self.twiddle1re, x10p21); let t_a3_11 = f64x2_mul(self.twiddle2re, x11p20); let t_a3_12 = f64x2_mul(self.twiddle5re, x12p19); let t_a3_13 = f64x2_mul(self.twiddle8re, x13p18); let t_a3_14 = f64x2_mul(self.twiddle11re, x14p17); let t_a3_15 = f64x2_mul(self.twiddle14re, x15p16); let t_a4_1 = f64x2_mul(self.twiddle4re, x1p30); let t_a4_2 = f64x2_mul(self.twiddle8re, x2p29); let t_a4_3 = f64x2_mul(self.twiddle12re, x3p28); let t_a4_4 = f64x2_mul(self.twiddle15re, x4p27); let t_a4_5 = f64x2_mul(self.twiddle11re, x5p26); let t_a4_6 = f64x2_mul(self.twiddle7re, x6p25); let t_a4_7 = f64x2_mul(self.twiddle3re, x7p24); let t_a4_8 = f64x2_mul(self.twiddle1re, x8p23); let t_a4_9 = f64x2_mul(self.twiddle5re, x9p22); let t_a4_10 = f64x2_mul(self.twiddle9re, x10p21); let t_a4_11 = f64x2_mul(self.twiddle13re, x11p20); let t_a4_12 = f64x2_mul(self.twiddle14re, x12p19); let t_a4_13 = f64x2_mul(self.twiddle10re, x13p18); let t_a4_14 = f64x2_mul(self.twiddle6re, x14p17); let t_a4_15 = f64x2_mul(self.twiddle2re, x15p16); let t_a5_1 = f64x2_mul(self.twiddle5re, x1p30); let t_a5_2 = f64x2_mul(self.twiddle10re, x2p29); let t_a5_3 = f64x2_mul(self.twiddle15re, x3p28); let t_a5_4 = f64x2_mul(self.twiddle11re, x4p27); let t_a5_5 = f64x2_mul(self.twiddle6re, x5p26); let t_a5_6 = f64x2_mul(self.twiddle1re, x6p25); let t_a5_7 = f64x2_mul(self.twiddle4re, x7p24); let t_a5_8 = f64x2_mul(self.twiddle9re, x8p23); let t_a5_9 = f64x2_mul(self.twiddle14re, x9p22); let t_a5_10 = f64x2_mul(self.twiddle12re, x10p21); let t_a5_11 = f64x2_mul(self.twiddle7re, x11p20); let t_a5_12 = f64x2_mul(self.twiddle2re, x12p19); let t_a5_13 = f64x2_mul(self.twiddle3re, x13p18); let t_a5_14 = f64x2_mul(self.twiddle8re, x14p17); let t_a5_15 = f64x2_mul(self.twiddle13re, x15p16); let t_a6_1 = f64x2_mul(self.twiddle6re, x1p30); let t_a6_2 = f64x2_mul(self.twiddle12re, x2p29); 
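        // This is the f64 counterpart of the f32 size-31 butterfly above. Each v128 holds a
        // single Complex<f64> (one lane for re, one for im), so this path processes one FFT at a
        // time and uses solo_fft2_f64 / rotate() where the f32 path packs two complexes per
        // vector and uses parallel_fft2_interleaved_f32 / rotate_both(). The t_a*/t_b*
        // bookkeeping is otherwise identical.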
let t_a6_3 = f64x2_mul(self.twiddle13re, x3p28); let t_a6_4 = f64x2_mul(self.twiddle7re, x4p27); let t_a6_5 = f64x2_mul(self.twiddle1re, x5p26); let t_a6_6 = f64x2_mul(self.twiddle5re, x6p25); let t_a6_7 = f64x2_mul(self.twiddle11re, x7p24); let t_a6_8 = f64x2_mul(self.twiddle14re, x8p23); let t_a6_9 = f64x2_mul(self.twiddle8re, x9p22); let t_a6_10 = f64x2_mul(self.twiddle2re, x10p21); let t_a6_11 = f64x2_mul(self.twiddle4re, x11p20); let t_a6_12 = f64x2_mul(self.twiddle10re, x12p19); let t_a6_13 = f64x2_mul(self.twiddle15re, x13p18); let t_a6_14 = f64x2_mul(self.twiddle9re, x14p17); let t_a6_15 = f64x2_mul(self.twiddle3re, x15p16); let t_a7_1 = f64x2_mul(self.twiddle7re, x1p30); let t_a7_2 = f64x2_mul(self.twiddle14re, x2p29); let t_a7_3 = f64x2_mul(self.twiddle10re, x3p28); let t_a7_4 = f64x2_mul(self.twiddle3re, x4p27); let t_a7_5 = f64x2_mul(self.twiddle4re, x5p26); let t_a7_6 = f64x2_mul(self.twiddle11re, x6p25); let t_a7_7 = f64x2_mul(self.twiddle13re, x7p24); let t_a7_8 = f64x2_mul(self.twiddle6re, x8p23); let t_a7_9 = f64x2_mul(self.twiddle1re, x9p22); let t_a7_10 = f64x2_mul(self.twiddle8re, x10p21); let t_a7_11 = f64x2_mul(self.twiddle15re, x11p20); let t_a7_12 = f64x2_mul(self.twiddle9re, x12p19); let t_a7_13 = f64x2_mul(self.twiddle2re, x13p18); let t_a7_14 = f64x2_mul(self.twiddle5re, x14p17); let t_a7_15 = f64x2_mul(self.twiddle12re, x15p16); let t_a8_1 = f64x2_mul(self.twiddle8re, x1p30); let t_a8_2 = f64x2_mul(self.twiddle15re, x2p29); let t_a8_3 = f64x2_mul(self.twiddle7re, x3p28); let t_a8_4 = f64x2_mul(self.twiddle1re, x4p27); let t_a8_5 = f64x2_mul(self.twiddle9re, x5p26); let t_a8_6 = f64x2_mul(self.twiddle14re, x6p25); let t_a8_7 = f64x2_mul(self.twiddle6re, x7p24); let t_a8_8 = f64x2_mul(self.twiddle2re, x8p23); let t_a8_9 = f64x2_mul(self.twiddle10re, x9p22); let t_a8_10 = f64x2_mul(self.twiddle13re, x10p21); let t_a8_11 = f64x2_mul(self.twiddle5re, x11p20); let t_a8_12 = f64x2_mul(self.twiddle3re, x12p19); let t_a8_13 = f64x2_mul(self.twiddle11re, x13p18); let t_a8_14 = f64x2_mul(self.twiddle12re, x14p17); let t_a8_15 = f64x2_mul(self.twiddle4re, x15p16); let t_a9_1 = f64x2_mul(self.twiddle9re, x1p30); let t_a9_2 = f64x2_mul(self.twiddle13re, x2p29); let t_a9_3 = f64x2_mul(self.twiddle4re, x3p28); let t_a9_4 = f64x2_mul(self.twiddle5re, x4p27); let t_a9_5 = f64x2_mul(self.twiddle14re, x5p26); let t_a9_6 = f64x2_mul(self.twiddle8re, x6p25); let t_a9_7 = f64x2_mul(self.twiddle1re, x7p24); let t_a9_8 = f64x2_mul(self.twiddle10re, x8p23); let t_a9_9 = f64x2_mul(self.twiddle12re, x9p22); let t_a9_10 = f64x2_mul(self.twiddle3re, x10p21); let t_a9_11 = f64x2_mul(self.twiddle6re, x11p20); let t_a9_12 = f64x2_mul(self.twiddle15re, x12p19); let t_a9_13 = f64x2_mul(self.twiddle7re, x13p18); let t_a9_14 = f64x2_mul(self.twiddle2re, x14p17); let t_a9_15 = f64x2_mul(self.twiddle11re, x15p16); let t_a10_1 = f64x2_mul(self.twiddle10re, x1p30); let t_a10_2 = f64x2_mul(self.twiddle11re, x2p29); let t_a10_3 = f64x2_mul(self.twiddle1re, x3p28); let t_a10_4 = f64x2_mul(self.twiddle9re, x4p27); let t_a10_5 = f64x2_mul(self.twiddle12re, x5p26); let t_a10_6 = f64x2_mul(self.twiddle2re, x6p25); let t_a10_7 = f64x2_mul(self.twiddle8re, x7p24); let t_a10_8 = f64x2_mul(self.twiddle13re, x8p23); let t_a10_9 = f64x2_mul(self.twiddle3re, x9p22); let t_a10_10 = f64x2_mul(self.twiddle7re, x10p21); let t_a10_11 = f64x2_mul(self.twiddle14re, x11p20); let t_a10_12 = f64x2_mul(self.twiddle4re, x12p19); let t_a10_13 = f64x2_mul(self.twiddle6re, x13p18); let t_a10_14 = f64x2_mul(self.twiddle15re, 
x14p17); let t_a10_15 = f64x2_mul(self.twiddle5re, x15p16); let t_a11_1 = f64x2_mul(self.twiddle11re, x1p30); let t_a11_2 = f64x2_mul(self.twiddle9re, x2p29); let t_a11_3 = f64x2_mul(self.twiddle2re, x3p28); let t_a11_4 = f64x2_mul(self.twiddle13re, x4p27); let t_a11_5 = f64x2_mul(self.twiddle7re, x5p26); let t_a11_6 = f64x2_mul(self.twiddle4re, x6p25); let t_a11_7 = f64x2_mul(self.twiddle15re, x7p24); let t_a11_8 = f64x2_mul(self.twiddle5re, x8p23); let t_a11_9 = f64x2_mul(self.twiddle6re, x9p22); let t_a11_10 = f64x2_mul(self.twiddle14re, x10p21); let t_a11_11 = f64x2_mul(self.twiddle3re, x11p20); let t_a11_12 = f64x2_mul(self.twiddle8re, x12p19); let t_a11_13 = f64x2_mul(self.twiddle12re, x13p18); let t_a11_14 = f64x2_mul(self.twiddle1re, x14p17); let t_a11_15 = f64x2_mul(self.twiddle10re, x15p16); let t_a12_1 = f64x2_mul(self.twiddle12re, x1p30); let t_a12_2 = f64x2_mul(self.twiddle7re, x2p29); let t_a12_3 = f64x2_mul(self.twiddle5re, x3p28); let t_a12_4 = f64x2_mul(self.twiddle14re, x4p27); let t_a12_5 = f64x2_mul(self.twiddle2re, x5p26); let t_a12_6 = f64x2_mul(self.twiddle10re, x6p25); let t_a12_7 = f64x2_mul(self.twiddle9re, x7p24); let t_a12_8 = f64x2_mul(self.twiddle3re, x8p23); let t_a12_9 = f64x2_mul(self.twiddle15re, x9p22); let t_a12_10 = f64x2_mul(self.twiddle4re, x10p21); let t_a12_11 = f64x2_mul(self.twiddle8re, x11p20); let t_a12_12 = f64x2_mul(self.twiddle11re, x12p19); let t_a12_13 = f64x2_mul(self.twiddle1re, x13p18); let t_a12_14 = f64x2_mul(self.twiddle13re, x14p17); let t_a12_15 = f64x2_mul(self.twiddle6re, x15p16); let t_a13_1 = f64x2_mul(self.twiddle13re, x1p30); let t_a13_2 = f64x2_mul(self.twiddle5re, x2p29); let t_a13_3 = f64x2_mul(self.twiddle8re, x3p28); let t_a13_4 = f64x2_mul(self.twiddle10re, x4p27); let t_a13_5 = f64x2_mul(self.twiddle3re, x5p26); let t_a13_6 = f64x2_mul(self.twiddle15re, x6p25); let t_a13_7 = f64x2_mul(self.twiddle2re, x7p24); let t_a13_8 = f64x2_mul(self.twiddle11re, x8p23); let t_a13_9 = f64x2_mul(self.twiddle7re, x9p22); let t_a13_10 = f64x2_mul(self.twiddle6re, x10p21); let t_a13_11 = f64x2_mul(self.twiddle12re, x11p20); let t_a13_12 = f64x2_mul(self.twiddle1re, x12p19); let t_a13_13 = f64x2_mul(self.twiddle14re, x13p18); let t_a13_14 = f64x2_mul(self.twiddle4re, x14p17); let t_a13_15 = f64x2_mul(self.twiddle9re, x15p16); let t_a14_1 = f64x2_mul(self.twiddle14re, x1p30); let t_a14_2 = f64x2_mul(self.twiddle3re, x2p29); let t_a14_3 = f64x2_mul(self.twiddle11re, x3p28); let t_a14_4 = f64x2_mul(self.twiddle6re, x4p27); let t_a14_5 = f64x2_mul(self.twiddle8re, x5p26); let t_a14_6 = f64x2_mul(self.twiddle9re, x6p25); let t_a14_7 = f64x2_mul(self.twiddle5re, x7p24); let t_a14_8 = f64x2_mul(self.twiddle12re, x8p23); let t_a14_9 = f64x2_mul(self.twiddle2re, x9p22); let t_a14_10 = f64x2_mul(self.twiddle15re, x10p21); let t_a14_11 = f64x2_mul(self.twiddle1re, x11p20); let t_a14_12 = f64x2_mul(self.twiddle13re, x12p19); let t_a14_13 = f64x2_mul(self.twiddle4re, x13p18); let t_a14_14 = f64x2_mul(self.twiddle10re, x14p17); let t_a14_15 = f64x2_mul(self.twiddle7re, x15p16); let t_a15_1 = f64x2_mul(self.twiddle15re, x1p30); let t_a15_2 = f64x2_mul(self.twiddle1re, x2p29); let t_a15_3 = f64x2_mul(self.twiddle14re, x3p28); let t_a15_4 = f64x2_mul(self.twiddle2re, x4p27); let t_a15_5 = f64x2_mul(self.twiddle13re, x5p26); let t_a15_6 = f64x2_mul(self.twiddle3re, x6p25); let t_a15_7 = f64x2_mul(self.twiddle12re, x7p24); let t_a15_8 = f64x2_mul(self.twiddle4re, x8p23); let t_a15_9 = f64x2_mul(self.twiddle11re, x9p22); let t_a15_10 = 
f64x2_mul(self.twiddle5re, x10p21); let t_a15_11 = f64x2_mul(self.twiddle10re, x11p20); let t_a15_12 = f64x2_mul(self.twiddle6re, x12p19); let t_a15_13 = f64x2_mul(self.twiddle9re, x13p18); let t_a15_14 = f64x2_mul(self.twiddle7re, x14p17); let t_a15_15 = f64x2_mul(self.twiddle8re, x15p16); let t_b1_1 = f64x2_mul(self.twiddle1im, x1m30); let t_b1_2 = f64x2_mul(self.twiddle2im, x2m29); let t_b1_3 = f64x2_mul(self.twiddle3im, x3m28); let t_b1_4 = f64x2_mul(self.twiddle4im, x4m27); let t_b1_5 = f64x2_mul(self.twiddle5im, x5m26); let t_b1_6 = f64x2_mul(self.twiddle6im, x6m25); let t_b1_7 = f64x2_mul(self.twiddle7im, x7m24); let t_b1_8 = f64x2_mul(self.twiddle8im, x8m23); let t_b1_9 = f64x2_mul(self.twiddle9im, x9m22); let t_b1_10 = f64x2_mul(self.twiddle10im, x10m21); let t_b1_11 = f64x2_mul(self.twiddle11im, x11m20); let t_b1_12 = f64x2_mul(self.twiddle12im, x12m19); let t_b1_13 = f64x2_mul(self.twiddle13im, x13m18); let t_b1_14 = f64x2_mul(self.twiddle14im, x14m17); let t_b1_15 = f64x2_mul(self.twiddle15im, x15m16); let t_b2_1 = f64x2_mul(self.twiddle2im, x1m30); let t_b2_2 = f64x2_mul(self.twiddle4im, x2m29); let t_b2_3 = f64x2_mul(self.twiddle6im, x3m28); let t_b2_4 = f64x2_mul(self.twiddle8im, x4m27); let t_b2_5 = f64x2_mul(self.twiddle10im, x5m26); let t_b2_6 = f64x2_mul(self.twiddle12im, x6m25); let t_b2_7 = f64x2_mul(self.twiddle14im, x7m24); let t_b2_8 = f64x2_mul(self.twiddle15im, x8m23); let t_b2_9 = f64x2_mul(self.twiddle13im, x9m22); let t_b2_10 = f64x2_mul(self.twiddle11im, x10m21); let t_b2_11 = f64x2_mul(self.twiddle9im, x11m20); let t_b2_12 = f64x2_mul(self.twiddle7im, x12m19); let t_b2_13 = f64x2_mul(self.twiddle5im, x13m18); let t_b2_14 = f64x2_mul(self.twiddle3im, x14m17); let t_b2_15 = f64x2_mul(self.twiddle1im, x15m16); let t_b3_1 = f64x2_mul(self.twiddle3im, x1m30); let t_b3_2 = f64x2_mul(self.twiddle6im, x2m29); let t_b3_3 = f64x2_mul(self.twiddle9im, x3m28); let t_b3_4 = f64x2_mul(self.twiddle12im, x4m27); let t_b3_5 = f64x2_mul(self.twiddle15im, x5m26); let t_b3_6 = f64x2_mul(self.twiddle13im, x6m25); let t_b3_7 = f64x2_mul(self.twiddle10im, x7m24); let t_b3_8 = f64x2_mul(self.twiddle7im, x8m23); let t_b3_9 = f64x2_mul(self.twiddle4im, x9m22); let t_b3_10 = f64x2_mul(self.twiddle1im, x10m21); let t_b3_11 = f64x2_mul(self.twiddle2im, x11m20); let t_b3_12 = f64x2_mul(self.twiddle5im, x12m19); let t_b3_13 = f64x2_mul(self.twiddle8im, x13m18); let t_b3_14 = f64x2_mul(self.twiddle11im, x14m17); let t_b3_15 = f64x2_mul(self.twiddle14im, x15m16); let t_b4_1 = f64x2_mul(self.twiddle4im, x1m30); let t_b4_2 = f64x2_mul(self.twiddle8im, x2m29); let t_b4_3 = f64x2_mul(self.twiddle12im, x3m28); let t_b4_4 = f64x2_mul(self.twiddle15im, x4m27); let t_b4_5 = f64x2_mul(self.twiddle11im, x5m26); let t_b4_6 = f64x2_mul(self.twiddle7im, x6m25); let t_b4_7 = f64x2_mul(self.twiddle3im, x7m24); let t_b4_8 = f64x2_mul(self.twiddle1im, x8m23); let t_b4_9 = f64x2_mul(self.twiddle5im, x9m22); let t_b4_10 = f64x2_mul(self.twiddle9im, x10m21); let t_b4_11 = f64x2_mul(self.twiddle13im, x11m20); let t_b4_12 = f64x2_mul(self.twiddle14im, x12m19); let t_b4_13 = f64x2_mul(self.twiddle10im, x13m18); let t_b4_14 = f64x2_mul(self.twiddle6im, x14m17); let t_b4_15 = f64x2_mul(self.twiddle2im, x15m16); let t_b5_1 = f64x2_mul(self.twiddle5im, x1m30); let t_b5_2 = f64x2_mul(self.twiddle10im, x2m29); let t_b5_3 = f64x2_mul(self.twiddle15im, x3m28); let t_b5_4 = f64x2_mul(self.twiddle11im, x4m27); let t_b5_5 = f64x2_mul(self.twiddle6im, x5m26); let t_b5_6 = f64x2_mul(self.twiddle1im, x6m25); let t_b5_7 = 
f64x2_mul(self.twiddle4im, x7m24); let t_b5_8 = f64x2_mul(self.twiddle9im, x8m23); let t_b5_9 = f64x2_mul(self.twiddle14im, x9m22); let t_b5_10 = f64x2_mul(self.twiddle12im, x10m21); let t_b5_11 = f64x2_mul(self.twiddle7im, x11m20); let t_b5_12 = f64x2_mul(self.twiddle2im, x12m19); let t_b5_13 = f64x2_mul(self.twiddle3im, x13m18); let t_b5_14 = f64x2_mul(self.twiddle8im, x14m17); let t_b5_15 = f64x2_mul(self.twiddle13im, x15m16); let t_b6_1 = f64x2_mul(self.twiddle6im, x1m30); let t_b6_2 = f64x2_mul(self.twiddle12im, x2m29); let t_b6_3 = f64x2_mul(self.twiddle13im, x3m28); let t_b6_4 = f64x2_mul(self.twiddle7im, x4m27); let t_b6_5 = f64x2_mul(self.twiddle1im, x5m26); let t_b6_6 = f64x2_mul(self.twiddle5im, x6m25); let t_b6_7 = f64x2_mul(self.twiddle11im, x7m24); let t_b6_8 = f64x2_mul(self.twiddle14im, x8m23); let t_b6_9 = f64x2_mul(self.twiddle8im, x9m22); let t_b6_10 = f64x2_mul(self.twiddle2im, x10m21); let t_b6_11 = f64x2_mul(self.twiddle4im, x11m20); let t_b6_12 = f64x2_mul(self.twiddle10im, x12m19); let t_b6_13 = f64x2_mul(self.twiddle15im, x13m18); let t_b6_14 = f64x2_mul(self.twiddle9im, x14m17); let t_b6_15 = f64x2_mul(self.twiddle3im, x15m16); let t_b7_1 = f64x2_mul(self.twiddle7im, x1m30); let t_b7_2 = f64x2_mul(self.twiddle14im, x2m29); let t_b7_3 = f64x2_mul(self.twiddle10im, x3m28); let t_b7_4 = f64x2_mul(self.twiddle3im, x4m27); let t_b7_5 = f64x2_mul(self.twiddle4im, x5m26); let t_b7_6 = f64x2_mul(self.twiddle11im, x6m25); let t_b7_7 = f64x2_mul(self.twiddle13im, x7m24); let t_b7_8 = f64x2_mul(self.twiddle6im, x8m23); let t_b7_9 = f64x2_mul(self.twiddle1im, x9m22); let t_b7_10 = f64x2_mul(self.twiddle8im, x10m21); let t_b7_11 = f64x2_mul(self.twiddle15im, x11m20); let t_b7_12 = f64x2_mul(self.twiddle9im, x12m19); let t_b7_13 = f64x2_mul(self.twiddle2im, x13m18); let t_b7_14 = f64x2_mul(self.twiddle5im, x14m17); let t_b7_15 = f64x2_mul(self.twiddle12im, x15m16); let t_b8_1 = f64x2_mul(self.twiddle8im, x1m30); let t_b8_2 = f64x2_mul(self.twiddle15im, x2m29); let t_b8_3 = f64x2_mul(self.twiddle7im, x3m28); let t_b8_4 = f64x2_mul(self.twiddle1im, x4m27); let t_b8_5 = f64x2_mul(self.twiddle9im, x5m26); let t_b8_6 = f64x2_mul(self.twiddle14im, x6m25); let t_b8_7 = f64x2_mul(self.twiddle6im, x7m24); let t_b8_8 = f64x2_mul(self.twiddle2im, x8m23); let t_b8_9 = f64x2_mul(self.twiddle10im, x9m22); let t_b8_10 = f64x2_mul(self.twiddle13im, x10m21); let t_b8_11 = f64x2_mul(self.twiddle5im, x11m20); let t_b8_12 = f64x2_mul(self.twiddle3im, x12m19); let t_b8_13 = f64x2_mul(self.twiddle11im, x13m18); let t_b8_14 = f64x2_mul(self.twiddle12im, x14m17); let t_b8_15 = f64x2_mul(self.twiddle4im, x15m16); let t_b9_1 = f64x2_mul(self.twiddle9im, x1m30); let t_b9_2 = f64x2_mul(self.twiddle13im, x2m29); let t_b9_3 = f64x2_mul(self.twiddle4im, x3m28); let t_b9_4 = f64x2_mul(self.twiddle5im, x4m27); let t_b9_5 = f64x2_mul(self.twiddle14im, x5m26); let t_b9_6 = f64x2_mul(self.twiddle8im, x6m25); let t_b9_7 = f64x2_mul(self.twiddle1im, x7m24); let t_b9_8 = f64x2_mul(self.twiddle10im, x8m23); let t_b9_9 = f64x2_mul(self.twiddle12im, x9m22); let t_b9_10 = f64x2_mul(self.twiddle3im, x10m21); let t_b9_11 = f64x2_mul(self.twiddle6im, x11m20); let t_b9_12 = f64x2_mul(self.twiddle15im, x12m19); let t_b9_13 = f64x2_mul(self.twiddle7im, x13m18); let t_b9_14 = f64x2_mul(self.twiddle2im, x14m17); let t_b9_15 = f64x2_mul(self.twiddle11im, x15m16); let t_b10_1 = f64x2_mul(self.twiddle10im, x1m30); let t_b10_2 = f64x2_mul(self.twiddle11im, x2m29); let t_b10_3 = f64x2_mul(self.twiddle1im, x3m28); let t_b10_4 = 
f64x2_mul(self.twiddle9im, x4m27); let t_b10_5 = f64x2_mul(self.twiddle12im, x5m26); let t_b10_6 = f64x2_mul(self.twiddle2im, x6m25); let t_b10_7 = f64x2_mul(self.twiddle8im, x7m24); let t_b10_8 = f64x2_mul(self.twiddle13im, x8m23); let t_b10_9 = f64x2_mul(self.twiddle3im, x9m22); let t_b10_10 = f64x2_mul(self.twiddle7im, x10m21); let t_b10_11 = f64x2_mul(self.twiddle14im, x11m20); let t_b10_12 = f64x2_mul(self.twiddle4im, x12m19); let t_b10_13 = f64x2_mul(self.twiddle6im, x13m18); let t_b10_14 = f64x2_mul(self.twiddle15im, x14m17); let t_b10_15 = f64x2_mul(self.twiddle5im, x15m16); let t_b11_1 = f64x2_mul(self.twiddle11im, x1m30); let t_b11_2 = f64x2_mul(self.twiddle9im, x2m29); let t_b11_3 = f64x2_mul(self.twiddle2im, x3m28); let t_b11_4 = f64x2_mul(self.twiddle13im, x4m27); let t_b11_5 = f64x2_mul(self.twiddle7im, x5m26); let t_b11_6 = f64x2_mul(self.twiddle4im, x6m25); let t_b11_7 = f64x2_mul(self.twiddle15im, x7m24); let t_b11_8 = f64x2_mul(self.twiddle5im, x8m23); let t_b11_9 = f64x2_mul(self.twiddle6im, x9m22); let t_b11_10 = f64x2_mul(self.twiddle14im, x10m21); let t_b11_11 = f64x2_mul(self.twiddle3im, x11m20); let t_b11_12 = f64x2_mul(self.twiddle8im, x12m19); let t_b11_13 = f64x2_mul(self.twiddle12im, x13m18); let t_b11_14 = f64x2_mul(self.twiddle1im, x14m17); let t_b11_15 = f64x2_mul(self.twiddle10im, x15m16); let t_b12_1 = f64x2_mul(self.twiddle12im, x1m30); let t_b12_2 = f64x2_mul(self.twiddle7im, x2m29); let t_b12_3 = f64x2_mul(self.twiddle5im, x3m28); let t_b12_4 = f64x2_mul(self.twiddle14im, x4m27); let t_b12_5 = f64x2_mul(self.twiddle2im, x5m26); let t_b12_6 = f64x2_mul(self.twiddle10im, x6m25); let t_b12_7 = f64x2_mul(self.twiddle9im, x7m24); let t_b12_8 = f64x2_mul(self.twiddle3im, x8m23); let t_b12_9 = f64x2_mul(self.twiddle15im, x9m22); let t_b12_10 = f64x2_mul(self.twiddle4im, x10m21); let t_b12_11 = f64x2_mul(self.twiddle8im, x11m20); let t_b12_12 = f64x2_mul(self.twiddle11im, x12m19); let t_b12_13 = f64x2_mul(self.twiddle1im, x13m18); let t_b12_14 = f64x2_mul(self.twiddle13im, x14m17); let t_b12_15 = f64x2_mul(self.twiddle6im, x15m16); let t_b13_1 = f64x2_mul(self.twiddle13im, x1m30); let t_b13_2 = f64x2_mul(self.twiddle5im, x2m29); let t_b13_3 = f64x2_mul(self.twiddle8im, x3m28); let t_b13_4 = f64x2_mul(self.twiddle10im, x4m27); let t_b13_5 = f64x2_mul(self.twiddle3im, x5m26); let t_b13_6 = f64x2_mul(self.twiddle15im, x6m25); let t_b13_7 = f64x2_mul(self.twiddle2im, x7m24); let t_b13_8 = f64x2_mul(self.twiddle11im, x8m23); let t_b13_9 = f64x2_mul(self.twiddle7im, x9m22); let t_b13_10 = f64x2_mul(self.twiddle6im, x10m21); let t_b13_11 = f64x2_mul(self.twiddle12im, x11m20); let t_b13_12 = f64x2_mul(self.twiddle1im, x12m19); let t_b13_13 = f64x2_mul(self.twiddle14im, x13m18); let t_b13_14 = f64x2_mul(self.twiddle4im, x14m17); let t_b13_15 = f64x2_mul(self.twiddle9im, x15m16); let t_b14_1 = f64x2_mul(self.twiddle14im, x1m30); let t_b14_2 = f64x2_mul(self.twiddle3im, x2m29); let t_b14_3 = f64x2_mul(self.twiddle11im, x3m28); let t_b14_4 = f64x2_mul(self.twiddle6im, x4m27); let t_b14_5 = f64x2_mul(self.twiddle8im, x5m26); let t_b14_6 = f64x2_mul(self.twiddle9im, x6m25); let t_b14_7 = f64x2_mul(self.twiddle5im, x7m24); let t_b14_8 = f64x2_mul(self.twiddle12im, x8m23); let t_b14_9 = f64x2_mul(self.twiddle2im, x9m22); let t_b14_10 = f64x2_mul(self.twiddle15im, x10m21); let t_b14_11 = f64x2_mul(self.twiddle1im, x11m20); let t_b14_12 = f64x2_mul(self.twiddle13im, x12m19); let t_b14_13 = f64x2_mul(self.twiddle4im, x13m18); let t_b14_14 = f64x2_mul(self.twiddle10im, x14m17); let 
t_b14_15 = f64x2_mul(self.twiddle7im, x15m16); let t_b15_1 = f64x2_mul(self.twiddle15im, x1m30); let t_b15_2 = f64x2_mul(self.twiddle1im, x2m29); let t_b15_3 = f64x2_mul(self.twiddle14im, x3m28); let t_b15_4 = f64x2_mul(self.twiddle2im, x4m27); let t_b15_5 = f64x2_mul(self.twiddle13im, x5m26); let t_b15_6 = f64x2_mul(self.twiddle3im, x6m25); let t_b15_7 = f64x2_mul(self.twiddle12im, x7m24); let t_b15_8 = f64x2_mul(self.twiddle4im, x8m23); let t_b15_9 = f64x2_mul(self.twiddle11im, x9m22); let t_b15_10 = f64x2_mul(self.twiddle5im, x10m21); let t_b15_11 = f64x2_mul(self.twiddle10im, x11m20); let t_b15_12 = f64x2_mul(self.twiddle6im, x12m19); let t_b15_13 = f64x2_mul(self.twiddle9im, x13m18); let t_b15_14 = f64x2_mul(self.twiddle7im, x14m17); let t_b15_15 = f64x2_mul(self.twiddle8im, x15m16); let x0 = values[0]; let t_a1 = calc_f64!( x0 + t_a1_1 + t_a1_2 + t_a1_3 + t_a1_4 + t_a1_5 + t_a1_6 + t_a1_7 + t_a1_8 + t_a1_9 + t_a1_10 + t_a1_11 + t_a1_12 + t_a1_13 + t_a1_14 + t_a1_15 ); let t_a2 = calc_f64!( x0 + t_a2_1 + t_a2_2 + t_a2_3 + t_a2_4 + t_a2_5 + t_a2_6 + t_a2_7 + t_a2_8 + t_a2_9 + t_a2_10 + t_a2_11 + t_a2_12 + t_a2_13 + t_a2_14 + t_a2_15 ); let t_a3 = calc_f64!( x0 + t_a3_1 + t_a3_2 + t_a3_3 + t_a3_4 + t_a3_5 + t_a3_6 + t_a3_7 + t_a3_8 + t_a3_9 + t_a3_10 + t_a3_11 + t_a3_12 + t_a3_13 + t_a3_14 + t_a3_15 ); let t_a4 = calc_f64!( x0 + t_a4_1 + t_a4_2 + t_a4_3 + t_a4_4 + t_a4_5 + t_a4_6 + t_a4_7 + t_a4_8 + t_a4_9 + t_a4_10 + t_a4_11 + t_a4_12 + t_a4_13 + t_a4_14 + t_a4_15 ); let t_a5 = calc_f64!( x0 + t_a5_1 + t_a5_2 + t_a5_3 + t_a5_4 + t_a5_5 + t_a5_6 + t_a5_7 + t_a5_8 + t_a5_9 + t_a5_10 + t_a5_11 + t_a5_12 + t_a5_13 + t_a5_14 + t_a5_15 ); let t_a6 = calc_f64!( x0 + t_a6_1 + t_a6_2 + t_a6_3 + t_a6_4 + t_a6_5 + t_a6_6 + t_a6_7 + t_a6_8 + t_a6_9 + t_a6_10 + t_a6_11 + t_a6_12 + t_a6_13 + t_a6_14 + t_a6_15 ); let t_a7 = calc_f64!( x0 + t_a7_1 + t_a7_2 + t_a7_3 + t_a7_4 + t_a7_5 + t_a7_6 + t_a7_7 + t_a7_8 + t_a7_9 + t_a7_10 + t_a7_11 + t_a7_12 + t_a7_13 + t_a7_14 + t_a7_15 ); let t_a8 = calc_f64!( x0 + t_a8_1 + t_a8_2 + t_a8_3 + t_a8_4 + t_a8_5 + t_a8_6 + t_a8_7 + t_a8_8 + t_a8_9 + t_a8_10 + t_a8_11 + t_a8_12 + t_a8_13 + t_a8_14 + t_a8_15 ); let t_a9 = calc_f64!( x0 + t_a9_1 + t_a9_2 + t_a9_3 + t_a9_4 + t_a9_5 + t_a9_6 + t_a9_7 + t_a9_8 + t_a9_9 + t_a9_10 + t_a9_11 + t_a9_12 + t_a9_13 + t_a9_14 + t_a9_15 ); let t_a10 = calc_f64!( x0 + t_a10_1 + t_a10_2 + t_a10_3 + t_a10_4 + t_a10_5 + t_a10_6 + t_a10_7 + t_a10_8 + t_a10_9 + t_a10_10 + t_a10_11 + t_a10_12 + t_a10_13 + t_a10_14 + t_a10_15 ); let t_a11 = calc_f64!( x0 + t_a11_1 + t_a11_2 + t_a11_3 + t_a11_4 + t_a11_5 + t_a11_6 + t_a11_7 + t_a11_8 + t_a11_9 + t_a11_10 + t_a11_11 + t_a11_12 + t_a11_13 + t_a11_14 + t_a11_15 ); let t_a12 = calc_f64!( x0 + t_a12_1 + t_a12_2 + t_a12_3 + t_a12_4 + t_a12_5 + t_a12_6 + t_a12_7 + t_a12_8 + t_a12_9 + t_a12_10 + t_a12_11 + t_a12_12 + t_a12_13 + t_a12_14 + t_a12_15 ); let t_a13 = calc_f64!( x0 + t_a13_1 + t_a13_2 + t_a13_3 + t_a13_4 + t_a13_5 + t_a13_6 + t_a13_7 + t_a13_8 + t_a13_9 + t_a13_10 + t_a13_11 + t_a13_12 + t_a13_13 + t_a13_14 + t_a13_15 ); let t_a14 = calc_f64!( x0 + t_a14_1 + t_a14_2 + t_a14_3 + t_a14_4 + t_a14_5 + t_a14_6 + t_a14_7 + t_a14_8 + t_a14_9 + t_a14_10 + t_a14_11 + t_a14_12 + t_a14_13 + t_a14_14 + t_a14_15 ); let t_a15 = calc_f64!( x0 + t_a15_1 + t_a15_2 + t_a15_3 + t_a15_4 + t_a15_5 + t_a15_6 + t_a15_7 + t_a15_8 + t_a15_9 + t_a15_10 + t_a15_11 + t_a15_12 + t_a15_13 + t_a15_14 + t_a15_15 ); let t_b1 = calc_f64!( t_b1_1 + t_b1_2 + t_b1_3 + t_b1_4 + t_b1_5 + t_b1_6 + t_b1_7 + t_b1_8 + t_b1_9 + 
t_b1_10 + t_b1_11 + t_b1_12 + t_b1_13 + t_b1_14 + t_b1_15 ); let t_b2 = calc_f64!( t_b2_1 + t_b2_2 + t_b2_3 + t_b2_4 + t_b2_5 + t_b2_6 + t_b2_7 - t_b2_8 - t_b2_9 - t_b2_10 - t_b2_11 - t_b2_12 - t_b2_13 - t_b2_14 - t_b2_15 ); let t_b3 = calc_f64!( t_b3_1 + t_b3_2 + t_b3_3 + t_b3_4 + t_b3_5 - t_b3_6 - t_b3_7 - t_b3_8 - t_b3_9 - t_b3_10 + t_b3_11 + t_b3_12 + t_b3_13 + t_b3_14 + t_b3_15 ); let t_b4 = calc_f64!( t_b4_1 + t_b4_2 + t_b4_3 - t_b4_4 - t_b4_5 - t_b4_6 - t_b4_7 + t_b4_8 + t_b4_9 + t_b4_10 + t_b4_11 - t_b4_12 - t_b4_13 - t_b4_14 - t_b4_15 ); let t_b5 = calc_f64!( t_b5_1 + t_b5_2 + t_b5_3 - t_b5_4 - t_b5_5 - t_b5_6 + t_b5_7 + t_b5_8 + t_b5_9 - t_b5_10 - t_b5_11 - t_b5_12 + t_b5_13 + t_b5_14 + t_b5_15 ); let t_b6 = calc_f64!( t_b6_1 + t_b6_2 - t_b6_3 - t_b6_4 - t_b6_5 + t_b6_6 + t_b6_7 - t_b6_8 - t_b6_9 - t_b6_10 + t_b6_11 + t_b6_12 - t_b6_13 - t_b6_14 - t_b6_15 ); let t_b7 = calc_f64!( t_b7_1 + t_b7_2 - t_b7_3 - t_b7_4 + t_b7_5 + t_b7_6 - t_b7_7 - t_b7_8 + t_b7_9 + t_b7_10 + t_b7_11 - t_b7_12 - t_b7_13 + t_b7_14 + t_b7_15 ); let t_b8 = calc_f64!( t_b8_1 - t_b8_2 - t_b8_3 + t_b8_4 + t_b8_5 - t_b8_6 - t_b8_7 + t_b8_8 + t_b8_9 - t_b8_10 - t_b8_11 + t_b8_12 + t_b8_13 - t_b8_14 - t_b8_15 ); let t_b9 = calc_f64!( t_b9_1 - t_b9_2 - t_b9_3 + t_b9_4 + t_b9_5 - t_b9_6 + t_b9_7 + t_b9_8 - t_b9_9 - t_b9_10 + t_b9_11 + t_b9_12 - t_b9_13 + t_b9_14 + t_b9_15 ); let t_b10 = calc_f64!( t_b10_1 - t_b10_2 - t_b10_3 + t_b10_4 - t_b10_5 - t_b10_6 + t_b10_7 - t_b10_8 - t_b10_9 + t_b10_10 - t_b10_11 - t_b10_12 + t_b10_13 - t_b10_14 - t_b10_15 ); let t_b11 = calc_f64!( t_b11_1 - t_b11_2 + t_b11_3 + t_b11_4 - t_b11_5 + t_b11_6 + t_b11_7 - t_b11_8 + t_b11_9 - t_b11_10 - t_b11_11 + t_b11_12 - t_b11_13 - t_b11_14 + t_b11_15 ); let t_b12 = calc_f64!( t_b12_1 - t_b12_2 + t_b12_3 - t_b12_4 - t_b12_5 + t_b12_6 - t_b12_7 + t_b12_8 + t_b12_9 - t_b12_10 + t_b12_11 - t_b12_12 + t_b12_13 + t_b12_14 - t_b12_15 ); let t_b13 = calc_f64!( t_b13_1 - t_b13_2 + t_b13_3 - t_b13_4 + t_b13_5 - t_b13_6 - t_b13_7 + t_b13_8 - t_b13_9 + t_b13_10 - t_b13_11 + t_b13_12 + t_b13_13 - t_b13_14 + t_b13_15 ); let t_b14 = calc_f64!( t_b14_1 - t_b14_2 + t_b14_3 - t_b14_4 + t_b14_5 - t_b14_6 + t_b14_7 - t_b14_8 + t_b14_9 - t_b14_10 - t_b14_11 + t_b14_12 - t_b14_13 + t_b14_14 - t_b14_15 ); let t_b15 = calc_f64!( t_b15_1 - t_b15_2 + t_b15_3 - t_b15_4 + t_b15_5 - t_b15_6 + t_b15_7 - t_b15_8 + t_b15_9 - t_b15_10 + t_b15_11 - t_b15_12 + t_b15_13 - t_b15_14 + t_b15_15 ); let t_b1_rot = self.rotate.rotate(t_b1); let t_b2_rot = self.rotate.rotate(t_b2); let t_b3_rot = self.rotate.rotate(t_b3); let t_b4_rot = self.rotate.rotate(t_b4); let t_b5_rot = self.rotate.rotate(t_b5); let t_b6_rot = self.rotate.rotate(t_b6); let t_b7_rot = self.rotate.rotate(t_b7); let t_b8_rot = self.rotate.rotate(t_b8); let t_b9_rot = self.rotate.rotate(t_b9); let t_b10_rot = self.rotate.rotate(t_b10); let t_b11_rot = self.rotate.rotate(t_b11); let t_b12_rot = self.rotate.rotate(t_b12); let t_b13_rot = self.rotate.rotate(t_b13); let t_b14_rot = self.rotate.rotate(t_b14); let t_b15_rot = self.rotate.rotate(t_b15); let y0 = calc_f64!( x0 + x1p30 + x2p29 + x3p28 + x4p27 + x5p26 + x6p25 + x7p24 + x8p23 + x9p22 + x10p21 + x11p20 + x12p19 + x13p18 + x14p17 + x15p16 ); let [y1, y30] = solo_fft2_f64(t_a1, t_b1_rot); let [y2, y29] = solo_fft2_f64(t_a2, t_b2_rot); let [y3, y28] = solo_fft2_f64(t_a3, t_b3_rot); let [y4, y27] = solo_fft2_f64(t_a4, t_b4_rot); let [y5, y26] = solo_fft2_f64(t_a5, t_b5_rot); let [y6, y25] = solo_fft2_f64(t_a6, t_b6_rot); let [y7, y24] = solo_fft2_f64(t_a7, 
t_b7_rot);
        let [y8, y23] = solo_fft2_f64(t_a8, t_b8_rot);
        let [y9, y22] = solo_fft2_f64(t_a9, t_b9_rot);
        let [y10, y21] = solo_fft2_f64(t_a10, t_b10_rot);
        let [y11, y20] = solo_fft2_f64(t_a11, t_b11_rot);
        let [y12, y19] = solo_fft2_f64(t_a12, t_b12_rot);
        let [y13, y18] = solo_fft2_f64(t_a13, t_b13_rot);
        let [y14, y17] = solo_fft2_f64(t_a14, t_b14_rot);
        let [y15, y16] = solo_fft2_f64(t_a15, t_b15_rot);
        [
            y0, y1, y2, y3, y4, y5, y6, y7, y8, y9, y10, y11, y12, y13, y14, y15, y16, y17, y18,
            y19, y20, y21, y22, y23, y24, y25, y26, y27, y28, y29, y30,
        ]
    }
}

// _____ _____ ____ _____ ____
// |_ _| ____/ ___|_ _/ ___|
// | | | _| \___ \ | | \___ \
// | | | |___ ___) || | ___) |
// |_| |_____|____/ |_| |____/
//

#[cfg(test)]
mod unit_tests {
    use super::*;
    use crate::test_utils::check_fft_algorithm;
    use wasm_bindgen_test::wasm_bindgen_test;

    //the tests for all butterflies will be identical except for the identifiers used and size
    //so it's ideal for a macro
    macro_rules! test_butterfly_32_func {
        ($test_name:ident, $struct_name:ident, $size:expr) => {
            #[wasm_bindgen_test]
            fn $test_name() {
                let butterfly = $struct_name::new(FftDirection::Forward);
                check_fft_algorithm::<f32>(&butterfly, $size, FftDirection::Forward);

                let butterfly_direction = $struct_name::new(FftDirection::Inverse);
                check_fft_algorithm::<f32>(&butterfly_direction, $size, FftDirection::Inverse);
            }
        };
    }
    test_butterfly_32_func!(test_wasm_simdf32_butterfly7, WasmSimdF32Butterfly7, 7);
    test_butterfly_32_func!(test_wasm_simdf32_butterfly11, WasmSimdF32Butterfly11, 11);
    test_butterfly_32_func!(test_wasm_simdf32_butterfly13, WasmSimdF32Butterfly13, 13);
    test_butterfly_32_func!(test_wasm_simdf32_butterfly17, WasmSimdF32Butterfly17, 17);
    test_butterfly_32_func!(test_wasm_simdf32_butterfly19, WasmSimdF32Butterfly19, 19);
    test_butterfly_32_func!(test_wasm_simdf32_butterfly23, WasmSimdF32Butterfly23, 23);
    test_butterfly_32_func!(test_wasm_simdf32_butterfly29, WasmSimdF32Butterfly29, 29);
    test_butterfly_32_func!(test_wasm_simdf32_butterfly31, WasmSimdF32Butterfly31, 31);

    //the tests for all butterflies will be identical except for the identifiers used and size
    //so it's ideal for a macro
    macro_rules! test_butterfly_64_func {
        ($test_name:ident, $struct_name:ident, $size:expr) => {
            #[test]
            fn $test_name() {
                let butterfly = $struct_name::new(FftDirection::Forward);
                check_fft_algorithm::<f64>(&butterfly, $size, FftDirection::Forward);

                let butterfly_direction = $struct_name::new(FftDirection::Inverse);
                check_fft_algorithm::<f64>(&butterfly_direction, $size, FftDirection::Inverse);
            }
        };
    }
    test_butterfly_64_func!(test_wasm_simdf64_butterfly7, WasmSimdF64Butterfly7, 7);
    test_butterfly_64_func!(test_wasm_simdf64_butterfly11, WasmSimdF64Butterfly11, 11);
    test_butterfly_64_func!(test_wasm_simdf64_butterfly13, WasmSimdF64Butterfly13, 13);
    test_butterfly_64_func!(test_wasm_simdf64_butterfly17, WasmSimdF64Butterfly17, 17);
    test_butterfly_64_func!(test_wasm_simdf64_butterfly19, WasmSimdF64Butterfly19, 19);
    test_butterfly_64_func!(test_wasm_simdf64_butterfly23, WasmSimdF64Butterfly23, 23);
    test_butterfly_64_func!(test_wasm_simdf64_butterfly29, WasmSimdF64Butterfly29, 29);
    test_butterfly_64_func!(test_wasm_simdf64_butterfly31, WasmSimdF64Butterfly31, 31);
}

rustfft-6.2.0/src/wasm_simd/wasm_simd_radix4.rs

use num_complex::Complex;

use core::arch::wasm32::*;

use crate::algorithm::bitreversed_transpose;
use crate::array_utils;
use crate::array_utils::workaround_transmute_mut;
use crate::common::{fft_error_inplace, fft_error_outofplace};
use crate::wasm_simd::wasm_simd_butterflies::{
    WasmSimdF32Butterfly1, WasmSimdF32Butterfly16, WasmSimdF32Butterfly2, WasmSimdF32Butterfly32,
    WasmSimdF32Butterfly4, WasmSimdF32Butterfly8,
};
use crate::wasm_simd::wasm_simd_butterflies::{
    WasmSimdF64Butterfly1, WasmSimdF64Butterfly16, WasmSimdF64Butterfly2, WasmSimdF64Butterfly32,
    WasmSimdF64Butterfly4, WasmSimdF64Butterfly8,
};
use crate::{common::FftNum, twiddles, FftDirection};
use crate::{Direction, Fft, Length};

use super::wasm_simd_common::{assert_f32, assert_f64};
use super::wasm_simd_utils::*;
use super::wasm_simd_vector::{WasmSimdArray, WasmSimdArrayMut};

/// FFT algorithm optimized for power-of-two sizes, WasmSimd accelerated version.
/// This is designed to be used via a Planner, and not created directly.

const USE_BUTTERFLY32_FROM: usize = 262144; // Use length 32 butterfly starting from this length

enum WasmSimd32Butterfly<T> {
    Len1(WasmSimdF32Butterfly1<T>),
    Len2(WasmSimdF32Butterfly2<T>),
    Len4(WasmSimdF32Butterfly4<T>),
    Len8(WasmSimdF32Butterfly8<T>),
    Len16(WasmSimdF32Butterfly16<T>),
    Len32(WasmSimdF32Butterfly32<T>),
}

enum WasmSimd64Butterfly<T> {
    Len1(WasmSimdF64Butterfly1<T>),
    Len2(WasmSimdF64Butterfly2<T>),
    Len4(WasmSimdF64Butterfly4<T>),
    Len8(WasmSimdF64Butterfly8<T>),
    Len16(WasmSimdF64Butterfly16<T>),
    Len32(WasmSimdF64Butterfly32<T>),
}

pub struct WasmSimd32Radix4<T> {
    _phantom: std::marker::PhantomData<T>,
    twiddles: Box<[v128]>,
    base_fft: WasmSimd32Butterfly<T>,
    base_len: usize,
    len: usize,
    direction: FftDirection,
    bf4: WasmSimdF32Butterfly4<T>,
}

impl<T: FftNum> WasmSimd32Radix4<T> {
    /// Preallocates necessary arrays and precomputes necessary data to efficiently compute the power-of-two FFT
    pub fn new(len: usize, direction: FftDirection) -> Self {
        assert!(
            len.is_power_of_two(),
            "Radix4 algorithm requires a power-of-two input size.
Got {}", len ); assert_f32::(); // figure out which base length we're going to use let num_bits = len.trailing_zeros(); let (base_len, base_fft) = match num_bits { 0 => ( len, WasmSimd32Butterfly::Len1(WasmSimdF32Butterfly1::new(direction)), ), 1 => ( len, WasmSimd32Butterfly::Len2(WasmSimdF32Butterfly2::new(direction)), ), 2 => ( len, WasmSimd32Butterfly::Len4(WasmSimdF32Butterfly4::new(direction)), ), 3 => ( len, WasmSimd32Butterfly::Len8(WasmSimdF32Butterfly8::new(direction)), ), _ => { if num_bits % 2 == 1 { if len < USE_BUTTERFLY32_FROM { ( 8, WasmSimd32Butterfly::Len8(WasmSimdF32Butterfly8::new(direction)), ) } else { ( 32, WasmSimd32Butterfly::Len32(WasmSimdF32Butterfly32::new(direction)), ) } } else { ( 16, WasmSimd32Butterfly::Len16(WasmSimdF32Butterfly16::new(direction)), ) } } }; // precompute the twiddle factors this algorithm will use. // we're doing the same precomputation of twiddle factors as the mixed radix algorithm where width=4 and height=len/4 // but mixed radix only does one step and then calls itself recusrively, and this algorithm does every layer all the way down // so we're going to pack all the "layers" of twiddle factors into a single array, starting with the bottom layer and going up let mut twiddle_stride = len / (base_len * 4); let mut twiddle_factors = Vec::with_capacity(len * 2); while twiddle_stride > 0 { let num_rows = len / (twiddle_stride * 4); for i in 0..num_rows / 2 { for k in 1..4 { let twiddle_a = twiddles::compute_twiddle::( 2 * i * k * twiddle_stride, len, direction, ); let twiddle_b = twiddles::compute_twiddle::( (2 * i + 1) * k * twiddle_stride, len, direction, ); let twiddles_packed = unsafe { [twiddle_a, twiddle_b].as_slice().load_complex(0) }; twiddle_factors.push(twiddles_packed); } } twiddle_stride >>= 2; } Self { twiddles: twiddle_factors.into_boxed_slice(), base_fft, base_len, len, direction, _phantom: std::marker::PhantomData, bf4: WasmSimdF32Butterfly4::::new(direction), } } #[target_feature(enable = "simd128")] unsafe fn perform_fft_out_of_place( &self, signal: &[Complex], spectrum: &mut [Complex], _scratch: &mut [Complex], ) { // copy the data into the spectrum vector if self.len() == self.base_len { spectrum.copy_from_slice(signal); } else { bitreversed_transpose(self.base_len, signal, spectrum); } // Base-level FFTs match &self.base_fft { WasmSimd32Butterfly::Len1(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), WasmSimd32Butterfly::Len2(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), WasmSimd32Butterfly::Len4(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), WasmSimd32Butterfly::Len8(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), WasmSimd32Butterfly::Len16(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), WasmSimd32Butterfly::Len32(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), }; // cross-FFTs let mut current_size = self.base_len * 4; let mut layer_twiddles: &[v128] = &self.twiddles; while current_size <= signal.len() { let num_rows = signal.len() / current_size; for i in 0..num_rows { butterfly_4_32( &mut spectrum[i * current_size..], layer_twiddles, current_size / 4, &self.bf4, ) } //skip past all the twiddle factors used in this layer let twiddle_offset = (current_size * 3) / 8; layer_twiddles = &layer_twiddles[twiddle_offset..]; current_size *= 4; } } } boilerplate_fft_wasm_simd_oop!(WasmSimd32Radix4, |this: &WasmSimd32Radix4<_>| this.len); #[target_feature(enable = "simd128")] unsafe fn butterfly_4_32( data: &mut [Complex], twiddles: &[v128], num_ffts: usize, bf4: 
&WasmSimdF32Butterfly4, ) { let mut idx = 0usize; let mut buffer: &mut [Complex] = workaround_transmute_mut(data); for tw in twiddles.chunks_exact(6).take(num_ffts / 4) { let scratch0 = buffer.load_complex(idx); let scratch0b = buffer.load_complex(idx + 2); let mut scratch1 = buffer.load_complex(idx + 1 * num_ffts); let mut scratch1b = buffer.load_complex(idx + 2 + 1 * num_ffts); let mut scratch2 = buffer.load_complex(idx + 2 * num_ffts); let mut scratch2b = buffer.load_complex(idx + 2 + 2 * num_ffts); let mut scratch3 = buffer.load_complex(idx + 3 * num_ffts); let mut scratch3b = buffer.load_complex(idx + 2 + 3 * num_ffts); scratch1 = mul_complex_f32(scratch1, tw[0]); scratch2 = mul_complex_f32(scratch2, tw[1]); scratch3 = mul_complex_f32(scratch3, tw[2]); scratch1b = mul_complex_f32(scratch1b, tw[3]); scratch2b = mul_complex_f32(scratch2b, tw[4]); scratch3b = mul_complex_f32(scratch3b, tw[5]); let scratch = bf4.perform_parallel_fft_direct(scratch0, scratch1, scratch2, scratch3); let scratchb = bf4.perform_parallel_fft_direct(scratch0b, scratch1b, scratch2b, scratch3b); buffer.store_complex(scratch[0], idx); buffer.store_complex(scratchb[0], idx + 2); buffer.store_complex(scratch[1], idx + 1 * num_ffts); buffer.store_complex(scratchb[1], idx + 2 + 1 * num_ffts); buffer.store_complex(scratch[2], idx + 2 * num_ffts); buffer.store_complex(scratchb[2], idx + 2 + 2 * num_ffts); buffer.store_complex(scratch[3], idx + 3 * num_ffts); buffer.store_complex(scratchb[3], idx + 2 + 3 * num_ffts); idx += 4; } } pub struct WasmSimd64Radix4 { _phantom: std::marker::PhantomData, twiddles: Box<[v128]>, base_fft: WasmSimd64Butterfly, base_len: usize, len: usize, direction: FftDirection, bf4: WasmSimdF64Butterfly4, } impl WasmSimd64Radix4 { /// Preallocates necessary arrays and precomputes necessary data to efficiently compute the power-of-two FFT pub fn new(len: usize, direction: FftDirection) -> Self { assert!( len.is_power_of_two(), "Radix4 algorithm requires a power-of-two input size. Got {}", len ); assert_f64::(); // figure out which base length we're going to use let num_bits = len.trailing_zeros(); let (base_len, base_fft) = match num_bits { 0 => ( len, WasmSimd64Butterfly::Len1(WasmSimdF64Butterfly1::new(direction)), ), 1 => ( len, WasmSimd64Butterfly::Len2(WasmSimdF64Butterfly2::new(direction)), ), 2 => ( len, WasmSimd64Butterfly::Len4(WasmSimdF64Butterfly4::new(direction)), ), 3 => ( len, WasmSimd64Butterfly::Len8(WasmSimdF64Butterfly8::new(direction)), ), _ => { if num_bits % 2 == 1 { if len < USE_BUTTERFLY32_FROM { ( 8, WasmSimd64Butterfly::Len8(WasmSimdF64Butterfly8::new(direction)), ) } else { ( 32, WasmSimd64Butterfly::Len32(WasmSimdF64Butterfly32::new(direction)), ) } } else { ( 16, WasmSimd64Butterfly::Len16(WasmSimdF64Butterfly16::new(direction)), ) } } }; // precompute the twiddle factors this algorithm will use. 
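        // (Added summary of the loop below, nothing new: for this f64 path each pushed v128 holds a
        // single complex twiddle w^(i * k * twiddle_stride) with k = 1, 2, 3, where w is the
        // primitive len-th root of unity for the chosen direction.)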
// we're doing the same precomputation of twiddle factors as the mixed radix algorithm where width=4 and height=len/4 // but mixed radix only does one step and then calls itself recusrively, and this algorithm does every layer all the way down // so we're going to pack all the "layers" of twiddle factors into a single array, starting with the bottom layer and going up let mut twiddle_stride = len / (base_len * 4); let mut twiddle_factors = Vec::with_capacity(len * 2); while twiddle_stride > 0 { let num_rows = len / (twiddle_stride * 4); for i in 0..num_rows { for k in 1..4 { let twiddle = twiddles::compute_twiddle::(i * k * twiddle_stride, len, direction); let twiddle_packed = unsafe { [twiddle].as_slice().load_complex(0) }; twiddle_factors.push(twiddle_packed); } } twiddle_stride >>= 2; } Self { twiddles: twiddle_factors.into_boxed_slice(), base_fft, base_len, len, direction, _phantom: std::marker::PhantomData, bf4: WasmSimdF64Butterfly4::::new(direction), } } #[target_feature(enable = "simd128")] unsafe fn perform_fft_out_of_place( &self, signal: &[Complex], spectrum: &mut [Complex], _scratch: &mut [Complex], ) { // copy the data into the spectrum vector if self.len() == self.base_len { spectrum.copy_from_slice(signal); } else { bitreversed_transpose(self.base_len, signal, spectrum); } // Base-level FFTs match &self.base_fft { WasmSimd64Butterfly::Len1(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), WasmSimd64Butterfly::Len2(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), WasmSimd64Butterfly::Len4(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), WasmSimd64Butterfly::Len8(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), WasmSimd64Butterfly::Len16(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), WasmSimd64Butterfly::Len32(bf) => bf.perform_fft_butterfly_multi(spectrum).unwrap(), } // cross-FFTs let mut current_size = self.base_len * 4; let mut layer_twiddles: &[v128] = &self.twiddles; while current_size <= signal.len() { let num_rows = signal.len() / current_size; for i in 0..num_rows { butterfly_4_64( &mut spectrum[i * current_size..], layer_twiddles, current_size / 4, &self.bf4, ) } //skip past all the twiddle factors used in this layer let twiddle_offset = (current_size * 3) / 4; layer_twiddles = &layer_twiddles[twiddle_offset..]; current_size *= 4; } } } boilerplate_fft_wasm_simd_oop!(WasmSimd64Radix4, |this: &WasmSimd64Radix4<_>| this.len); #[target_feature(enable = "simd128")] unsafe fn butterfly_4_64( data: &mut [Complex], twiddles: &[v128], num_ffts: usize, bf4: &WasmSimdF64Butterfly4, ) { let mut idx = 0usize; let mut buffer: &mut [Complex] = workaround_transmute_mut(data); for tw in twiddles.chunks_exact(6).take(num_ffts / 2) { let scratch0 = buffer.load_complex(idx); let scratch0b = buffer.load_complex(idx + 1); let mut scratch1 = buffer.load_complex(idx + 1 * num_ffts); let mut scratch1b = buffer.load_complex(idx + 1 + 1 * num_ffts); let mut scratch2 = buffer.load_complex(idx + 2 * num_ffts); let mut scratch2b = buffer.load_complex(idx + 1 + 2 * num_ffts); let mut scratch3 = buffer.load_complex(idx + 3 * num_ffts); let mut scratch3b = buffer.load_complex(idx + 1 + 3 * num_ffts); scratch1 = mul_complex_f64(scratch1, tw[0]); scratch2 = mul_complex_f64(scratch2, tw[1]); scratch3 = mul_complex_f64(scratch3, tw[2]); scratch1b = mul_complex_f64(scratch1b, tw[3]); scratch2b = mul_complex_f64(scratch2b, tw[4]); scratch3b = mul_complex_f64(scratch3b, tw[5]); let scratch = bf4.perform_fft_direct(scratch0, scratch1, scratch2, scratch3); let 
scratchb = bf4.perform_fft_direct(scratch0b, scratch1b, scratch2b, scratch3b); buffer.store_complex(scratch[0], idx); buffer.store_complex(scratchb[0], idx + 1); buffer.store_complex(scratch[1], idx + 1 * num_ffts); buffer.store_complex(scratchb[1], idx + 1 + 1 * num_ffts); buffer.store_complex(scratch[2], idx + 2 * num_ffts); buffer.store_complex(scratchb[2], idx + 1 + 2 * num_ffts); buffer.store_complex(scratch[3], idx + 3 * num_ffts); buffer.store_complex(scratchb[3], idx + 1 + 3 * num_ffts); idx += 2; } } #[cfg(test)] mod unit_tests { use super::*; use crate::test_utils::check_fft_algorithm; use wasm_bindgen_test::wasm_bindgen_test; #[wasm_bindgen_test] fn test_wasm_simd_radix4_64() { for pow in 4..12 { let len = 1 << pow; test_wasm_simd_radix4_64_with_length(len, FftDirection::Forward); test_wasm_simd_radix4_64_with_length(len, FftDirection::Inverse); } } fn test_wasm_simd_radix4_64_with_length(len: usize, direction: FftDirection) { let fft = WasmSimd64Radix4::new(len, direction); check_fft_algorithm::(&fft, len, direction); } #[wasm_bindgen_test] fn test_wasm_simd_radix4_32() { for pow in 0..12 { let len = 1 << pow; test_wasm_simd_radix4_32_with_length(len, FftDirection::Forward); test_wasm_simd_radix4_32_with_length(len, FftDirection::Inverse); } } fn test_wasm_simd_radix4_32_with_length(len: usize, direction: FftDirection) { let fft = WasmSimd32Radix4::new(len, direction); check_fft_algorithm::(&fft, len, direction); } } rustfft-6.2.0/src/wasm_simd/wasm_simd_utils.rs000064400000000000000000000252500072674642500177050ustar 00000000000000use core::arch::wasm32::*; // __ __ _ _ _________ _ _ _ // | \/ | __ _| |_| |__ |___ /___ \| |__ (_) |_ // | |\/| |/ _` | __| '_ \ _____ |_ \ __) | '_ \| | __| // | | | | (_| | |_| | | | |_____| ___) / __/| |_) | | |_ // |_| |_|\__,_|\__|_| |_| |____/_____|_.__/|_|\__| // /// Utility functions to rotate complex numbers by 90 degrees pub struct Rotate90F32 { sign_hi: v128, sign_both: v128, } impl Rotate90F32 { pub fn new(positive: bool) -> Self { let sign_hi = if positive { f32x4(0.0, 0.0, -0.0, 0.0) } else { f32x4(0.0, 0.0, 0.0, -0.0) }; let sign_both = if positive { f32x4(-0.0, 0.0, -0.0, 0.0) } else { f32x4(0.0, -0.0, 0.0, -0.0) }; Self { sign_hi, sign_both } } #[inline(always)] pub fn rotate_hi(&self, values: v128) -> v128 { v128_xor(u32x4_shuffle::<0, 1, 3, 2>(values, values), self.sign_hi) } #[inline(always)] pub unsafe fn rotate_both(&self, values: v128) -> v128 { v128_xor(u32x4_shuffle::<1, 0, 3, 2>(values, values), self.sign_both) } } /// Pack low (1st) complex /// left: l1.re, l1.im, l2.re, l2.im /// right: r1.re, r1.im, r2.re, r2.im /// --> l1.re, l1.im, r1.re, r1.im #[inline(always)] pub fn extract_lo_lo_f32(left: v128, right: v128) -> v128 { u32x4_shuffle::<0, 1, 4, 5>(left, right) } /// Pack high (2nd) complex /// left: l1.re, l1.im, l2.re, l2.im /// right: r1.re, r1.im, r2.re, r2.im /// --> l2.re, l2.im, r2.re, r2.im #[inline(always)] pub fn extract_hi_hi_f32(left: v128, right: v128) -> v128 { u32x4_shuffle::<2, 3, 6, 7>(left, right) } /// Pack low (1st) and high (2nd) complex /// left: l1.re, l1.im, l2.re, l2.im /// right: r1.re, r1.im, r2.re, r2.im /// --> l1.re, l1.im, r2.re, r2.im #[inline(always)] pub fn extract_lo_hi_f32(left: v128, right: v128) -> v128 { u32x4_shuffle::<0, 1, 6, 7>(left, right) } /// Pack high (2nd) and low (1st) complex /// left: r1.re, r1.im, r2.re, r2.im /// right: l1.re, l1.im, l2.re, l2.im /// --> r2.re, r2.im, l1.re, l1.im #[inline(always)] pub fn extract_hi_lo_f32(left: v128, right: v128) -> v128 { 
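    // (Added clarifying comment: lanes 2..3 of `left` followed by lanes 0..1 of `right`, i.e. the
    // second complex of `left` and the first complex of `right`.)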
u32x4_shuffle::<2, 3, 4, 5>(left, right) } /// Reverse complex /// values: a.re, a.im, b.re, b.im /// --> b.re, b.im, a.re, a.im #[inline(always)] pub fn reverse_complex_elements_f32(values: v128) -> v128 { u64x2_shuffle::<1, 0>(values, values) } /// Reverse complex and then negate hi complex /// values: a.re, a.im, b.re, b.im /// --> b.re, b.im, -a.re, -a.im #[inline(always)] pub unsafe fn reverse_complex_and_negate_hi_f32(values: v128) -> v128 { v128_xor( u32x4_shuffle::<2, 3, 0, 1>(values, values), f32x4(0.0, 0.0, -0.0, -0.0), ) } // Invert sign of high (2nd) complex // values: a.re, a.im, b.re, b.im // --> a.re, a.im, -b.re, -b.im //#[inline(always)] //pub unsafe fn negate_hi_f32(values: float32x4_t) -> float32x4_t { // vcombine_f32(vget_low_f32(values), vneg_f32(vget_high_f32(values))) //} /// Duplicate low (1st) complex /// values: a.re, a.im, b.re, b.im /// --> a.re, a.im, a.re, a.im #[inline(always)] pub unsafe fn duplicate_lo_f32(values: v128) -> v128 { u64x2_shuffle::<0, 0>(values, values) } /// Duplicate high (2nd) complex /// values: a.re, a.im, b.re, b.im /// --> b.re, b.im, b.re, b.im #[inline(always)] pub unsafe fn duplicate_hi_f32(values: v128) -> v128 { u64x2_shuffle::<1, 1>(values, values) } /// transpose a 2x2 complex matrix given as [x0, x1], [x2, x3] /// result is [x0, x2], [x1, x3] #[inline(always)] pub unsafe fn transpose_complex_2x2_f32(left: v128, right: v128) -> [v128; 2] { let temp02 = extract_lo_lo_f32(left, right); let temp13 = extract_hi_hi_f32(left, right); [temp02, temp13] } /// Complex multiplication. /// Each input contains two complex values, which are multiplied in parallel. #[inline(always)] pub unsafe fn mul_complex_f32(left: v128, right: v128) -> v128 { let temp1 = u32x4_shuffle::<0, 4, 2, 6>(right, right); let temp2 = u32x4_shuffle::<1, 5, 3, 7>(right, f32x4_neg(right)); let temp3 = f32x4_mul(temp2, left); let temp4 = u32x4_shuffle::<1, 0, 3, 2>(temp3, temp3); let temp5 = f32x4_mul(temp1, left); f32x4_add(temp4, temp5) } // __ __ _ _ __ _ _ _ _ _ // | \/ | __ _| |_| |__ / /_ | || | | |__ (_) |_ // | |\/| |/ _` | __| '_ \ _____ | '_ \| || |_| '_ \| | __| // | | | | (_| | |_| | | | |_____| | (_) |__ _| |_) | | |_ // |_| |_|\__,_|\__|_| |_| \___/ |_| |_.__/|_|\__| // /// Utility functions to rotate complex pointers by 90 degrees pub(crate) struct Rotate90F64 { sign: v128, } impl Rotate90F64 { pub fn new(positive: bool) -> Self { let sign = if positive { f64x2(-0.0, 0.0) } else { f64x2(0.0, -0.0) }; Self { sign } } #[inline(always)] pub unsafe fn rotate(&self, values: v128) -> v128 { v128_xor(u64x2_shuffle::<1, 0>(values, values), self.sign) } } #[inline(always)] pub unsafe fn mul_complex_f64(left: v128, right: v128) -> v128 { const NEGATE_LEFT: v128 = f64x2(-0.0, 0.0); let temp = v128_xor(u64x2_shuffle::<1, 0>(left, left), NEGATE_LEFT); let sum = f64x2_mul(left, u64x2_shuffle::<0, 0>(right, right)); f64x2_add(sum, f64x2_mul(temp, u64x2_shuffle::<1, 1>(right, right))) } #[cfg(test)] mod unit_tests { use super::*; use num_complex::Complex; use wasm_bindgen_test::wasm_bindgen_test; #[wasm_bindgen_test] fn test_positive_rotation_f32() { unsafe { let rotate = Rotate90F32::new(true); let input = f32x4(1.0, 2.0, 69.0, 420.0); let actual_hi = rotate.rotate_hi(input); let expected_hi = f32x4(1.0, 2.0, -420.0, 69.0); assert_eq!( std::mem::transmute::; 2]>(actual_hi), std::mem::transmute::; 2]>(expected_hi) ); let actual = rotate.rotate_both(input); let expected = f32x4(-2.0, 1.0, -420.0, 69.0); assert_eq!( std::mem::transmute::; 2]>(actual), 
std::mem::transmute::; 2]>(expected) ); } } #[wasm_bindgen_test] fn test_negative_rotation_f32() { unsafe { let rotate = Rotate90F32::new(false); let input = f32x4(1.0, 2.0, 69.0, 420.0); let actual_hi = rotate.rotate_hi(input); let expected_hi = f32x4(1.0, 2.0, 420.0, -69.0); assert_eq!( std::mem::transmute::; 2]>(actual_hi), std::mem::transmute::; 2]>(expected_hi) ); let actual = rotate.rotate_both(input); let expected = f32x4(2.0, -1.0, 420.0, -69.0); assert_eq!( std::mem::transmute::; 2]>(actual), std::mem::transmute::; 2]>(expected) ); } } #[wasm_bindgen_test] fn test_negative_rotation_f64() { unsafe { let rotate = Rotate90F64::new(false); let input = f64x2(69.0, 420.0); let actual = rotate.rotate(input); let expected = f64x2(420.0, -69.0); assert_eq!( std::mem::transmute::>(actual), std::mem::transmute::>(expected) ); } } #[wasm_bindgen_test] fn test_positive_rotation_f64() { unsafe { let rotate = Rotate90F64::new(true); let input = f64x2(69.0, 420.0); let actual = rotate.rotate(input); let expected = f64x2(-420.0, 69.0); assert_eq!( std::mem::transmute::>(actual), std::mem::transmute::>(expected) ); } } #[wasm_bindgen_test] fn test_reverse_complex_number_f32() { let input = f32x4(1.0, 5.0, 9.0, 13.0); let actual = reverse_complex_elements_f32(input); let expected = f32x4(9.0, 13.0, 1.0, 5.0); unsafe { assert_eq!( std::mem::transmute::; 2]>(actual), std::mem::transmute::; 2]>(expected) ); } } #[wasm_bindgen_test] fn test_mul_complex_f64() { unsafe { // let right = vld1q_f64([1.0, 2.0].as_ptr()); let right = f64x2(1.0, 2.0); // let left = vld1q_f64([5.0, 7.0].as_ptr()); let left = f64x2(5.0, 7.0); let res = mul_complex_f64(left, right); // let expected = vld1q_f64([1.0 * 5.0 - 2.0 * 7.0, 1.0 * 7.0 + 2.0 * 5.0].as_ptr()); let expected = f64x2(1.0 * 5.0 - 2.0 * 7.0, 1.0 * 7.0 + 2.0 * 5.0); assert_eq!( std::mem::transmute::>(res), std::mem::transmute::>(expected) ); } } #[wasm_bindgen_test] fn test_mul_complex_f32() { unsafe { let val1 = Complex::::new(1.0, 2.5); let val2 = Complex::::new(3.2, 4.75); let val3 = Complex::::new(5.75, 6.25); let val4 = Complex::::new(7.4, 8.5); let nbr2 = v128_load([val3, val4].as_ptr() as *const v128); let nbr1 = v128_load([val1, val2].as_ptr() as *const v128); let res = mul_complex_f32(nbr1, nbr2); let res = std::mem::transmute::; 2]>(res); let expected = [val1 * val3, val2 * val4]; assert_eq!(res, expected); } } #[wasm_bindgen_test] fn test_pack() { unsafe { let nbr2 = f32x4(5.0, 6.0, 7.0, 8.0); let nbr1 = f32x4(1.0, 2.0, 3.0, 4.0); let first = extract_lo_lo_f32(nbr1, nbr2); let second = extract_hi_hi_f32(nbr1, nbr2); let first = std::mem::transmute::; 2]>(first); let second = std::mem::transmute::; 2]>(second); let first_expected = [Complex::new(1.0, 2.0), Complex::new(5.0, 6.0)]; let second_expected = [Complex::new(3.0, 4.0), Complex::new(7.0, 8.0)]; assert_eq!(first, first_expected); assert_eq!(second, second_expected); } } } rustfft-6.2.0/src/wasm_simd/wasm_simd_vector.rs000064400000000000000000000331750072674642500200540ustar 00000000000000use core::arch::wasm32::*; use num_complex::Complex; use std::ops::{Deref, DerefMut}; use crate::array_utils::DoubleBuf; /// Read these indexes from an WasmSimdArray and build an array of simd vectors. /// Takes a name of a vector to read from, and a list of indexes to read. 
/// This statement: /// ``` /// let values = read_complex_to_array!(input, {0, 1, 2, 3}); /// ``` /// is equivalent to: /// ``` /// let values = [ /// input.load_complex(0), /// input.load_complex(1), /// input.load_complex(2), /// input.load_complex(3), /// ]; /// ``` macro_rules! read_complex_to_array { ($input:ident, { $($idx:literal),* }) => { [ $( $input.load_complex($idx), )* ] } } /// Read these indexes from an WasmSimdArray and build an array or partially filled simd vectors. /// Takes a name of a vector to read from, and a list of indexes to read. /// This statement: /// ``` /// let values = read_partial1_complex_to_array!(input, {0, 1, 2, 3}); /// ``` /// is equivalent to: /// ``` /// let values = [ /// input.load1_complex(0), /// input.load1_complex(1), /// input.load1_complex(2), /// input.load1_complex(3), /// ]; /// ``` macro_rules! read_partial1_complex_to_array { ($input:ident, { $($idx:literal),* }) => { [ $( $input.load1_complex($idx), )* ] } } /// Write these indexes of an array of simd vectors to the same indexes of an WasmSimdArray. /// Takes a name of a vector to read from, one to write to, and a list of indexes. /// This statement: /// ``` /// let values = write_complex_to_array!(input, output, {0, 1, 2, 3}); /// ``` /// is equivalent to: /// ``` /// let values = [ /// output.store_complex(input[0], 0), /// output.store_complex(input[1], 1), /// output.store_complex(input[2], 2), /// output.store_complex(input[3], 3), /// ]; /// ``` macro_rules! write_complex_to_array { ($input:ident, $output:ident, { $($idx:literal),* }) => { $( $output.store_complex($input[$idx], $idx); )* } } /// Write the low half of these indexes of an array of simd vectors to the same indexes of an WasmSimdArray. /// Takes a name of a vector to read from, one to write to, and a list of indexes. /// This statement: /// ``` /// let values = write_partial_lo_complex_to_array!(input, output, {0, 1, 2, 3}); /// ``` /// is equivalent to: /// ``` /// let values = [ /// output.store_partial_lo_complex(input[0], 0), /// output.store_partial_lo_complex(input[1], 1), /// output.store_partial_lo_complex(input[2], 2), /// output.store_partial_lo_complex(input[3], 3), /// ]; /// ``` macro_rules! write_partial_lo_complex_to_array { ($input:ident, $output:ident, { $($idx:literal),* }) => { $( $output.store_partial_lo_complex($input[$idx], $idx); )* } } /// Write these indexes of an array of simd vectors to the same indexes, multiplied by a stride, of an WasmSimdArray. /// Takes a name of a vector to read from, one to write to, an integer stride, and a list of indexes. /// This statement: /// ``` /// let values = write_complex_to_array_separate!(input, output, {0, 1, 2, 3}); /// ``` /// is equivalent to: /// ``` /// let values = [ /// output.store_complex(input[0], 0), /// output.store_complex(input[1], 2), /// output.store_complex(input[2], 4), /// output.store_complex(input[3], 6), /// ]; /// ``` macro_rules! write_complex_to_array_strided { ($input:ident, $output:ident, $stride:literal, { $($idx:literal),* }) => { $( $output.store_complex($input[$idx], $idx*$stride); )* } } pub trait WasmSimdNum { type VectorType; const COMPLEX_PER_VECTOR: usize; } impl WasmSimdNum for f32 { type VectorType = v128; const COMPLEX_PER_VECTOR: usize = 2; } impl WasmSimdNum for f64 { type VectorType = v128; const COMPLEX_PER_VECTOR: usize = 1; } /// A trait to handle reading from an array of complex floats into WASM SIMD vectors. 
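/// (Added example, not in the original docs: on a `&[Complex<f32>]`, `load_complex(2)` fills one
/// `v128` with elements 2 and 3 of the slice.)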
/// WASM works with 128-bit vectors, meaning a vector can hold two complex f32, /// or a single complex f64. pub trait WasmSimdArray: Deref { /// Load complex numbers from the array to fill a WASM SIMD vector. unsafe fn load_complex(&self, index: usize) -> T::VectorType; /// Load a single complex number from the array into a WASM SIMD vector, setting the unused elements to zero. unsafe fn load_partial1_complex(&self, index: usize) -> T::VectorType; /// Load a single complex number from the array, and copy it to all elements of a WASM SIMD vector. unsafe fn load1_complex(&self, index: usize) -> T::VectorType; } impl WasmSimdArray for &[Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); v128_load(self.as_ptr().add(index) as *const v128) } #[inline(always)] unsafe fn load_partial1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); v128_load64_lane::<0>(f32x4_splat(0.0), self.as_ptr().add(index) as *const u64) } #[inline(always)] unsafe fn load1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); v128_load64_splat(self.as_ptr().add(index) as *const u64) } } impl WasmSimdArray for &mut [Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); v128_load(self.as_ptr().add(index) as *const v128) } #[inline(always)] unsafe fn load_partial1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); v128_load64_lane::<0>(f32x4_splat(0.0), self.as_ptr().add(index) as *const u64) } #[inline(always)] unsafe fn load1_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + 1); v128_load64_splat(self.as_ptr().add(index) as *const u64) } } impl WasmSimdArray for &[Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); v128_load(self.as_ptr().add(index) as *const v128) } #[inline(always)] unsafe fn load_partial1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } #[inline(always)] unsafe fn load1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } } impl WasmSimdArray for &mut [Complex] { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> ::VectorType { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); v128_load(self.as_ptr().add(index) as *const v128) } #[inline(always)] unsafe fn load_partial1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } #[inline(always)] unsafe fn load1_complex(&self, _index: usize) -> ::VectorType { unimplemented!("Impossible to do a partial load of complex f64's"); } } impl<'a, T: WasmSimdNum> WasmSimdArray for DoubleBuf<'a, T> where &'a [Complex]: WasmSimdArray, { #[inline(always)] unsafe fn load_complex(&self, index: usize) -> T::VectorType { self.input.load_complex(index) } #[inline(always)] unsafe fn load_partial1_complex(&self, index: usize) -> T::VectorType { self.input.load_partial1_complex(index) } #[inline(always)] unsafe fn load1_complex(&self, index: usize) -> T::VectorType { self.input.load1_complex(index) } } /// A trait to handle writing to an array of complex floats from WASM SIMD vectors. 
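/// (Added example, not in the original docs: for a `&mut [Complex<f32>]`, `store_complex(v, 4)`
/// writes the two complex values held in `v` to elements 4 and 5 of the slice.)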
/// WASM works with 128-bit vectors, meaning a vector can hold two complex f32, /// or a single complex f64. pub trait WasmSimdArrayMut: WasmSimdArray + DerefMut { /// Store all complex numbers from a WASM SIMD vector to the array. unsafe fn store_complex(&mut self, vector: T::VectorType, index: usize); /// Store the low complex number from a WASM SIMD vector to the array. unsafe fn store_partial_lo_complex(&mut self, vector: T::VectorType, index: usize); /// Store the high complex number from a WASM SIMD vector to the array. unsafe fn store_partial_hi_complex(&mut self, vector: T::VectorType, index: usize); } impl WasmSimdArrayMut for &mut [Complex] { #[inline(always)] unsafe fn store_complex(&mut self, vector: ::VectorType, index: usize) { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); v128_store(self.as_mut_ptr().add(index) as *mut v128, vector); } #[inline(always)] unsafe fn store_partial_hi_complex( &mut self, vector: ::VectorType, index: usize, ) { debug_assert!(self.len() >= index + 1); v128_store64_lane::<1>(vector, self.as_mut_ptr().add(index) as *mut u64); } #[inline(always)] unsafe fn store_partial_lo_complex( &mut self, vector: ::VectorType, index: usize, ) { debug_assert!(self.len() >= index + 1); v128_store64_lane::<0>(vector, self.as_mut_ptr().add(index) as *mut u64); } } impl WasmSimdArrayMut for &mut [Complex] { #[inline(always)] unsafe fn store_complex(&mut self, vector: ::VectorType, index: usize) { debug_assert!(self.len() >= index + ::COMPLEX_PER_VECTOR); v128_store(self.as_mut_ptr().add(index) as *mut v128, vector); } #[inline(always)] unsafe fn store_partial_hi_complex( &mut self, _vector: ::VectorType, _index: usize, ) { unimplemented!("Impossible to do a partial store of complex f64's"); } #[inline(always)] unsafe fn store_partial_lo_complex( &mut self, _vector: ::VectorType, _index: usize, ) { unimplemented!("Impossible to do a partial store of complex f64's"); } } impl<'a, T: WasmSimdNum> WasmSimdArrayMut for DoubleBuf<'a, T> where Self: WasmSimdArray, &'a mut [Complex]: WasmSimdArrayMut, { #[inline(always)] unsafe fn store_complex(&mut self, vector: T::VectorType, index: usize) { self.output.store_complex(vector, index); } #[inline(always)] unsafe fn store_partial_hi_complex(&mut self, vector: T::VectorType, index: usize) { self.output.store_partial_hi_complex(vector, index); } #[inline(always)] unsafe fn store_partial_lo_complex(&mut self, vector: T::VectorType, index: usize) { self.output.store_partial_lo_complex(vector, index); } } #[cfg(test)] mod unit_tests { use super::*; use num_complex::Complex; use wasm_bindgen_test::wasm_bindgen_test; #[wasm_bindgen_test] fn test_load_f64() { unsafe { let val1: Complex = Complex::new(1.0, 2.0); let val2: Complex = Complex::new(3.0, 4.0); let val3: Complex = Complex::new(5.0, 6.0); let val4: Complex = Complex::new(7.0, 8.0); let values = vec![val1, val2, val3, val4]; let slice = values.as_slice(); let load1 = slice.load_complex(0); let load2 = slice.load_complex(1); let load3 = slice.load_complex(2); let load4 = slice.load_complex(3); assert_eq!(val1, std::mem::transmute::>(load1)); assert_eq!(val2, std::mem::transmute::>(load2)); assert_eq!(val3, std::mem::transmute::>(load3)); assert_eq!(val4, std::mem::transmute::>(load4)); } } #[wasm_bindgen_test] fn test_store_f64() { unsafe { let val1: Complex = Complex::new(1.0, 2.0); let val2: Complex = Complex::new(3.0, 4.0); let val3: Complex = Complex::new(5.0, 6.0); let val4: Complex = Complex::new(7.0, 8.0); let nbr1 = v128_load(&val1 as *const _ as *const 
v128); let nbr2 = v128_load(&val2 as *const _ as *const v128); let nbr3 = v128_load(&val3 as *const _ as *const v128); let nbr4 = v128_load(&val4 as *const _ as *const v128); let mut values: Vec> = vec![Complex::new(0.0, 0.0); 4]; let mut slice = values.as_mut_slice(); slice.store_complex(nbr1, 0); slice.store_complex(nbr2, 1); slice.store_complex(nbr3, 2); slice.store_complex(nbr4, 3); assert_eq!(val1, values[0]); assert_eq!(val2, values[1]); assert_eq!(val3, values[2]); assert_eq!(val4, values[3]); } } } rustfft-6.2.0/tests/accuracy.rs000064400000000000000000000133600072674642500146630ustar 00000000000000//! To test the accuracy of our FFT algorithm, we first test that our //! naive Dft function is correct by comparing its output against several //! known signal/spectrum relationships. Then, we generate random signals //! for a variety of lengths, and test that our FFT algorithm matches our //! Dft calculation for those signals. use std::sync::Arc; use num_traits::Float; use rustfft::{ algorithm::{BluesteinsAlgorithm, Radix4}, num_complex::Complex, Fft, FftNum, FftPlanner, }; use rustfft::{num_traits::Zero, FftDirection}; use rand::distributions::{uniform::SampleUniform, Distribution, Uniform}; use rand::{rngs::StdRng, SeedableRng}; /// The seed for the random number generator used to generate /// random signals. It's defined here so that we have deterministic /// tests const RNG_SEED: [u8; 32] = [ 1, 9, 1, 0, 1, 1, 4, 3, 1, 4, 9, 8, 4, 1, 4, 8, 2, 8, 1, 2, 2, 2, 6, 1, 2, 3, 4, 5, 6, 7, 8, 9, ]; /// Returns true if the mean difference in the elements of the two vectors /// is small fn compare_vectors(vec1: &[Complex], vec2: &[Complex]) -> bool { assert_eq!(vec1.len(), vec2.len()); let mut error = T::zero(); for (&a, &b) in vec1.iter().zip(vec2.iter()) { error = error + (a - b).norm(); } return (error / T::from_usize(vec1.len()).unwrap()) < T::from_f32(0.1).unwrap(); } fn fft_matches_control(control: Arc>, input: &[Complex]) -> bool { let mut control_input = input.to_vec(); let mut test_input = input.to_vec(); let mut planner = FftPlanner::new(); let fft = planner.plan_fft(control.len(), control.fft_direction()); assert_eq!( fft.len(), control.len(), "FFTplanner created FFT of wrong length" ); assert_eq!( fft.fft_direction(), control.fft_direction(), "FFTplanner created FFT of wrong direction" ); let scratch_max = std::cmp::max( control.get_inplace_scratch_len(), fft.get_inplace_scratch_len(), ); let mut scratch = vec![Zero::zero(); scratch_max]; control.process_with_scratch(&mut control_input, &mut scratch); fft.process_with_scratch(&mut test_input, &mut scratch); return compare_vectors(&test_input, &control_input); } fn random_signal(length: usize) -> Vec> { let mut sig = Vec::with_capacity(length); let dist: Uniform = Uniform::new(T::zero(), T::from_f64(10.0).unwrap()); let mut rng: StdRng = SeedableRng::from_seed(RNG_SEED); for _ in 0..length { sig.push(Complex { re: (dist.sample(&mut rng)), im: (dist.sample(&mut rng)), }); } return sig; } // A cache that makes setup for integration tests faster struct ControlCache { fft_cache: Vec>>, } impl ControlCache { pub fn new(max_outer_len: usize, direction: FftDirection) -> Self { let max_inner_len = (max_outer_len * 2 - 1).checked_next_power_of_two().unwrap(); let max_power = max_inner_len.trailing_zeros() as usize; Self { fft_cache: (0..=max_power) .map(|i| { let len = 1 << i; Arc::new(Radix4::new(len, direction)) as Arc> }) .collect(), } } pub fn plan_fft(&self, len: usize) -> Arc> { let inner_fft_len = (len * 2 - 
1).checked_next_power_of_two().unwrap(); let inner_fft_index = inner_fft_len.trailing_zeros() as usize; let inner_fft = Arc::clone(&self.fft_cache[inner_fft_index]); Arc::new(BluesteinsAlgorithm::new(len, inner_fft)) } } const TEST_MAX: usize = 1001; /// Integration tests that verify our FFT output matches the direct Dft calculation /// for random signals. #[test] fn test_planned_fft_forward_f32() { let direction = FftDirection::Forward; let cache: ControlCache = ControlCache::new(TEST_MAX, direction); for len in 1..TEST_MAX { let control = cache.plan_fft(len); assert_eq!(control.len(), len); assert_eq!(control.fft_direction(), direction); let signal = random_signal(len); assert!(fft_matches_control(control, &signal), "length = {}", len); } } #[test] fn test_planned_fft_inverse_f32() { let direction = FftDirection::Inverse; let cache: ControlCache = ControlCache::new(TEST_MAX, direction); for len in 1..TEST_MAX { let control = cache.plan_fft(len); assert_eq!(control.len(), len); assert_eq!(control.fft_direction(), direction); let signal = random_signal(len); assert!(fft_matches_control(control, &signal), "length = {}", len); } } #[test] fn test_planned_fft_forward_f64() { let direction = FftDirection::Forward; let cache: ControlCache = ControlCache::new(TEST_MAX, direction); for len in 1..TEST_MAX { let control = cache.plan_fft(len); assert_eq!(control.len(), len); assert_eq!(control.fft_direction(), direction); let signal = random_signal(len); assert!(fft_matches_control(control, &signal), "length = {}", len); } } #[test] fn test_planned_fft_inverse_f64() { let direction = FftDirection::Inverse; let cache: ControlCache = ControlCache::new(TEST_MAX, direction); for len in 1..TEST_MAX { let control = cache.plan_fft(len); assert_eq!(control.len(), len); assert_eq!(control.fft_direction(), direction); let signal = random_signal(len); assert!(fft_matches_control(control, &signal), "length = {}", len); } } rustfft-6.2.0/tools/gen_sse_butterflies.py000064400000000000000000000155070072674642500171330ustar 00000000000000# A simple Python script to generate the code for odd-sized optimized DFTs # The generated code is simply printed in the terminal. # This is only intended for prime lengths, where the usual tricks can't be used. # The generated code is O(n^2), but for short lengths this is still faster than fancier algorithms. 
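# (Added note, inferred from the __main__ block below: this SSE variant is invoked the same way as
#  genbutterflies.py, e.g. `python gen_sse_butterflies.py 5` for an odd prime length, and prints the
#  load/store shuffle code plus the direct butterfly bodies for both f32 and f64.)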
# Example, make a length 5 Dft: # > python genbutterflies.py 5 # Output: # let x14p = *buffer.get_unchecked(1) + *buffer.get_unchecked(4); # let x14n = *buffer.get_unchecked(1) - *buffer.get_unchecked(4); # let x23p = *buffer.get_unchecked(2) + *buffer.get_unchecked(3); # let x23n = *buffer.get_unchecked(2) - *buffer.get_unchecked(3); # let sum = *buffer.get_unchecked(0) + x14p + x23p; # let b14re_a = buffer.get_unchecked(0).re + self.twiddle1.re*x14p.re + self.twiddle2.re*x23p.re; # let b14re_b = self.twiddle1.im*x14n.im + self.twiddle2.im*x23n.im; # let b23re_a = buffer.get_unchecked(0).re + self.twiddle2.re*x14p.re + self.twiddle1.re*x23p.re; # let b23re_b = self.twiddle2.im*x14n.im + -self.twiddle1.im*x23n.im; # # let b14im_a = buffer.get_unchecked(0).im + self.twiddle1.re*x14p.im + self.twiddle2.re*x23p.im; # let b14im_b = self.twiddle1.im*x14n.re + self.twiddle2.im*x23n.re; # let b23im_a = buffer.get_unchecked(0).im + self.twiddle2.re*x14p.im + self.twiddle1.re*x23p.im; # let b23im_b = self.twiddle2.im*x14n.re + -self.twiddle1.im*x23n.re; # # let out1re = b14re_a - b14re_b; # let out1im = b14im_a + b14im_b; # let out2re = b23re_a - b23re_b; # let out2im = b23im_a + b23im_b; # let out3re = b23re_a + b23re_b; # let out3im = b23im_a - b23im_b; # let out4re = b14re_a + b14re_b; # let out4im = b14im_a - b14im_b; # *buffer.get_unchecked_mut(0) = sum; # *buffer.get_unchecked_mut(1) = Complex{ re: out1re, im: out1im }; # *buffer.get_unchecked_mut(2) = Complex{ re: out2re, im: out2im }; # *buffer.get_unchecked_mut(3) = Complex{ re: out3re, im: out3im }; # *buffer.get_unchecked_mut(4) = Complex{ re: out4re, im: out4im }; # # # This required the Butterfly5 to already exist, with twiddles defined like this: # pub struct Butterfly5 { # twiddle1: Complex, # twiddle2: Complex, # direction: FftDirection, # } # # With twiddle values: # twiddle1: Complex = twiddles::single_twiddle(1, 5, direction); # twiddle2: Complex = twiddles::single_twiddle(2, 5, direction); import sys def make_shuffling_single_f64(len): inputs = ", ".join([str(n) for n in range(len)]) print(f"let values = read_complex_to_array!(buffer, {{{inputs}}});") print("") print("let out = self.perform_fft_direct(values);") print("") print(f"write_complex_to_array!(out, buffer, {{{inputs}}});") def make_shuffling_single_f32(len): inputs = ", ".join([str(n) for n in range(len)]) print(f"let values = read_partial1_complex_to_array!(buffer, {{{inputs}}});") print("") print("let out = self.perform_parallel_fft_direct(values);") print("") print(f"write_partial_lo_complex_to_array!(out, buffer, {{{inputs}}});") def make_shuffling_parallel_f32(len): inputs = ", ".join([str(2*n) for n in range(len)]) outputs = ", ".join([str(n) for n in range(len)]) print(f"let input_packed = read_complex_to_array!(buffer, {{{inputs}}});") print("") print("let values = [") for n in range(int(len/2)): print(f" extract_lo_hi_f32(input_packed[{int(n)}], input_packed[{int(len/2 + n)}]),") print(f" extract_hi_lo_f32(input_packed[{int(n)}], input_packed[{int(len/2 + n+1)}]),") print(f" extract_lo_hi_f32(input_packed[{int(len/2)}], input_packed[{int(len-1)}]),") print("];") print("") print("let out = self.perform_parallel_fft_direct(values);") print("") print("let out_packed = [") for n in range(int(len/2)): print(f" extract_lo_lo_f32(out[{int(2*n)}], out[{int(2*n+1)}]),") print(f" extract_lo_hi_f32(out[{int(len-1)}], out[0]),") for n in range(int(len/2)): print(f" extract_hi_hi_f32(out[{int(2*n+1)}], out[{int(2*n+2)}]),") print("];") print("") 
print(f"write_complex_to_array_strided!(out_packed, buffer, 2, {{{outputs}}});") def make_butterfly(len, fft2func, calcfunc, mulfunc, rotatefunc): halflen = int((fftlen+1)/2) for n in range(1, halflen): print(f"let [x{n}p{fftlen-n}, x{n}m{fftlen-n}] = {fft2func}(values[{n}], values[{fftlen-n}]);") print("") items = [] for m in range (1, halflen): for n in range(1, halflen): mn = (m*n)%fftlen if mn > fftlen/2: mn = fftlen-mn print(f"let t_a{m}_{n} = {mulfunc}(self.twiddle{mn}re, x{n}p{fftlen-n});") print("") items = [] for m in range (1, halflen): for n in range(1, halflen): mn = (m*n)%fftlen if mn > fftlen/2: mn = fftlen-mn print(f"let t_b{m}_{n} = {mulfunc}(self.twiddle{mn}im, x{n}m{fftlen-n});") print("") print("let x0 = values[0];") for m in range(1, halflen): items = ["x0"] for n in range(1, halflen): items.append(f"t_a{m}_{n}") terms = " + ".join(items) print(f'let t_a{m} = {calcfunc}({terms});') print("") for m in range(1, halflen): terms = f"t_b{m}_1" for n in range(2, halflen): mn = (m*n)%fftlen if mn > fftlen/2: sign = " - " else: sign = " + " terms = terms + sign + f"t_b{m}_{n}" print(f'let t_b{m} = {calcfunc}({terms});') print("") for m in range(1, halflen): print(f'let t_b{m}_rot = self.rotate.{rotatefunc}(t_b{m});') print("") items = ["x0"] for n in range(1, halflen): items.append(f"x{n}p{fftlen-n}") terms = " + ".join(items) print(f'let y0 = {calcfunc}({terms});') for m in range(1, halflen): print(f"let [y{m}, y{fftlen-m}] = {fft2func}(t_a{m}, t_b{m}_rot);") items = [] for n in range(0, fftlen): items.append(f"y{n}") print(f'[{", ".join(items)}]') if __name__ == "__main__": fftlen = int(sys.argv[1]) print("\n\n--------------- f32 ---------------") print("\n ----- perform_fft_contiguous -----") make_shuffling_single_f32(fftlen) print("\n ----- perform_parallel_fft_contiguous -----") make_shuffling_parallel_f32(fftlen) print("\n ----- perform_parallel_fft_direct -----") make_butterfly(fftlen, "parallel_fft2_interleaved_f32", "calc_f32!", "_mm_mul_ps", "rotate_both") print("\n\n--------------- f64 ---------------") print("\n ----- perform_fft_contiguous -----") make_shuffling_single_f64(fftlen) print("\n ----- perform_parallel_fft_direct -----") make_butterfly(fftlen, "solo_fft2_f64", "calc_f64!", "_mm_mul_pd", "rotate") rustfft-6.2.0/tools/genbutterflies.py000064400000000000000000000105710072674642500161160ustar 00000000000000# A simple Python script to generate the code for odd-sized optimized DFTs # The generated code is simply printed in the terminal. # This is only intended for prime lengths, where the usual tricks can't be used. # The generated code is O(n^2), but for short lengths this is still faster than fancier algorithms. 
# Example, make a length 5 Dft: # > python genbutterflies.py 5 # Output: # let x14p = *buffer.get_unchecked(1) + *buffer.get_unchecked(4); # let x14n = *buffer.get_unchecked(1) - *buffer.get_unchecked(4); # let x23p = *buffer.get_unchecked(2) + *buffer.get_unchecked(3); # let x23n = *buffer.get_unchecked(2) - *buffer.get_unchecked(3); # let sum = *buffer.get_unchecked(0) + x14p + x23p; # let b14re_a = buffer.get_unchecked(0).re + self.twiddle1.re*x14p.re + self.twiddle2.re*x23p.re; # let b14re_b = self.twiddle1.im*x14n.im + self.twiddle2.im*x23n.im; # let b23re_a = buffer.get_unchecked(0).re + self.twiddle2.re*x14p.re + self.twiddle1.re*x23p.re; # let b23re_b = self.twiddle2.im*x14n.im + -self.twiddle1.im*x23n.im; # # let b14im_a = buffer.get_unchecked(0).im + self.twiddle1.re*x14p.im + self.twiddle2.re*x23p.im; # let b14im_b = self.twiddle1.im*x14n.re + self.twiddle2.im*x23n.re; # let b23im_a = buffer.get_unchecked(0).im + self.twiddle2.re*x14p.im + self.twiddle1.re*x23p.im; # let b23im_b = self.twiddle2.im*x14n.re + -self.twiddle1.im*x23n.re; # # let out1re = b14re_a - b14re_b; # let out1im = b14im_a + b14im_b; # let out2re = b23re_a - b23re_b; # let out2im = b23im_a + b23im_b; # let out3re = b23re_a + b23re_b; # let out3im = b23im_a - b23im_b; # let out4re = b14re_a + b14re_b; # let out4im = b14im_a - b14im_b; # *buffer.get_unchecked_mut(0) = sum; # *buffer.get_unchecked_mut(1) = Complex{ re: out1re, im: out1im }; # *buffer.get_unchecked_mut(2) = Complex{ re: out2re, im: out2im }; # *buffer.get_unchecked_mut(3) = Complex{ re: out3re, im: out3im }; # *buffer.get_unchecked_mut(4) = Complex{ re: out4re, im: out4im }; # # # This required the Butterfly5 to already exist, with twiddles defined like this: # pub struct Butterfly5 { # twiddle1: Complex, # twiddle2: Complex, # direction: FftDirection, # } # # With twiddle values: # twiddle1: Complex = twiddles::single_twiddle(1, 5, direction); # twiddle2: Complex = twiddles::single_twiddle(2, 5, direction); import sys len = int(sys.argv[1]) halflen = int((len+1)/2) for n in range(1, halflen): print(f"let x{n}{len-n}p = buffer.load({n}) + buffer.load({len-n});") print(f"let x{n}{len-n}n = buffer.load({n}) - buffer.load({len-n});") row = ["let sum = buffer.load(0)"] for n in range(1, halflen): row.append(f"x{n}{len-n}p") print(" + ".join(row) + ";") for n in range(1, halflen): row = [f"let b{n}{len-n}re_a = buffer.load(0).re"] for m in range(1, halflen): mn = (m*n)%len if mn > len/2: mn = len-mn row.append(f"self.twiddle{mn}.re*x{m}{len-m}p.re") print(" + ".join(row) + ";") row = [] for m in range(1, halflen): mn = (m*n)%len if mn > len/2: mn = len-mn row.append(f"-self.twiddle{mn}.im*x{m}{len-m}n.im") else: row.append(f"self.twiddle{mn}.im*x{m}{len-m}n.im") print(f"let b{n}{len-n}re_b = " + " + ".join(row) + ";") print("") for n in range(1, halflen): row = [f"let b{n}{len-n}im_a = buffer.load(0).im"] for m in range(1, halflen): mn = (m*n)%len if mn > len/2: mn = len-mn row.append(f"self.twiddle{mn}.re*x{m}{len-m}p.im") print(" + ".join(row) + ";") row = [] for m in range(1, halflen): mn = (m*n)%len if mn > len/2: mn = len-mn row.append(f"-self.twiddle{mn}.im*x{m}{len-m}n.re") else: row.append(f"self.twiddle{mn}.im*x{m}{len-m}n.re") print(f"let b{n}{len-n}im_b = " + " + ".join(row) + ";") print("") for n in range(1,len): nfold = n sign_re = "-" sign_im = "+" if n > len/2: nfold = len-n sign_re = "+" sign_im = "-" print(f"let out{n}re = b{nfold}{len-nfold}re_a {sign_re} b{nfold}{len-nfold}re_b;") print(f"let out{n}im = b{nfold}{len-nfold}im_a 
{sign_im} b{nfold}{len-nfold}im_b;") print("buffer.store(sum, 0);") for n in range(1,len): print(f"buffer.store(Complex{{ re: out{n}re, im: out{n}im }}, {n})")rustfft-6.2.0/tools/p2comparison.py000064400000000000000000000033170072674642500155100ustar 00000000000000import sys import math from matplotlib import pyplot as plt with open(sys.argv[1]) as f: lines = f.readlines() results = {"f32": {"scalar": {}, "sse": {}, "avx":{}}, "f64": {"scalar": {}, "sse": {}, "avx":{}}} for line in lines: if line.startswith("test ") and not line.startswith("test result"): name, result = line.split("... bench:") name = name.split()[1] _, length, ftype, algo = name.split("_") value = float(result.strip().split(" ")[0].replace(",", "")) results[ftype][algo][float(length)] = value lengths = sorted(list(results["f32"]["scalar"].keys())) scalar_32 = [] avx_32 = [] sse_32 = [] for l in lengths: sc32 = results["f32"]["scalar"][l] av32 = results["f32"]["avx"][l] ss32 = results["f32"]["sse"][l] scalar_32.append(100.0) sse_32.append(100.0 * sc32/ss32) avx_32.append(100.0 * sc32/av32) scalar_64 = [] avx_64 = [] sse_64 = [] for l in lengths: sc64 = results["f64"]["scalar"][l] av64 = results["f64"]["avx"][l] ss64 = results["f64"]["sse"][l] scalar_64.append(100.0) sse_64.append(100.0 * sc64/ss64) avx_64.append(100.0 * sc64/av64) lengths = [math.log(l, 2) for l in lengths] plt.figure() plt.plot(lengths, scalar_64, lengths, sse_64, lengths, avx_64) plt.title("f64") plt.ylabel("relative speed, %") plt.xlabel("log2(length)") plt.xticks(list(range(4,23))) plt.grid() plt.legend(["scalar", "sse", "avx"]) plt.figure() plt.plot(lengths, scalar_32, lengths, sse_32, lengths, avx_32) plt.title("f32") plt.ylabel("relative speed, %") plt.xlabel("log2(length)") plt.legend(["scalar", "sse", "avx"]) plt.xticks(list(range(4,23))) plt.grid() plt.show() rustfft-6.2.0/tools/p2comparison_neon.py000064400000000000000000000027470072674642500165350ustar 00000000000000import sys import math from matplotlib import pyplot as plt with open(sys.argv[1]) as f: lines = f.readlines() results = {"f32": {"scalar": {}, "neon": {}}, "f64": {"scalar": {}, "neon": {}}} for line in lines: if line.startswith("test ") and not line.startswith("test result"): name, result = line.split("... 
bench:") name = name.split()[1] _, length, ftype, algo = name.split("_") value = float(result.strip().split(" ")[0].replace(",", "")) results[ftype][algo][float(length)] = value lengths = sorted(list(results["f32"]["scalar"].keys())) scalar_32 = [] neon_32 = [] for l in lengths: sc32 = results["f32"]["scalar"][l] nn32 = results["f32"]["neon"][l] scalar_32.append(100.0) neon_32.append(100.0 * sc32/nn32) scalar_64 = [] neon_64 = [] for l in lengths: sc64 = results["f64"]["scalar"][l] nn64 = results["f64"]["neon"][l] scalar_64.append(100.0) neon_64.append(100.0 * sc64/nn64) lengths = [math.log(l, 2) for l in lengths] plt.figure() plt.plot(lengths, scalar_64, lengths, neon_64) plt.title("f64") plt.ylabel("relative speed, %") plt.xlabel("log2(length)") plt.xticks(list(range(4,23))) plt.grid() plt.legend(["scalar", "neon"]) plt.figure() plt.plot(lengths, scalar_32, lengths, neon_32) plt.title("f32") plt.ylabel("relative speed, %") plt.xlabel("log2(length)") plt.legend(["scalar", "neon"]) plt.xticks(list(range(4,23))) plt.grid() plt.show() rustfft-6.2.0/tools/p2comparison_wasm_simd.py000064400000000000000000000030270072674642500175510ustar 00000000000000import sys import math from matplotlib import pyplot as plt with open(sys.argv[1]) as f: lines = f.readlines() results = {"f32": {"scalar": {}, "wasmsimd": {}}, "f64": {"scalar": {}, "wasmsimd": {}}} for line in lines: if line.startswith("test ") and not line.startswith("test result"): name, result = line.split("... bench:") name = name.split()[1] _, length, ftype, algo = name.split("_") value = float(result.strip().split(" ")[0].replace(",", "")) results[ftype][algo][float(length)] = value lengths = sorted(list(results["f32"]["scalar"].keys())) scalar_32 = [] wasmsimd_32 = [] for l in lengths: sc32 = results["f32"]["scalar"][l] nn32 = results["f32"]["wasmsimd"][l] scalar_32.append(100.0) wasmsimd_32.append(100.0 * sc32/nn32) scalar_64 = [] wasmsimd_64 = [] for l in lengths: sc64 = results["f64"]["scalar"][l] nn64 = results["f64"]["wasmsimd"][l] scalar_64.append(100.0) wasmsimd_64.append(100.0 * sc64/nn64) lengths = [math.log(l, 2) for l in lengths] plt.figure() plt.plot(lengths, scalar_64, lengths, wasmsimd_64) plt.title("f64") plt.ylabel("relative speed, %") plt.xlabel("log2(length)") plt.xticks(list(range(4,23))) plt.grid() plt.legend(["scalar", "wasmsimd"]) plt.figure() plt.plot(lengths, scalar_32, lengths, wasmsimd_32) plt.title("f32") plt.ylabel("relative speed, %") plt.xlabel("log2(length)") plt.legend(["scalar", "wasmsimd"]) plt.xticks(list(range(4,23))) plt.grid() plt.show()