aboutsummaryrefslogtreecommitdiff
path: root/ARMeilleure/CodeGen
AgeCommit message (Collapse)Author
2021-10-18Add an early `TailMerge` pass (#2721)FICTURE7
* Add an early `TailMerge` pass Some translations can have a lot of guest calls and since for each guest call there is a call guard which may return. This can produce a lot of epilogue code for returns. This pass merges the epilogue into a single block. ``` Using filter 'hcq'. Using metric 'code size'. Total diff: -1648111 (-7.19 %) (bytes): Base: 22913847 Diff: 21265736 Improved: 4567, regressed: 14, unchanged: 144 ``` * Set PTC version * Address feedback * Handle `void` returning functions * Actually handle `void` returning functions * Fix `RegisterToLocal` logging
2021-10-08Optimize LSRA (#2563)FICTURE7
* Optimize `TryAllocateRegWithtoutSpill` a bit * Add a fast path for when all registers are live. * Do not query `GetOverlapPosition` if the register is already in use (i.e: free position is 0). * Do not allocate child split list if not parent * Turn `LiveRange` into a reference struct `LiveRange` is now a reference wrapping struct like `Operand` and `Operation`. It has also been changed into a singly linked-list. In micro-benchmarks traversing the linked-list was faster than binary search on `List<T>`. Even for quite large input sizes (e.g: 1,000,000), surprisingly. Could be because the code gen for traversing the linked-list is much much cleaner and there is no virtual dispatch happening when checking if intervals overlaps. * Turn `LiveInterval` into an iterator The LSRA allocates in forward order and never inspect previous `LiveInterval` once they are expired. Something similar can be done for the `LiveRange`s within the `LiveInterval`s themselves. The `LiveInterval` is turned into a iterator which expires `LiveRange` within it. The iterator is moved forward along with interval walking code, i.e: AllocateInterval(context, interval, cIndex). * Remove `LinearScanAllocator.Sources` Local methods are less susceptible to do allocations than lambdas. * Optimize `GetOverlapPosition(interval)` a bit Time complexity should be in O(n+m) instead of O(nm) now. * Optimize `NumberLocals` a bit Use the same idea as in `HybridAllocator` to store the visited state in the MSB of the Operand's value instead of using a `HashSet<T>`. * Optimize `InsertSplitCopies` a bit Avoid allocating a redundant `CopyResolver`. * Optimize `InsertSplitCopiesAtEdges` a bit Avoid redundant allocations of `CopyResolver`. * Use stack allocation for `freePositions` Avoid redundant computations. * Add `UseList` Replace `SortedIntegerList` with an even more specialized data structure. It allocates memory on the arena allocators and does not require copying use positions when splitting it. * Turn `LiveInterval` into a reference struct `LiveInterval` is now a reference wrapping struct like `Operand` and `Operation`. The rationale behind turning this in a reference wrapping struct is because a `LiveInterval` is associated with each local variable, and these intervals may themselves be split further. I've seen translations having up to 8000 local variables. To make the `LiveInterval` unmanaged, a new data structure called `LiveIntervalList` was added to store child splits. This differs from `SortedList<,>` because it can contain intervals with the same start position. Really wished we got some more of C++ template in C#. :^( * Optimize `GetChildSplit` a bit No need to inspect the remaining ranges if we've reached a range which starts after position, since the split list is ordered. * Optimize `CopyResolver` a bit Lazily allocate the fill, spill and parallel copy structures since most of the time only one of them is needed. * Optimize `BitMap.Enumerator` a bit Marking `MoveNext` as `AggressiveInlining` allows RyuJIT to promote the `Enumerator` struct into registers completely, reducing load/store code a lot since it does not have to store the struct on the stack for ABI purposes. * Use stack allocation for `use/blockedPositions` * Optimize `AllocateWithSpill` a bit * Address feedback * Make `LiveInterval.AddRange(,)` more conservative Produces no diff against master, but just for good measure.
2021-10-05Add `Operand.Label` support to `Assembler` (#2680)FICTURE7
* Add `Operand.Label` support to `Assembler` This adds label support to `Assembler` and enables branch tightening when compiling with relocatables. Jump management and patching has been moved to the `Assembler`. * Move instruction table to `Assembler.Table` * Set PTC internal version * Rename `Assembler.Table` to `AssemblerTable`
2021-09-29Optimize `HybridAllocator` (#2637)FICTURE7
* Store constant `Operand`s in the `LocalInfo` Since the spill slot and register assigned is fixed, we can just store the `Operand` reference in the `LocalInfo` struct. This allows skipping hitting the intern-table for a look up. * Skip `Uses`/`Assignments` management Since the `HybridAllocator` is the last pass and we do not care about uses/assignments we can skip managing that when setting destinations or sources. * Make `GetLocalInfo` inlineable Also fix a possible issue where with numbered locals. See or-assignment operator in `SetVisited(local)` before patch. * Do not run `BlockPlacement` in LCQ With the host mapped memory manager, there is a lot less cold code to split from hot code. So disabling this in LCQ gives some extra throughput - where we need it. * Address Mou-Ikkai's feedback * Apply suggestions from code review Co-authored-by: VocalFan <45863583+Mou-Ikkai@users.noreply.github.com> * Move check to an assert Co-authored-by: VocalFan <45863583+Mou-Ikkai@users.noreply.github.com>
2021-09-14Refactor `PtcInfo` (#2625)FICTURE7
* Refactor `PtcInfo` This change reduces the coupling of `PtcInfo` by moving relocation tracking to the backend. `RelocEntry`s remains as `RelocEntry`s through out the pipeline until it actually needs to be written to the PTC streams. Keeping this representation makes inspecting and manipulating relocations after compilations less painful. This is something I needed to do to patch relocations to 0 to diff dumps. Contributes to #1125. * Turn `Symbol` & `RelocInfo` into readonly structs * Add documentation to `CompiledFunction` * Remove `Compiler.Compile<T>` Remove `Compiler.Compile<T>` and replace it by `Map<T>` of the `CompiledFunction` returned.
2021-08-20Fix type mismatch in `BitwiseAnd` simplification (#2571)FICTURE7
* Fix type mismatch in `BitwiseAnd` simplification `TryEliminateBitwiseAnd` would turn the `BitwiseAnd` operation into a copy of the wrong type. E.g: Before `Simplification`: ```llvm i64 %0 = BitwiseAnd i64 0x0, %1 ``` After `Simplication`: ```llvm i64 %0 = Copy i32 0x0 ``` Since the with the changes in #2515, we iterate in reverse order and `Simplication`, `ConstantFolding` does not indicate if it modified the CFG, the second pass to "retype" the copy into the proper destination type does not happen. This also blocked copy propagation since its destination type did not match with its source type. But in the cases I've seen, the `PreAllocator` would insert a copy for the propagated constant, which results in no diffs. Since the copy remained as is, asserts are fired when generating it. * Set PPTC version
2021-08-17Reduce JIT GC allocations (#2515)FICTURE7
* Turn `MemoryOperand` into a struct * Remove `IntrinsicOperation` * Remove `PhiNode` * Remove `Node` * Turn `Operand` into a struct * Turn `Operation` into a struct * Clean up pool management methods * Add `Arena` allocator * Move `OperationHelper` to `Operation.Factory` * Move `OperandHelper` to `Operand.Factory` * Optimize `Operation` a bit * Fix `Arena` initialization * Rename `NativeList<T>` to `ArenaList<T>` * Reduce `Operand` size from 88 to 56 bytes * Reduce `Operation` size from 56 to 40 bytes * Add optimistic interning of Register & Constant operands * Optimize `RegisterUsage` pass a bit * Optimize `RemoveUnusedNodes` pass a bit Iterating in reverse-order allows killing dependency chains in a single pass. * Fix PPTC symbols * Optimize `BasicBlock` a bit Reduce allocations from `_successor` & `DominanceFrontiers` * Fix `Operation` resize * Make `Arena` expandable Change the arena allocator to be expandable by allocating in pages, with some of them being pooled. Currently 32 pages are pooled. An LRU removal mechanism should probably be added to it. Apparently MHR can allocate bitmaps large enough to exceed the 16MB limit for the type. * Move `Arena` & `ArenaList` to `Common` * Remove `ThreadStaticPool` & co * Add `PhiOperation` * Reduce `Operand` size from 56 from 48 bytes * Add linear-probing to `Operand` intern table * Optimize `HybridAllocator` a bit * Add `Allocators` class * Tune `ArenaAllocator` sizes * Add page removal mechanism to `ArenaAllocator` Remove pages which have not been used for more than 5s after each reset. I am on fence if this would be better using a Gen2 callback object like the one in System.Buffers.ArrayPool<T>, to trim the pool. Because right now if a large translation happens, the pages will be freed only after a reset. This reset may not happen for a while because no new translation is hit, but the arena base sizes are rather small. * Fix `OOM` when allocating larger than page size in `ArenaAllocator` Tweak resizing mechanism for Operand.Uses and Assignemnts. * Optimize `Optimizer` a bit * Optimize `Operand.Add<T>/Remove<T>` a bit * Clean up `PreAllocator` * Fix phi insertion order Reduce codegen diffs. * Fix code alignment * Use new heuristics for degree of parallelism * Suppress warnings * Address gdkchan's feedback Renamed `GetValue()` to `GetValueUnsafe()` to make it more clear that `Operand.Value` should usually not be modified directly. * Add fast path to `ArenaAllocator` * Assembly for `ArenaAllocator.Allocate(ulong)`: .L0: mov rax, [rcx+0x18] lea r8, [rax+rdx] cmp r8, [rcx+0x10] ja short .L2 .L1: mov rdx, [rcx+8] add rax, [rdx+8] mov [rcx+0x18], r8 ret .L2: jmp ArenaAllocator.AllocateSlow(UInt64) A few variable/field had to be changed to ulong so that RyuJIT avoids emitting zero-extends. * Implement a new heuristic to free pooled pages. If an arena is used often, it is more likely that its pages will be needed, so the pages are kept for longer (e.g: during PPTC rebuild or burst sof compilations). If is not used often, then it is more likely that its pages will not be needed (e.g: after PPTC rebuild or bursts of compilations). * Address riperiperi's feedback * Use `EqualityComparer<T>` in `IntrusiveList<T>` Avoids a potential GC hole in `Equals(T, T)`.
2021-05-29Add multi-level function table (#2228)FICTURE7
* Add AddressTable<T> * Use AddressTable<T> for dispatch * Remove JumpTable & co. * Add fallback for out of range addresses * Add PPTC support * Add documentation to `AddressTable<T>` * Make AddressTable<T> configurable * Fix table walk * Fix IsMapped check * Remove CountTableCapacity * Add PPTC support for fast path * Rename IsMapped to IsValid * Remove stale comment * Change format of address in exception message * Add TranslatorStubs * Split DispatchStub Avoids recompilation of stubs during tests. * Add hint for 64bit or 32bit * Add documentation to `Symbol` * Add documentation to `TranslatorStubs` Make `TranslatorStubs` disposable as well. * Add documentation to `SymbolType` * Add `AddressTableEventSource` to monitor function table size Add an EventSource which measures the amount of unmanaged bytes allocated by AddressTable<T> instances. dotnet-counters monitor -n Ryujinx --counters ARMeilleure * Add `AllowLcqInFunctionTable` optimization toggle This is to reduce the impact this change has on the test duration. Before everytime a test was ran, the FunctionTable would be initialized and populated so that the newly compiled test would get registered to it. * Implement unmanaged dispatcher Uses the DispatchStub to dispatch into the next translation, which allows execution to stay in unmanaged for longer and skips a ConcurrentDictionary look up when the target translation has been registered to the FunctionTable. * Remove redundant null check * Tune levels of FunctionTable Uses 5 levels instead of 4 and change unit of AddressTableEventSource from KB to MB. * Use 64-bit function table Improves codegen for direct branches: mov qword [rax+0x408],0x10603560 - mov rcx,sub_10603560_OFFSET - mov ecx,[rcx] - mov ecx,ecx - mov rdx,JIT_CACHE_BASE - add rdx,rcx + mov rcx,sub_10603560 + mov rdx,[rcx] mov rcx,rax Improves codegen for dispatch stub: and rax,byte +0x1f - mov eax,[rcx+rax*4] - mov eax,eax - mov rcx,JIT_CACHE_BASE - lea rax,[rcx+rax] + mov rax,[rcx+rax*8] mov rcx,rbx * Remove `JitCacheSymbol` & `JitCache.Offset` * Turn `Translator.Translate` into an instance method We do not have to add more parameter to this method and related ones as new structures are added & needed for translation. * Add symbol only when PTC is enabled Address LDj3SNuD's feedback * Change `NativeContext.Running` to a 32-bit integer * Fix PageTable symbol for host mapped
2021-05-17Allow `LocalVariable` to be assigned more than once (#2288)FICTURE7
* Allow `LocalVariable` to be assigned more than once This allows us to write flow controls like loops and if-elses with LocalVariables participating in phi nodes. * Add `GetLocalNumber` to operand
2021-05-13Fold constant offsets and group constant addresses (#2285)gdkchan
* Fold constant offsets and group constant addresses * PPTC version bump
2021-04-07(CPU) Fix CRC32 instruction when constant values are used as input (#2183)gdkchan
2021-02-23PPTC: Fix unwanted propagation of a relocatable constant in a specific case. ↵LDj3SNuD
(#1990) * Fix unwanted propagation of a relocatable constant in a specific case. * Ptc.InternalVersion = 1990 * Nit to retrigger the Checks.
2021-02-22PPTC & Pool Enhancements. (#1968)LDj3SNuD
* PPTC & Pool Enhancements. * Avoid buffer allocations in CodeGenContext.GetCode(). Avoid stream allocations in PTC.PtcInfo. Refactoring/nits. * Use XXHash128, for Ptc.Load & Ptc.Save, x10 faster than Md5. * Why not a nice Span. * Added a simple PtcFormatter library for deserialization/serialization, which does not require reflection, in use at PtcJumpTable and PtcProfiler; improves maintainability and simplicity/readability of affected code. * Nits. * Revert #1987. * Revert "Revert #1987." This reverts commit 998be765cf7f7da5ff0c1c08de704c9012b0f49c.
2021-02-21Turn Copy into Fill in HybridAllocator (#2010)FICTURE7
* Turn Copy into Fill in HybridAllocator * Set PTC internal verison
2021-02-08Optimization | Modify Add (Integer) Instruction to use LEA instead. (#1971)sharmander
* Optimization | Modify Add Instruction to use LEA instead. Currently, the add instruction requires 4 registers to take place. By using LEA, we can effectively perform the same working using 3 registers, reducing memory spills and improving translation efficiency. * Fix IsSameOperandDestSrc1 Check for Add * Use LEA if Dest != SRC1 * Update IsSameOperandDestSrc1 to account for Cases where Dest and Src1 can be same for add * Fix error in logic * Typo * Add paranthesis for clarity * Compare registers as requested. * Cleanup if statement, use same comparison method as generateCopy * Make change as recommended by gdk * Perform check only when Add calls are made * use ensure sametype for lea, fix else * Update comment * Update version #
2021-01-20CPU (A64): Add Fmaxnmp & Fminnmp Scalar Inst.s, Fast & Slow Paths; with ↵LDj3SNuD
Tests. (#1894)
2020-12-17Fix Vnmls_S fast path (F64: losing input d value). Fix Vnmla_S & Vnmls_S ↵LDj3SNuD
slow paths (using fused inst.s). Fix Vfma_V slow path not using StandardFPSCRValue(). (#1775) * Fix Vnmls_S fast path (F64: losing input d value). Fix Vnmla_S & Vnmls_S slow paths (using fused inst.s). Add Vfma_S & Vfms_S Fma fast paths. Add Vfnma_S inst. with Fma/Sse fast paths and slow path. Add Vfnms_S Sse fast path. Add Tests for affected inst.s. Nits. * InternalVersion = 1775 * Nits. * Fix Vfma_V slow path not using StandardFPSCRValue(). * Nit: Fix Vfma_V order. * Add Vfms_V Sse fast path and slow path. * Add Vfma_V and Vfms_V Test.
2020-12-17PPTC Follow-up. (#1712)LDj3SNuD
* Added support for offline invalidation, via PPTC, of low cq translations replaced by high cq translations; both on a single run and between runs. Added invalidation of .cache files in the event of reuse on a different user operating system. Added .info and .cache files invalidation in case of a failed stream decompression. Nits. * InternalVersion = 1712; * Nits. * Address comment. * Get rid of BinaryFormatter. Nits. * Move Ptc.LoadTranslations(). Nits. * Nits. * Fixed corner cases (in case backup copies have to be used). Added save logs. * Not core fixes. * Complement to the previous commit. Added load logs. Removed BinaryFormatter leftovers. * Add LoadTranslations log. * Nits. * Removed the search and management of LowCq overlapping functions. * Final increment of .info and .cache flags. * Nit. * GetIndirectFunctionAddress(): Validate that writing actually takes place in dynamic table memory range (and not elsewhere). * Fix Ptc.UpdateInfo() due to rebase. * Nit for retrigger Checks. * Nit for retrigger Checks.
2020-12-15CPU: Implement VFMA (Vector) (#1762)sharmander
* Implement VFMA.F64 * Simplify switch * Simplify FMA Instructions into their own IntrinsicType. * Remove whitespace * Fix indentation * Change tests for Vfnms -- disable inf / nan * Move args up, not description ;) * Implementation Complete. All Tests Pass (Slow / Fast Path) * Move location of function in assembler + test updates. * Shift params upwards * Remove unused function * Update PTC version. * Add comments / re-oreder opcode table. * Remove whitespace * Fix nit * Fix nit. * Fix whitespace * Wrong opcode was used by a bad merge. * Addressed rip's comments.
2020-12-14Fix pre-allocator shift instruction copy on a specific case (#1752)gdkchan
* Fix pre-allocator shift instruction copy on a specific case * Fix to make shift use int rather than long for constants
2020-12-07CPU: Implement VFNMA.F32 | F.64 (#1783)sharmander
* Implement VFNMA.F<32/64> * Update PTC Version * Update Implementation & Renames & Correct Order * Fix alignment * Update implementation to not trigger assert * Actually use the intrinsic that makes sense :)
2020-12-07Add support for guest Fz (Fpcr) mode through host Ftz and Daz (Mxcsr) modes ↵LDj3SNuD
(fast paths). (#1630) * Add support for guest Fz (Fpcr) mode through host Ftz and Daz (Mxcsr) modes (fast paths). * Ptc.InternalVersion = 1630 * Nits. * Address comments. * Update Ptc.cs * Address comment.
2020-12-03CPU: Implement VFNMS.F32/64 (#1758)sharmander
* Add necessary methods / op-code * Enable Support for FMA Instruction Set * Add Intrinsics / Assembly Opcodes for VFMSUB231XX. * Add X86 Instructions for VFMSUB231XX * Implement VFNMS * Implement VFNMS Tests * Add special cases for FMA instructions. * Update PPTC Version * Remove unused Op * Move Check into Assert / Cleanup * Rename and cleanup * Whitespace * Whitespace / Rename * Re-sort * Address final requests * Implement VFMA.F64 * Simplify switch * Simplify FMA Instructions into their own IntrinsicType. * Remove whitespace * Fix indentation * Change tests for Vfnms -- disable inf / nan * Move args up, not description ;) * Undo vfma * Completely remove vfms code., * Fix order of instruction in assembler
2020-11-18CPU (A64): Add FP16/FP32 fast paths (F16C Intrinsics) for Fcvt_S, Fcvtl_V & ↵LDj3SNuD
Fcvtn_V Instructions. Now HardwareCapabilities uses CpuId. (#1650) * net5.0 * CPU (A64): Add FP16/FP32 fast paths (F16C Intrinsics) for Fcvt_S, Fcvtl_V & Fcvtn_V Instructions. Switch to .NET 5.0. Nits. Tests performed successfully in both debug and release mode (for all instructions involved). * Address comment. * Update appveyor.yml * Revert "Update appveyor.yml" This reverts commit 27cdd59e8b90e227e6924d9c162af26c00a89013. * Remove Assembler CpuId. * Update appveyor.yml * Address comment.
2020-11-04Fix LiveInterval.Split (#1660)FICTURE7
Before when splitting intervals, the end of the range would be included in the split check, this can produce empty ranges in the child split. This in turn can affect spilling decisions since the child split will have a different start position and this empty range will get a register and move to the active set for a brief moment. For example: A = [153, 172[; [1899, 1916[; [1991, 2010[; [2397, 2414[; ... Split(A, 1916) A0 = [153, 172[; [1899, 1916[ A1 = [1916, 1916[; [1991, 2010[; [2397, 2414[; ...
2020-09-19Implement block placement (#1549)FICTURE7
* Implement block placement Implement a simple pass which re-orders cold blocks at the end of the list of blocks in the CFG. * Set PPTC version * Use Array.Resize Address gdkchan's feedback
2020-09-12Relax block ordering constraints (#1535)FICTURE7
* Relax block ordering constraints Before `block.Next` had to follow `block.ListNext`, now it does not. Instead `CodeGenerator` will now emit the necessary jump instructions to ensure control flow. This makes control flow and block order modifications easier. It also eliminates some simple cases of redundant branches. * Set PPTC version
2020-08-05Improve branch operations (#1442)Ficture Seven
* Add Compare instruction * Add BranchIf instruction * Use test when BranchIf & Compare against 0 * Propagate Compare into BranchIfTrue/False use - Propagate Compare operations into their BranchIfTrue/False use and turn these into a BranchIf. - Clean up Comparison enum. * Replace BranchIfTrue/False with BranchIf * Use BranchIf in EmitPtPointerLoad - Using BranchIf early instead of BranchIfTrue/False improves LCQ and reduces the amount of work needed by the Optimizer. EmitPtPointerLoader was a/the big producer of BranchIfTrue/False. - Fix asserts firing when assembling BitwiseAnd because of type mismatch in EmitStoreExclusive. This is harmless and should not cause any diffs. * Increment PPTC interval version * Improve IRDumper for BranchIf & Compare * Use BranchIf in EmitNativeCall * Clean up * Do not emit test when immediately preceded by and
2020-07-30 Implement inline memory load/store exclusive and ordered (#1413)gdkchan
* Implement inline memory load/store exclusive * Fix missing REX prefix on 8-bits CMPXCHG * Increment PTC version due to bugfix * Remove redundant memory checks * Address PR feedback * Increment PPTC version
2020-07-30Use movd,movq for i32/64 VectorExtract %x, 0x0 (#1439)Ficture Seven
* Use movd,movq for i32/64 VectorExtract %x, 0x0 * Increment PPTC interval version * Use else-if instead - Address gdkchan's feedback. - Clean up Debug.Assert calls * Inline `count` expression into Debug.Assert Apparently the CoreCLR JIT will not eliminate this. :(
2020-07-13Add SSE4.2 Path for CRC32, add A32 variant, add tests for non-castagnoli ↵riperiperi
variants. (#1328) * Add CRC32 A32 instructions. * Fix CRC32 instructions. * Add CRC intrinsic and fast path. Loop is currently unrolled, will look into adding temp vars after tests are added. * Begin work on Crc tests * Fix SSE4.2 path for CRC32C, finialize tests. * Remove unused IR path. * Fix spacing between prefix checks. * This should be Src. * PTC Version * OpCodeTable Order * Integer check improvement. Value and Crc can be either 32 or 64 size. * This wasn't necessary... * If size is 3, value type must be I64. * Fix same src+dest handling for non crc intrinsics. * Pre-fix (ha) issue with vex encodings
2020-07-13Fix folding of ConvertI64ToI32 imm64 (#1383)Ficture Seven
* Fix folding of ConvertI64ToI32 imm64 * Increment PTC internal version * Clean up
2020-07-11Mask shift constants on x86 backend (#1382)gdkchan
* Mask shift constants on x86 backendd * Version bump
2020-07-11Fold ZeroExtend8/16/32 imm32/64 (#1358)Ficture Seven
* Fold ZeroExtend8/16/32 imm32/64 * Increment PTC version
2020-07-11Fold ConvertI64ToI32 imm64 (#1359)Ficture Seven
* Fold ConvertI64ToI32 imm64 * Increment PTC version * Bump PPTC InternalVersion Co-authored-by: jduncanator <1518948+jduncanator@users.noreply.github.com>
2020-07-09Fix PPTC on Windows 7. (#1369)LDj3SNuD
* Fix PPTC on Windows 7. * Address gdkchan comment.
2020-06-16Add Profiled Persistent Translation Cache. (#769)LDj3SNuD
* Delete DelegateTypes.cs * Delete DelegateCache.cs * Add files via upload * Update Horizon.cs * Update Program.cs * Update MainWindow.cs * Update Aot.cs * Update RelocEntry.cs * Update Translator.cs * Update MemoryManager.cs * Update InstEmitMemoryHelper.cs * Update Delegates.cs * Nit. * Nit. * Nit. * 10 fewer MSIL bytes for us * Add comment. Nits. * Update Translator.cs * Update Aot.cs * Nits. * Opt.. * Opt.. * Opt.. * Opt.. * Allow to change compression level. * Update MemoryManager.cs * Update Translator.cs * Manage corner cases during the save phase. Nits. * Update Aot.cs * Translator response tweak for Aot disabled. Nit. * Nit. * Nits. * Create DelegateHelpers.cs * Update Delegates.cs * Nit. * Nit. * Nits. * Fix due to #784. * Fixes due to #757 & #841. * Fix due to #846. * Fix due to #847. * Use MethodInfo for managed method calls. Use IR methods instead of managed methods about Max/Min (S/U). Follow-ups & Nits. * Add missing exception messages. Reintroduce slow path for Fmov_Vi. Implement slow path for Fmov_Si. * Switch to the new folder structure. Nits. * Impl. index-based relocation information. Impl. cache file version field. * Nit. * Address gdkchan comments. Mainly: - fixed cache file corruption issue on exit; - exposed a way to disable AOT on the GUI. * Address AcK77 comment. * Address Thealexbarney, jduncanator & emmauss comments. Header magic, CpuId (FI) & Aot -> Ptc. * Adaptation to the new application reloading system. Improvements to the call system of managed methods. Follow-ups. Nits. * Get the same boot times as on master when PTC is disabled. * Profiled Aot. * A32 support (#897). * #975 support (1 of 2). * #975 support (2 of 2). * Rebase fix & nits. * Some fixes and nits (still one bug left). * One fix & nits. * Tests fix (by gdk) & nits. * Support translations not only in high quality and rejit. Nits. * Added possibility to skip translations and continue execution, using `ESC` key. * Update SettingsWindow.cs * Update GLRenderer.cs * Update Ptc.cs * Disabled Profiled PTC by default as requested in the past by gdk. * Fix rejit bug. Increased number of parallel translations. Add stack unwinding stuffs support (1 of 2). Nits. * Add stack unwinding stuffs support (2 of 2). Tuned number of parallel translations. * Restored the ability to assemble jumps with 8-bit offset when Profiled PTC is disabled or during profiling. Modifications due to rebase. Nits. * Limited profiling of the functions to be translated to the addresses belonging to the range of static objects only. * Nits. * Nits. * Update Delegates.cs * Nit. * Update InstEmitSimdArithmetic.cs * Address riperiperi comments. * Fixed the issue of unjustifiably longer boot times at the second boot than at the first boot, measured at the same time or reference point and with the same number of translated functions. * Implemented a simple redundant load/save mechanism. Halved the value of Decoder.MaxInstsPerFunction more appropriate for the current performance of the Translator. Replaced by Logger.PrintError to Logger.PrintDebug in TexturePool.cs about the supposed invalid texture format to avoid the spawn of the log. Nits. * Nit. Improved Logger.PrintError in TexturePool.cs to avoid log spawn. Added missing code for FZ handling (in output) for fp max/min instructions (slow paths). * Add configuration migration for PTC Co-authored-by: Thog <me@thog.eu>
2020-06-05Faster crc32 implementation (#1294)merry
* Add Pclmulqdq intrinsic * Implement crc32 in terms of pclmulqdq * Address PR comments
2020-05-15Unwinding Follow-up. Fix a bug in JitUnwindWindows where ... (#1238)LDj3SNuD
... in case of "Vector" unwind codes the remaining unwind codes could be corrupted. Nits.
2020-05-13Remove CpuId IR instruction (#1227)gdkchan
2020-05-04Implement a new physical memory manager and replace DeviceMemory (#856)gdkchan
* Implement a new physical memory manager and replace DeviceMemory * Proper generic constraints * Fix debug build * Add memory tests * New CPU memory manager and general code cleanup * Remove host memory management from CPU project, use Ryujinx.Memory instead * Fix tests * Document exceptions on MemoryBlock * Fix leak on unix memory allocation * Proper disposal of some objects on tests * Fix JitCache not being set as initialized * GetRef without checks for 8-bits and 16-bits CAS * Add MemoryBlock destructor * Throw in separate method to improve codegen * Address PR feedback * QueryModified improvements * Fix memory write tracking not marking all pages as modified in some cases * Simplify MarkRegionAsModified * Remove XML doc for ghost param * Add back optimization to avoid useless buffer updates * Add Ryujinx.Cpu project, move MemoryManager there and remove MemoryBlockWrapper * Some nits * Do not perform address translation when size is 0 * Address PR feedback and format NativeInterface class * Remove ghost parameter description * Update Ryujinx.Cpu to .NET Core 3.1 * Address PR feedback * Fix build * Return a well defined value for GetPhysicalAddress with invalid VA, and do not return unmapped ranges as modified * Typo
2020-04-25Do temp constant copy for CompareAndSwap, other improvements to PreAllocator ↵gdkchan
(#1126) * Do temp constant copy for CompareAndSwap, other improvements to PreAllocator * Nit
2020-04-20Avoid temporaries when pre-allocating Store %x, imm8/16/32 (#1123)Ficture Seven
2020-04-09Optimize %x ^ %x = 0 (#1094)Ficture Seven
* JIT: Optimize %x ^ %x = 0 * Address feedback * Fix typo
2020-03-25Add Fast Paths for Crypto instructions (A32/A64) (#1026)riperiperi
* Add Fast Paths for Crypto instructions (A32/A64) * Replace additional XOR with passing in const zero.
2020-03-18CodeGen Optimisations (LSRA and Translator) (#978)riperiperi
* Start of JIT garbage collection improvements - thread static pool for Operand, MemoryOperand, Operation - Operands and Operations are always to be constructed via their static helper classes, so they can be pooled. - removing LinkedList from Node for sources/destinations (replaced with List<>s for now, but probably could do arrays since size is bounded) - removing params constructors from Node - LinkedList<> to List<> with Clear() for Operand assignments/uses - ThreadStaticPool is very simple and basically just exists for the purpose of our specific translation allocation problem. Right now it will stay at the worst case allocation count for that thread (so far) - the pool can never shrink. - Still some cases of Operand[] that haven't been removed yet. Will need to evaluate them (eg. is there a reasonable max number of params for Calls?) * ConcurrentStack instead of ConcurrentQueue for Rejit * Optimize some parts of LSRA - BitMap now operates on 64-bit int rather than 32-bit - BitMap is now pooled in a ThreadStatic pool (within lrsa) - BitMap now is now its own iterator. Marginally speeds up iterating through the bits. - A few cases where enumerators were generated have been converted to forms that generate less garbage. - New data structure for sorting _usePositions in LiveIntervals. Much faster split, NextUseAfter, initial insertion. Random insertion is slightly slower. - That last one is WIP since you need to insert the values backwards. It would be ideal if it just flipped it for you, uncomplicating things on the caller side. * Use a static pool of thread static pools. (yes.) Prevents each execution thread creating its own lowCq pool and making me cry. * Move constant value to top, change naming convention. * Fix iteration of memory operands. * Increase max thread count. * Address Feedback
2020-03-12Use a Jump Table for direct and indirect calls/jumps, removing transitions ↵riperiperi
to managed (#975) * Implement Jump Table for Native Calls NOTE: this slows down rejit considerably! Not recommended to be used without codegen optimisation or AOT. - Does not work on Linux - A32 needs an additional commit. * A32 Support (WIP) * Actually write Direct Call pointers to the table That would help. * Direct Calls: Rather than returning to the translator, attempt to keep within the native stack frame. A return to the translator can still happen, but only by exceptionally bubbling up to it. Also: - Always translate lowCq as a function. Faster interop with the direct jumps, and this will be useful in future if we want to do speculative translation. - Tail Call Detection: after the decoding stage, detect if we do a tail call, and avoid translating into it. Detected if a jump is made to an address outwith the contiguous sequence of blocks surrounding the entry point. The goal is to reduce code touched by jit and rejit. * A32 Support * Use smaller max function size for lowCq, fix exceptional returns When a return has an unexpected value and there is no code block following this one, we now return the value rather than continuing. * CompareAndSwap (buggy) * Ensure CompareAndSwap does not get optimized away. * Use CompareAndSwap to make the dynamic table thread safe. * Tail call for linux, throw on too many arguments. * Combine CompareAndSwap 128 and 32/64. They emit different IR instructions since their PreAllocator behaviour is different, but now they just have one function on EmitterContext. * Fix issues separating from optimisations. * Use a stub to find and execute missing functions. This allows us to skip doing many runtime comparisons and branches, and reduces the amount of code we need to emit significantly. For the indirect call table, this stub also does the work of moving in the highCq address to the table when one is found. * Make Jump Tables and Jit Cache dynmically resize Reserve virtual memory, commit as needed. * Move TailCallRemover to its own class. * Multithreaded Translation (based on heuristic) A poor one, at that. Need to get core count for a better one, which means a lot of OS specific garbage. * Better priority management for background threads. * Bound core limit a bit more Past a certain point the load is not paralellizable and starts stealing from the main thread. Likely due to GC, memory, heap allocation thread contention. Reduce by one core til optimisations come to improve the situation. * Fix memory management on linux. * Temporary solution to some sync problems. This will make sure threads exit correctly, most of the time. There is a potential race where setting the sync counter to 0 does nothing (counter stays at what it was before, thread could take too long to exit), but we need to find a better way to do this anyways. Synchronization frequency has been tightened as we never enter blockwise segments of code. Essentially this means, check every x functions or loop iterations, before lowcq blocks existed and were worth just as much. Ideally it should be done in a better way, since functions can be anywhere from 1 to 5000 instructions. (maybe based on host timer, or an interrupt flag from a scheduler thread) * Address feedback minus CompareAndSwap change. * Use default ReservedRegion granularity. * Merge CompareAndSwap with its V128 variant. * We already got the source, no need to do it again. * Make sure all background translation threads exit. * Fix CompareAndSwap128 Detection criteria was a bit scuffed. * Address Comments.
2020-03-10Optimize x64 loads and stores using complex addressing modes (#972)gdkchan
* Optimize x64 loads and stores using complex addressing modes * This was meant to be used for testing
2020-03-05Implement Fast Paths for most A32 SIMD instructions (#952)jduncanator
* Begin work on A32 SIMD Intrinsics * More instructions, some cleanup. * Intrinsics for Move instructions (zip etc) These pass the existing tests. * Intrinsics for some of Cvt While doing this I noticed that the conversion for int/fp was incorrect in the slow path. I'll fix this in the original repo. * Intrinsics for more Arithmetic instructions. * Intrinsics for Vext * Fix VEXT Intrinsic for double words. * Use InsertPs to move scalar values. * Cleanup, fix VPADD.f32 and VMIN signed integer. * Cleanup, add SSE2 support for scalar insert. Works similarly to the IR scalar insert, but obviously this one works directly on V128. * Minor cleanup. * Enable intrinsic for FP64 to integer conversion. * Address feedback apart from splitting out intrinsic float abs Also: bad VREV encodings as undefined rather than throwing in translation. * Move float abs to helper, fix bug with cvt * Rename opc2 & 3 to match A32 docs, use ArgumentOutOfRangeException appropriately. * Get name of variable at compilation rather than string literal. * Use correct double sign mask.
2020-02-17Replace LinkedList by IntrusiveList to avoid allocations on JIT (#931)gdkchan
* Replace LinkedList by IntrusiveList to avoid allocations on JIT * Fix wrong replacements