-
-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-84436: Implement Immortal Objects #19474
gh-84436: Implement Immortal Objects #19474
Conversation
This is ready to review, the CI is finally green. Really no idea why the newly added GC tests are failing on Windows and unfortunately I don't have a Windows machine to debug this. |
Also looping in @vstinner. Finally got around upstreaming this patch since you recently wrote about this on your C-API Improvement Docs |
My first reaction is that this shouldn't become part of the default build because most Python users will not make use of it and then it becomes pure extra overhead. However, I know for some people that it is a useful feature (e.g. pre-fork server architecture that exploits copy-on-write OS memory management). I would use it myself since I write web applications with that style. Would it be okay to make this a compile time option, disabled by default? I think in general it is a bad idea to have too many of those types of build options. It makes code maintenance and testing more difficult. Some example build variations from the past that caused issues: thread/no-threads, Unicode width, various debug options (@vstinner removed some of those). So, I'm not super excited about introducing a new build option. Is it possible we can leverage this extra status bit on objects to recover the lost performance somehow? A couple years ago I did a "tagged pointer" experiment that used a similar bit. In that case, small integers became one machine word in size and also become immortal. Another thought: when you did your testing, were any objects made immortal? I would imagine that, by default, you could make everything immortal after initial interpreter startup. You are paying for an extra test+branch in INCREF and DECREF but for many objects (e.g. None, True, False, types) you avoid dirtying the memory/cache with writes to the reference count. |
@nascheme you should definitely join the conversation happening in the bug report of this PR https://bugs.python.org/issue40255
Exactly, this change might be a feature for CPython power users
Yeah, that's probably the best option. That's also the consensus in the bug report thread (if the change is approved)
Yeah that's one of the drawbacks. That being said, I can help with setting up the travis build to integrate this change if needed (cc @vstinner).
We can indeed, I think somebody also mentioned that in the bug report. A potentially good place could be In theory we could optimize even further to reduce the perf cost. By leveraging saturated adds and conditional moves we could remove the branching instruction. I haven't explored this further since the current PR was good enough. Personally, I favor the current PR, but this could be changed to:
|
Not only that, we would need specialized buildbots to test the code base with this option activated in a bunch of supported platforms and that raises the maintainance costs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feature sounds controversial, so I block it until a consensus can be reached.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Three years in the making and it's finally there! Huge shoutouts to @ericsnowcurrently for working together with me throughout the years to get this all the way through, you rock!! Also, thanks @markshannon @gvanrossum and @pablogsal for being a sounding board for ideas, reviews, and coaching on the messaging of the PR / PEP! |
Congrats @eduardo-elizondo! 🔥 I followed this PR from the beginning, even needed to port the changes (in 2020) into python 3.9 due to the memory issue with multiprocessing for my previous company. At some point, I felt this was not going to be in Python due to the inactivity here (especially the review). |
@ericsnowcurrently @eduardo-elizondo One of the acceptance conditions for pep 683 was to update the pep with final benchmark results. It appears that has not been done. Are the numbers available somewhere? I want to determine whether the performance regressions reported in #109049 are due to this PR (or perhaps other PRs as well). In discourse: pep-683-immortal-objects-using-a-fixed-refcount-round-4-last-call there is some discussion on whether to add the performance numbers or not. After that message the pep has been updated, but not the performance numbers. |
@eduardo-elizondo, do you have time to update the PEP with final benchmark results for ea2c001? |
Note: https://speed.python.org/ shows an important performance regression in maybe 1/3rd of all benchmarks, dated Apr 22, which is the date this PR was merged. The most significant I've seen so far (which, if reproduced locally, could help figure out the root cause) is unpack_sequence. |
@ericsnowcurrently we actually already have the benchmark numbers here: #19474 (comment) which I ran right before merging and there's only test/lint fixes on top of that. This shows roughly a ~1.02x geometric mean regression (~1.03x on MSVC). Let me know if this is what we are looking for! To be clear - there are will be both slower and faster benchmarks. However, we should focus on the geometric mean (rather than a single benchmark) which is our best proxy when benchmarks move in both ways. Separately, performance measurements can come up with wildly different results on different environments. For these experiments I used gcc-11.1 and MSVC v14.33 on 'lab-like' bare-metal machines which resulted in consistently reproducible results. |
There is a discrepancy between the text in PEP 683 and the actual implementation, sections Accidental Immortality and Accidental De-Immortalizing, on 64-bit machines. If it is already mentioned somewhere, sorry about that; but it might be useful to mention that inside the PEP itself. The problem is that, unless I'm missing something, the PEP says that 64-bit machines don't have any accidental immortality problems in practice because a 64-bit or close-to-64-bit refcount never overflows; but the implementation instead starts to consider objects immortal as soon as their refcount reaches 31 bits, a much more reachable value on 64-bit machines. For example, a The issues listed in Accidental De-Immortalizing are particularly problematic in this situation. I have not tested it, but looking at the source code it seems that this kind of code would crash CPython on 64-bit machines:
Again, this is not a real bug report, because it seems that this kind of issue was considered and is supposedly passed as an acceptable trade-off for using stable-ABI modules compiled with older versions of CPython. This is more a missing documentation issue. |
Note: a potential fix for this issue would be to change Py_INCREF on 64-bit platforms: instead of checking if The value to initialize immortal objects with would be something like |
...or, change _Py_IsImmortal to (Py_ssize_t)refcount<0 on 64-bit, and then just use _Py_IsImmortal in both INCREF and DECREF on both 32- and 64-bit platforms, and be done with it? This should Always Just Work(tm) if we reasonably assume that it's completely impossible to repeat a loop (Here I'm working with the implicit never-documented assumption that people first tried to use the 32-bit code directly on 64-bit, with the immortal value 0x3fffffffffffffff, but found that it has some performance impact on Intel. That would be because a constant value that doesn't fit 32 bits does indeed have a cost. That's why I'm suggesting here to use (Py_ssize_t)refcount < 0, which is simpler and might be even cheaper than the current 32- and 64-bit mix of arithmetic on the same refcount.) |
@arigo, thanks for the feedback, both about speed.python.org and about accidental de-immortalization. @eduardo-elizondo has the insight we need (which I don't) in both cases, so I'll defer to what he has to say. @markshannon may be have some thoughts as well, at least about the refcount corner case. |
@ariago Thanks for the reply, just went through it in detail. First of all, pretty much all of what you are saying in the first message is correct, though there are some additional details that would help complement what you wrote above. Let me try to reply to your message by breaking it down into what I believe are the main questions:
Overall, yes, the PEP talks about a very high value which is what we wanted to reflect there and left the value to be an implementation detail. Over here, we ended up using a saturated 32-bits because it provides a wide number of benefits:
No need to test it, it will play out exactly as you mentioned! For this one, Eric and I talked about this exact scenario and even surfaced this during a language summit to the core-devs. However, we believe that it’s a very contrived example as it would require a large amount of asymmetric decrefs. In reality, what happens with very large objects such as these is that they either live throughout the entire execution of the application or there’s a combination of symmetric increfs and decrefs that prevent this from happening. I did end up testing this in a very large machine with a large application where 16GB is relatively small with hundreds of older stable-ABI modules and didn’t see this issue materialize. Of course, this is just a single application but I was indeed on the lookout to make sure that this precise scenario was not prevalent. The good thing for this one though is that as we go into newer python versions and we keep updating our C-Extensions the risk here will become less and less prevalent.
At some point I did indeed try a solution similar to this, however, this ended up being in conflict with some of the refcount manipulations that the GC does on the two most significant bits and would ignore the fact that an object is immortal causing incorrect behavior. Not only that but also, for the reasons mentioned above, we wanted to keep the entire refcount arithmetic with just 32-bits. A considered alternative (but never implemented) solution would be to make the GC immortal object aware and then go for the 64-bit solution, but this would required a bit more work on the gcmodule not to mention the added complexity in the module. There might be a simple solution there that I never figured out! However, this would still imply that we would all the bits that we now freed for future use cases. Given the acceptance of 703 we might as well keep refcounts as 32-bits.
Using all 64-bits causes the issue that I pointed above with the GC. But we could indeed use this check for the lower 32 bits (which we already do in decref). This was actually the original implementation and as you mentioned, this is what we tried first. The reason we did a more specialized check in incref is just due to improved perf. |
@ariago Overall though these are great questions and I'll be happy to update and revise the wording of the PEP if you think it'll be useful! Let me know what you think! 🙂 |
pinging @arigo because there was a typo in the account name in the most recent messages |
Adding a known, rare, hard-to-debug crash in CPython would be worth 1-2% of performance? I personally disagree with that but I haven't been a CPython contributor for many years now, so I have nothing to add. |
#define PyObject_HEAD_INIT(type) \ | ||
{ \ | ||
_PyObject_EXTRA_INIT \ | ||
{ 1 }, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems incorrect. A statically allocated object is immortal, regardless of whether it is in the core or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right! I even thought about making this the default behavior.
However, there could be cases in extension code with asserts/tests on exact refcount values. This would break those builds/tests. You can even see it in this PR where I had to change _testembed.c to modify the Py_REFCNT(str1) == 1
check to _Py_IsImmortal(str1)
There's an argument on whether or not having any exact refcounts checks are correct, but I didn't want to challenge it at the time. Thus, to avoid any problems with extension code and reduce the surface area of impact, I localized it to just affect the CPython build.
This is the implementation of PEP683
Motivation
The PR introduces the ability to immortalize instances in CPython which bypasses reference counting. Tagging objects as immortal allows up to skip certain operations when we know that the object will be around for the entire execution of the runtime.
Note that this by itself will bring a performance regression to the runtime due to the extra reference count checks. However, this brings the ability of having truly immutable objects that are useful in other contexts such as immutable data sharing between sub-interpreters.
https://bugs.python.org/issue40255