Diagnosing and Fixing Reference Leaks in CPython

CPython’s memory management relies on each object’s reference count. Every object has its own reference count. When something takes a new reference to the object, the count must be increased with the Py_INCREF macro. Conversely, when the referrer no longer needs the object, it must decrease the count with the Py_DECREF macro. When the reference count drops to 0, the object is deallocated immediately.

/* Nothing is actually declared to be a PyObject, but every pointer to
 * a Python object can be cast to a PyObject*.  This is inheritance built
 * by hand.  Similarly every pointer to a variable-size Python object can,
 * in addition, be cast to PyVarObject*.
 */
typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;
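
The effect of taking and releasing references can be observed from Python with sys.getrefcount(). This is a minimal sketch rather than code from CPython: the helper name refcount_deltas is made up for illustration, and because the absolute value returned by sys.getrefcount() depends on the interpreter version, only differences are compared.

```python
import sys


def refcount_deltas():
    """Return how the refcount of a fresh object changes as references come and go."""
    obj = object()
    base = sys.getrefcount(obj)      # baseline; the absolute value is version-dependent

    holders = [obj]                  # the list now owns one more reference
    after_add = sys.getrefcount(obj) - base

    holders.clear()                  # the list drops its reference again
    after_drop = sys.getrefcount(obj) - base

    return after_add, after_drop


print(refcount_deltas())
```

On CPython the first value should be 1 and the second 0: storing the object in a list is the Python-level counterpart of a Py_INCREF, and clearing the list is the matching Py_DECREF.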

The problem is that when the programmer does not manage an object’s reference count correctly, a reference leaks: the object is never deallocated, because its reference count never drops to 0.
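
A leak of this kind can be imitated from Python by calling the C API through ctypes. The sketch below is only an illustration and assumes CPython: Py_IncRef and Py_DecRef are the function forms of the two macros, while Payload, simulate_leak and release_leak are hypothetical names. It takes one extra reference that nothing releases, then shows that the object survives until that reference is balanced.

```python
import ctypes
import weakref

# Function forms of the Py_INCREF / Py_DECREF macros, reached through ctypes.
_incref = ctypes.pythonapi.Py_IncRef
_incref.argtypes = [ctypes.py_object]
_incref.restype = None
_decref = ctypes.pythonapi.Py_DecRef
_decref.argtypes = [ctypes.py_object]
_decref.restype = None


class Payload:
    """Plain class, so that instances can be weakly referenced."""


def simulate_leak():
    obj = Payload()
    alive = weakref.ref(obj)   # observes the object without owning a reference
    _incref(obj)               # extra reference that nobody will release
    del obj                    # drop the only regular reference
    return alive


def release_leak(alive):
    obj = alive()              # temporary strong reference
    _decref(obj)               # balance the stray Py_IncRef
    del obj                    # now the count can reach 0


alive = simulate_leak()
print(alive() is not None)     # the object survived: its reference leaked
release_leak(alive)
print(alive() is None)         # the object was finally deallocated
```

A C extension that forgets a Py_DECREF ends up in the state left by simulate_leak, except that there is usually no weak reference around to reveal it.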

How to diagnose reference leaks?

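CPython ships a module called “test” that can check for reference leaks with its -R option. Because -R relies on sys.gettotalrefcount(), it needs a debug build of CPython (configured with --with-pydebug). The option is described as follows: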

-R runs each test several times and examines sys.gettotalrefcount() to see if the test appears to be leaking references. The argument should be of the form stab:run:fname where ‘stab’ is the number of times the test is run to let gettotalrefcount settle down, ‘run’ is the number of times further it is run and ‘fname’ is the name of the file the reports are written to. These parameters all have defaults (5, 4 and “reflog.txt” respectively), and the minimal invocation is ‘-R :’.
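
To make the stab:run:fname form concrete, here is a small sketch of how such an argument could be split, using the defaults quoted above. The function parse_huntrleaks is a hypothetical helper written for this article, not the parser used by the test module.

```python
def parse_huntrleaks(spec, defaults=(5, 4, "reflog.txt")):
    """Split a 'stab:run:fname' string, filling in defaults for empty fields."""
    parts = spec.split(":")
    if not 2 <= len(parts) <= 3:
        raise ValueError("expected 'stab:run' or 'stab:run:fname'")
    parts += [""] * (3 - len(parts))          # a missing fname counts as empty
    stab = int(parts[0]) if parts[0] else defaults[0]
    run = int(parts[1]) if parts[1] else defaults[1]
    fname = parts[2] if parts[2] else defaults[2]
    return stab, run, fname


print(parse_huntrleaks(":"))      # (5, 4, 'reflog.txt')
print(parse_huntrleaks("3:3"))    # (3, 3, 'reflog.txt')
```

With -R 3:3 each test is therefore run 3 + 3 times, which matches the “beginning 6 repetitions” line in the output shown in this article.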

For example, we can run all of CPython’s unit tests and check them for reference leaks:

$ ./python -m test -R 3:3                                                  
== CPython 3.7.0a0 (heads/master:ff48739ed0, Jun 7 2017, 10:55:13) [GCC 6.3.1 20170306]
== Linux-4.11.3-1-ARCH-x86_64-with-arch little-endian
== hash algorithm: siphash24 64bit
== cwd: /home/grd/Python/cpython/build/test_python_19384
== CPU count: 4
== encodings: locale=UTF-8, FS=utf-8
Testing with flags: sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=0, verbose=0, bytes_warning=0, quiet=0, hash_randomization=1, isolated=0)
Run tests sequentially
0:00:00 load avg: 0.77 [  1/405] test_grammar
beginning 6 repetitions
123456
......
0:00:00 load avg: 0.77 [  2/405] test_opcodes
beginning 6 repetitions
123456
......
0:00:00 load avg: 0.77 [  3/405] test_dict

Or run a single test file:

$ ./python -m test -R 3:3 test_threading
Run tests sequentially
0:00:00 load avg: 0.29 [1/1] test_threading
beginning 6 repetitions
123456
..

Or use the -m option to run only the test methods whose names match:

$ ./python -m test -R 3:3 test_threading -m test_various_ops 
Run tests sequentially
0:00:00 load avg: 0.47 [1/1] test_threading
beginning 6 repetitions
123456
......
1 test OK.

Total duration: 189 ms
Tests result: SUCCESS

When a test leaks, the run fails and reports the leaked reference counts:

$ ./python -m test -R 3:3 test_threading -m test_threads_join_2  
Run tests sequentially
0:00:00 load avg: 0.24 [1/1] test_threading
beginning 6 repetitions
123456
......
test_threading leaked [3, 3, 3] references, sum=9
test_threading failed

1 test failed:
    test_threading

Total duration: 1 sec
Tests result: FAILURE

How to fix: the leak in test.support.run_in_subinterp

The leaks were reported by Victor Stinner on the Python core-mentorship mailing list, in the thread “New easy C issues: reference leaks”, which covers bpo-30536 and bpo-30547. Anyone who wants to reproduce the leaks can check out commit 65ece7ca2366308fa91a39a8dfa255e6bdce3cca.

The strategy for fixing a reference leak in CPython has three steps. First, comment out as much code as possible to get a minimal piece of code that reproduces the leak. Second, once the leaking call is identified, use git bisect to find the first bad commit. Third, inspect that commit to find where the reference leaks.

Let us apply the methodology.

First: comment out as much code as possible

test_threads_join_2 lives in test_threading.py; here is the full test:

def test_threads_join_2(self):
    # Same as above, but a delay gets introduced after the thread's
    # Python code returned but before the thread state is deleted.
    # To achieve this, we register a thread-local object which sleeps
    # a bit when deallocated.
    r, w = os.pipe()
    self.addCleanup(os.close, r)
    self.addCleanup(os.close, w)
    code = r"""if 1:
        import os
        import threading
        import time

        class Sleeper:
            def __del__(self):
                time.sleep(0.05)

        tls = threading.local()

        def f():
            # Sleep a bit so that the thread is still running when
            # Py_EndInterpreter is called.
            time.sleep(0.05)
            tls.x = Sleeper()
            os.write(%d, b"x")
        threading.Thread(target=f).start()
        """ % (w,)
    ret = test.support.run_in_subinterp(code)
    self.assertEqual(ret, 0)
    # The thread was joined properly.
    self.assertEqual(os.read(r, 1), b"x")

To comment out as much code as possible, we can comment out everything first and then uncomment from top to bottom. After a few tries, we find that the leak comes from this line:

ret = test.support.run_in_subinterp(code)
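
The “uncomment from top to bottom” procedure can be pictured as a loop that enables one more statement at a time and asks whether the leak has appeared yet. This is a hypothetical sketch: first_leaking_step and the leaks predicate are invented for illustration, and in practice the predicate is “rerun the test with -R 3:3 and see whether it reports a leak”.

```python
def first_leaking_step(steps, leaks):
    """Enable steps one by one from the top; return the index whose addition first leaks."""
    for enabled in range(1, len(steps) + 1):
        if leaks(steps[:enabled]):       # rerun with only the first `enabled` steps active
            return enabled - 1
    return None                          # no prefix of the steps leaks


# Statement labels loosely following the body of test_threads_join_2.
steps = ["os.pipe", "addCleanup", "build code string", "run_in_subinterp", "assertEqual"]


def leaks(active):
    # Stand-in for running the reduced test under -R 3:3.
    return "run_in_subinterp" in active


print(steps[first_leaking_step(steps, leaks)])
```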

Digging into that function, we apply the same strategy to its code:

def run_in_subinterp(code):
    """
    Run code in a subinterpreter. Raise unittest.SkipTest if the tracemalloc
    module is enabled.
    """
    # Issue #10915, #15751: PyGILState_*() functions don't work with
    # sub-interpreters, the tracemalloc module uses these functions internally
    try:
        import tracemalloc
    except ImportError:
        pass
    else:
        if tracemalloc.is_tracing():
            raise unittest.SkipTest("run_in_subinterp() cannot be used "
                                     "if tracemalloc module is tracing "
                                     "memory allocations")
    import _testcapi
    return _testcapi.run_in_subinterp(code)

The critical line is here:

return _testcapi.run_in_subinterp(code)

It calls into the _testcapi C extension module, which can be found at Modules/_testcapimodule.c.

Another way to verify that the leak is in the function run_in_subinterp is to find other tests that use the same function and leak too. We can verify this with test_atexit, test_capi and test_threading, which all leak in the same place.
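
When several test modules are checked this way, it can help to pull the leak lines out of the output mechanically. The sketch below assumes the report format shown earlier in this article (“test_threading leaked [3, 3, 3] references, sum=9”); parse_leaks and the sample text are illustrative only.

```python
import re

# Matches lines such as: test_threading leaked [3, 3, 3] references, sum=9
LEAK_LINE = re.compile(r"^(\S+) leaked \[([-\d, ]+)\] references, sum=(-?\d+)$")


def parse_leaks(output):
    """Map each leaking test module to its list of per-run reference deltas."""
    leaks = {}
    for line in output.splitlines():
        match = LEAK_LINE.match(line.strip())
        if match:
            leaks[match.group(1)] = [int(n) for n in match.group(2).split(",")]
    return leaks


sample = """\
beginning 6 repetitions
123456
......
test_threading leaked [3, 3, 3] references, sum=9
test_threading failed
"""
print(parse_leaks(sample))
```

Feeding it the combined output of a run over test_atexit, test_capi and test_threading would list every module that still leaks.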

Second: git bisect to find the bad commit

One thing to keep in mind: the bug may not have been introduced in run_in_subinterp in _testcapimodule.c itself; it may have been introduced in a function that it calls, somewhere else in the code base.

So, use git log Modules/_testcapimodule.c to find a commit from far back, check it out and build it, then rerun the test to see whether that commit is good:

# I choose commit 13e602ea0f08e8c04d635356375d1d2ab5a9b964
$ git checkout 13e602ea0f08e8c04d635356375d1d2ab5a9b964
$ cp Modules/Setup.dist Modules/Setup
$ make -j8
$ ./python -m test -R 3:3 test_threading -m test_threads_join_2
Run tests sequentially
0:00:00 load avg: 0.49 [1/1] test_threading
beginning 6 repetitions
123456
......
1 test OK.

Total duration: 1 sec
Tests result: SUCCESS
$

Great, this is a good commit, so we can use git bisect to find the first bad commit:

$ git bisect start
$ git bisect bad master
$ git bisect good 13e602ea0f08e8c04d635356375d1d2ab5a9b964
Bisecting: 3781 revisions left to test after this (roughly 12 steps)
[4b9abf3a27185aaceb6db39ef1e1fa784f420b4f] merge 3.5
$
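
git bisect cuts the remaining range roughly in half with every commit that is tested, which is why a few thousand revisions need only about a dozen steps. A minimal sketch of that search follows, with commits simplified to list indexes and a hypothetical is_bad predicate standing in for “rebuild and rerun the test”. Real history is a graph rather than a list, so this is only an approximation of what git does.

```python
def find_first_bad(n_commits, is_bad):
    """Binary-search commits 0..n_commits-1 (oldest first) for the first bad one.

    Assumes the newest commit is bad and that every commit after the first
    bad one is bad as well.
    """
    lo, hi = 0, n_commits - 1
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1                 # one rebuild-and-test cycle
        if is_bad(mid):
            hi = mid               # the first bad commit is mid or older
        else:
            lo = mid + 1           # the first bad commit is newer than mid
    return lo, steps


first_bad, steps = find_first_bad(3781, lambda index: index >= 2500)
print(first_bad, steps)
```

For 3781 candidates the loop never needs more than 12 iterations, since 2 to the power of 12 is 4096.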

Then we rebuild, rerun the test and check whether this build leaks references. If it leaks, type git bisect bad; otherwise type git bisect good. At the end, you will get the first bad commit. (Remember to run git bisect reset when you are done.)
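
The manual good/bad loop can also be handed to git bisect run, which judges each commit by the exit code of a script: 0 means good, 125 means “skip this commit”, and other codes from 1 to 127 mean bad. The decision part of such a script might look like the sketch below. The function name and its inputs are assumptions; a complete script would also have to run the build and the test and pass their results in.

```python
def bisect_exit_code(build_ok, test_output):
    """Translate one build-and-test cycle into an exit code for git bisect run."""
    if not build_ok:
        return 125                 # commit cannot be built: ask git bisect to skip it
    if " leaked [" in test_output:
        return 1                   # the refleak report is present: bad commit
    return 0                       # no leak reported: good commit


print(bisect_exit_code(True, "test_threading leaked [3, 3, 3] references, sum=9"))
print(bisect_exit_code(True, "1 test OK."))
print(bisect_exit_code(False, ""))
```

Returning 125 for commits that fail to build keeps an unrelated build breakage from being mistaken for the refleak.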

Two commits introduced refleaks:

6b4be195cd8868b76eb6fbe166acc39beee8ce36 bad
f9169ce6b48c7cc7cc62d9eb5e4ee1ac7066d14b good

1abcf6700b4da6207fe859de40c6c1bada6b4fec bad
c842efc6aedf979b827a9473192f46cec53d58db good

Check whether you found them too!

Third: find where the refleak is

Thanks to the workflow having migrated to GitHub, we can see the changes on GitHub. We can find the refleak points at initexternalimport, and here. (How do we find them? Apply the first step: comment out as much as possible until the leak point is found.)

This will be the hardest part, because you will need to test in different places, and the changes in a pull request may be huge, like this one.

After that, rebuild, rerun the leaking test, then run the full test suite to check that nothing else was affected!

Conclusion

The full fix for these refleaks was sent by @matrixise in #1995, Co-Authored-By: Victor Stinner and Louie Lu.

If you have any questions about fixing reference leaks, tweet me or leave a reply.
