glibc:release/2.30/master

Last commit made on 2024-02-08
Get this branch:
git clone -b release/2.30/master https://git.launchpad.net/glibc

Branch information

Name:
release/2.30/master
Repository:
lp:glibc

Recent commits

5a7440f... by Sunil K Pandey <email address hidden>

x86_64: Optimize ffsll function code size.

The ffsll function randomly regresses by ~20%, depending on how its
code is aligned in memory. The ffsll code size is 17 bytes. Since the
default function alignment is 16 bytes, the function can be loaded at
memory aligned to 16, 32, 48, or 64 bytes. When ffsll is loaded at 16-,
32-, or 64-byte-aligned memory, the entire code fits in a single
64-byte cache line. When it is loaded at 48-byte-aligned memory, the
code splits across two cache lines, hence the random regression.

Reducing the ffsll code size from 17 bytes to 12 bytes ensures that it
always fits in a single 64-byte cache line.

This patch fixes the ffsll function's random performance regression.
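
The cache-line arithmetic can be checked with a small C sketch (an
illustration, not the glibc source; only the 17/12-byte sizes and the
64-byte line width come from the message above):

```
#include <stdio.h>

/* Number of 64-byte cache lines a code body spans when it starts at
   a given offset (offsets are multiples of the 16-byte default
   function alignment).  */
static unsigned
lines_spanned (unsigned start, unsigned size)
{
  return (start + size - 1) / 64 - start / 64 + 1;
}

int
main (void)
{
  for (unsigned start = 0; start < 64; start += 16)
    printf ("offset %2u: 17 bytes -> %u line(s), 12 bytes -> %u line(s)\n",
            start, lines_spanned (start, 17), lines_spanned (start, 12));
  /* Only offset 48 with the 17-byte body spans two lines; the 12-byte
     body fits in one line at every offset.  */
  return 0;
}
```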

Reviewed-by: Carlos O'Donell <email address hidden>
(cherry picked from commit 9d94997b5f9445afd4f2bccc5fa60ff7c4361ec1)

86418cb... by Noah Goldstein <email address hidden>

x86: Fix incorrect scope of setting `shared_per_thread` [BZ# 30745]

The code:

```
    if (shared_per_thread > 0 && threads > 0)
      shared_per_thread /= threads;
```

was accidentally moved inside the `else` scope. This doesn't match its
previous placement (before commit af992e7abd).

This patch fixes that by putting the division after the `else` block.
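
A minimal sketch of the intended shape, with variable names taken from
the snippet above and all surrounding logic elided (not the actual
glibc code; the point is only where the division sits):

```
static long int
per_thread_share (long int shared_per_thread, unsigned int threads,
                  int have_cache_info)
{
  if (have_cache_info)
    {
      /* ... derive shared_per_thread and threads one way ... */
    }
  else
    {
      /* ... fallback path; the division must NOT live here ... */
    }

  /* Correct placement: after the else block, so both paths divide.  */
  if (shared_per_thread > 0 && threads > 0)
    shared_per_thread /= threads;

  return shared_per_thread;
}
```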

(cherry picked from commit 084fb31bc2c5f95ae0b9e6df4d3cf0ff43471ede)

1ebb55d... by Noah Goldstein <email address hidden>

x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold.

On some machines we end up with incomplete cache information. This can
make the new calculation of `sizeof(total-L3)/custom-divisor` end up
lower than intended (and lower than the prior value). So reintroduce
the old bound as a lower bound to avoid potentially regressing code
where we don't have complete information to make the decision.
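
A hedged sketch of the resulting clamp (identifiers are illustrative,
not the actual glibc variables):

```
/* Reintroduce the old 3/4-of-per-thread-L3 figure as a floor under
   the new total-L3/divisor calculation.  */
static unsigned long
nt_threshold (unsigned long total_l3, unsigned long per_thread_l3,
              unsigned long divisor)
{
  unsigned long new_value = total_l3 / divisor;
  unsigned long old_bound = per_thread_l3 * 3 / 4;
  return new_value > old_bound ? new_value : old_bound;
}
```
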
Reviewed-by: DJ Delorie <email address hidden>

(cherry picked from commit 8b9a0af8ca012217bf90d1dc0694f85b49ae09da)

863fc57... by Noah Goldstein <email address hidden>

x86: Fix slight bug in `shared_per_thread` cache size calculation.

After:
```
    commit af992e7abdc9049714da76cae1e5e18bc4838fb8
    Author: Noah Goldstein <email address hidden>
    Date: Wed Jun 7 13:18:01 2023 -0500

        x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`
```

split `shared` (cumulative cache size) from `shared_per_thread` (cache
size per socket), the `shared_per_thread` value *can* be slightly off
from the previous calculation.

Previously we added `core` even if `threads_l2` was invalid, and only
used `threads_l2` to divide `core` if it was present. The changed
version only included `core` if `threads_l2` was valid.

This change restores the old behavior if `threads_l2` is invalid by
adding the entire value of `core`.
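
In rough C (a sketch of the restored logic, not the committed diff):

```
static long int
add_l2_share (long int shared_per_thread, long int core,
              unsigned int threads_l2)
{
  /* Always account for the L2 size (`core`); divide by `threads_l2`
     only when that thread count is valid, otherwise add it whole, as
     the pre-af992e7ab code did.  */
  if (threads_l2 > 0)
    return shared_per_thread + core / threads_l2;
  return shared_per_thread + core;
}
```
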
Reviewed-by: DJ Delorie <email address hidden>

(cherry picked from commit 47f747217811db35854ea06741be3685e8bbd44d)

e813239... by Noah Goldstein <email address hidden>

x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`

The current `non_temporal_threshold` is set to roughly
`3/4 * sizeof_L3 / ncores_per_socket`. This patch updates that value
to roughly `sizeof_L3 / 4`.

The original value (specifically, the division by `ncores_per_socket`)
was chosen to limit the amount of other threads' data a
`memcpy`/`memset` could evict.

Dividing by `ncores_per_socket`, however, leads to exceedingly low
non-temporal thresholds and to using non-temporal stores in cases
where REP MOVSB is multiple times faster.
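
For a sense of scale, consider a hypothetical machine with a 32 MiB
shared L3 and 16 cores per socket (numbers chosen purely for
illustration):

```
#include <stdio.h>

int
main (void)
{
  unsigned long l3 = 32UL << 20;  /* assumed 32 MiB shared L3 */
  unsigned long ncores = 16;      /* assumed cores per socket */
  /* Old formula: 3/4 * sizeof_L3 / ncores_per_socket.  */
  printf ("old threshold: %lu KiB\n", (l3 * 3 / 4 / ncores) >> 10); /* 1536 */
  /* New formula: sizeof_L3 / 4.  */
  printf ("new threshold: %lu MiB\n", (l3 / 4) >> 20);              /* 8 */
  return 0;
}
```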

Furthermore, non-temporal stores are written directly to main memory,
so using them at a size much smaller than the L3 can place
soon-to-be-accessed data much further away than it otherwise would be.
As well, modern machines are able to detect streaming patterns
(especially if REP MOVSB is used) and provide LRU hints to the memory
subsystem. This in effect caps the total amount of eviction at
`1/cache_associativity`, far below meaningfully thrashing the entire
cache.

As best I can tell, the benchmarks that led to this small threshold
were done comparing non-temporal stores against standard cacheable
stores. A better comparison (linked below) is against REP MOVSB,
which, on the measured systems, is nearly 2x faster than non-temporal
stores at the low end of the previous threshold, and within 10% for
copies over 100MB (well past even the current threshold). In cases
with a low number of threads competing for bandwidth, REP MOVSB is
~2x faster up to `sizeof_L3`.

The divisor of `4` is a somewhat arbitrary value. From benchmarks it
seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs
such as Broadwell prefer something closer to `8`. This patch is meant
to be followed up by another one to make the divisor cpu-specific, but
in the meantime (and for easier backporting), this patch settles on
`4` as a middle-ground.

Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable
stores were done using:
https://github.com/goldsteinn/memcpy-nt-benchmarks

Spreadsheet results (also available as a PDF in the GitHub
repository):
https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml
Reviewed-by: DJ Delorie <email address hidden>
Reviewed-by: Carlos O'Donell <email address hidden>

(cherry picked from commit af992e7abdc9049714da76cae1e5e18bc4838fb8)

a0350c8... by Florian Weimer

debug: Mark libSegFault.so as NODELETE

The signal handler installed in the ELF constructor cannot easily
be removed again (because the program may have changed handlers
in the meantime). Mark the object as NODELETE so that the registered
handler function is never unloaded.
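
A minimal sketch of the hazard being avoided (illustrative, not
libSegFault's actual source):

```
/* Built into a shared object: the ELF constructor installs a handler
   that points into this object's code.  */
#include <signal.h>

static void
handler (int sig)
{
  /* ... print a backtrace, as libSegFault does ... */
}

__attribute__ ((constructor))
static void
install (void)
{
  signal (SIGSEGV, handler);
  /* If the object were later unloaded (and the program may have
     changed handlers since, so a destructor cannot safely restore
     anything), the registered handler would dangle.  NODELETE keeps
     the object mapped for the lifetime of the process.  */
}
```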

Reviewed-by: Carlos O'Donell <email address hidden>
(cherry picked from commit 23ee92deea4c99d0e6a5f48fa7b942909b123ec5)

706c8a7... by Noah Goldstein <email address hidden>

x86: Fix wcsnlen-avx2 page cross length comparison [BZ #29591]

The previous implementation adjusted the length (rsi) to match the
byte count (eax), but since there is no bound on the length, this can
overflow.

The fix is to convert the byte count (eax) to a length by dividing by
sizeof (wchar_t) before the comparison.
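
In C terms, the comparison becomes something like the following (a
sketch of the idea only; the real fix is in the AVX2 assembly):

```
#include <stddef.h>
#include <wchar.h>

/* Convert the matched byte count to a wide-character count before
   comparing against the caller's length bound, instead of scaling
   the (unbounded) length up to bytes.  */
static size_t
clamp_result (size_t byte_count, size_t maxlen)
{
  size_t wchar_count = byte_count / sizeof (wchar_t);
  return wchar_count < maxlen ? wchar_count : maxlen;
}
```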

The full check passes on x86-64, and the build succeeds with and
without multiarch.

(cherry picked from commit b0969fa53a28b4ab2159806bf6c99a98999502ee)

5921617... by Sunil K Pandey <email address hidden>

x86-64: Require BMI2 for avx2 functions [BZ #29611]

This patch fixes BZ #29611.

286c89f... by "H.J. Lu" <email address hidden>

x86-64: Require BMI2 for strchr-avx2.S [BZ #29611]

Since strchr-avx2.S, as updated by

commit 1f745ecc2109890886b161d4791e1406fdfc29b8
Author: noah <email address hidden>
Date: Wed Feb 3 00:38:59 2021 -0500

    x86-64: Refactor and improve performance of strchr-avx2.S

uses sarx:

c4 e2 72 f7 c0 sarx %ecx,%eax,%eax

require BMI2 for the strchr-avx2 family of functions in
ifunc-impl-list.c and ifunc-avx2.h.

This fixes BZ #29611.
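
The selection rule then looks roughly like this (a sketch with
stand-in feature flags; the real selector in ifunc-avx2.h dispatches
on glibc's cpu_features machinery, and the 2.30 macro names may
differ):

```
typedef char *(*strchr_t) (const char *, int);

/* glibc's internal implementation names.  */
extern char *__strchr_avx2 (const char *, int);
extern char *__strchr_sse2 (const char *, int);

static strchr_t
select_strchr (int has_avx2, int has_bmi2)
{
  /* The AVX2 body now executes `sarx` (a BMI2 instruction), so AVX2
     alone no longer suffices.  */
  if (has_avx2 && has_bmi2)
    return __strchr_avx2;
  return __strchr_sse2;
}
```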

(cherry picked from commit 83c5b368226c34a2f0a5287df40fc290b2b34359)

d0aac8b... by Andreas Schwab <email address hidden>

Add test for bug 29530

This tests for a bug that was introduced in commit edc1686af0 ("vfprintf:
Reuse work_buffer in group_number") and fixed as a side effect of commit
6caddd34bd ("Remove most vfprintf width/precision-dependent allocations
(bug 14231, bug 26211).").

(cherry picked from commit ca6466e8be32369a658035d69542d47603e58a99)