Instead of starting from the first image every time we generate a mip,
skip 3 levels: each mip is generated from the mip 3 levels above it, an
8x8 => 1x1 reduction.
This provides a good compromise between performance and quality - beyond
an 8x8 reduction, the extra rounding errors introduced by the temporary
copies become insignificant.
This makes generating a mip chain much faster; full compression of the
2K color texture drops from 4.7 to 3.7 seconds after this change.
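Roughly the idea, as a minimal sketch with a plain box filter and a trivial
RGBA8 image type; none of the names or helpers below are the encoder's real
code:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct image {
        uint32_t w = 0, h = 0;
        std::vector<uint8_t> rgba; // w * h * 4 bytes
    };

    // Plain box filter: reduce 'src' by 'factor' in each dimension (2, 4 or 8 here).
    static image box_reduce(const image& src, uint32_t factor)
    {
        image dst;
        dst.w = std::max<uint32_t>(src.w / factor, 1);
        dst.h = std::max<uint32_t>(src.h / factor, 1);
        dst.rgba.resize(size_t(dst.w) * dst.h * 4);

        for (uint32_t y = 0; y < dst.h; ++y)
            for (uint32_t x = 0; x < dst.w; ++x)
                for (uint32_t c = 0; c < 4; ++c) {
                    uint32_t sum = 0, count = 0;
                    for (uint32_t sy = y * factor; sy < std::min((y + 1) * factor, src.h); ++sy)
                        for (uint32_t sx = x * factor; sx < std::min((x + 1) * factor, src.w); ++sx) {
                            sum += src.rgba[(size_t(sy) * src.w + sx) * 4 + c];
                            ++count;
                        }
                    dst.rgba[(size_t(y) * dst.w + x) * 4 + c] = uint8_t(sum / std::max<uint32_t>(count, 1));
                }
        return dst;
    }

    // Each mip is produced from the mip 3 levels above it (at most an 8x8 => 1x1
    // reduction per step) instead of re-filtering the full-size image every time.
    static std::vector<image> build_mip_chain(const image& base, int levels)
    {
        std::vector<image> mips{base};
        for (int level = 1; level < levels; ++level) {
            int src_level = std::max(level - 3, 0);      // skip at most 3 levels back
            uint32_t factor = 1u << (level - src_level); // 2, 4 or 8 per axis
            mips.push_back(box_reduce(mips[src_level], factor));
        }
        return mips;
    }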
This change optimizes the perceptual color distance function in a few
steps:
1. Instead of converting each color to YCrCb and then taking the
deltas, we compute the deltas in RGB and then convert the deltas to
YCrCb space. This uses the fact that (R1 - Y1) - (R2 - Y2) =
(R1 - R2) - (Y1 - Y2), and that the Y delta can be computed from the
R/G/B deltas using the same coefficients.
2. Instead of computing color distances one at a time, we introduce an
SSE2-optimized function that computes the distance from one color to a
block of 4. It also returns the index of the minimum error, since that's
what almost every caller needs. All performance-sensitive locations where
this is feasible are converted to color_distance4 (see the sketch after
this list).
3. In find_optimal_selector_clusters_for_each_block, a lot of time was
spent computing color distances, and they didn't form neat groups of
4. However, because we only have 4 colors to compare against, and 16
colors to choose from, we only ever need a grand total of 64 unique
color deltas per block; the cluster count, on the other hand, is
typically ~200 and we computed 16 deltas for each cluster - so it's
cheaper to precompute all 64 deltas once and just add them up.
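As a rough illustration of steps 1 and 2, here is a minimal sketch of what a
color_distance4-style helper could look like: deltas are taken in RGB and only
then converted to luma/chroma, and four distances are computed at once with
SSE2, returning the index of the smallest error. The color type, luma
coefficients and error weights are illustrative stand-ins, not the encoder's
actual constants or API:

    #include <emmintrin.h>
    #include <cstdint>

    struct color_rgba { uint8_t r, g, b, a; };

    // Distances from 'c' to 4 candidate colors at once; returns the index of the
    // smallest error. Coefficients/weights are BT.601-style placeholders.
    static int color_distance4_sketch(const color_rgba& c, const color_rgba block[4], float dist[4])
    {
        // Gather the 4 candidates into structure-of-arrays form.
        __m128 r4 = _mm_set_ps(float(block[3].r), float(block[2].r), float(block[1].r), float(block[0].r));
        __m128 g4 = _mm_set_ps(float(block[3].g), float(block[2].g), float(block[1].g), float(block[0].g));
        __m128 b4 = _mm_set_ps(float(block[3].b), float(block[2].b), float(block[1].b), float(block[0].b));

        // Step 1: take the deltas in RGB first...
        __m128 dr = _mm_sub_ps(_mm_set1_ps(float(c.r)), r4);
        __m128 dg = _mm_sub_ps(_mm_set1_ps(float(c.g)), g4);
        __m128 db = _mm_sub_ps(_mm_set1_ps(float(c.b)), b4);

        // ...then convert the *deltas* to luma/chroma. This works because Y is a
        // linear combination of R/G/B, so Y1 - Y2 can be computed from the deltas.
        __m128 dy = _mm_add_ps(_mm_add_ps(_mm_mul_ps(dr, _mm_set1_ps(0.299f)),
                                          _mm_mul_ps(dg, _mm_set1_ps(0.587f))),
                               _mm_mul_ps(db, _mm_set1_ps(0.114f)));
        __m128 dcr = _mm_sub_ps(dr, dy);
        __m128 dcb = _mm_sub_ps(db, dy);

        // Weighted squared error for all 4 candidates at once.
        __m128 err = _mm_add_ps(_mm_add_ps(_mm_mul_ps(_mm_mul_ps(dy, dy), _mm_set1_ps(4.0f)),
                                           _mm_mul_ps(dcr, dcr)),
                                _mm_mul_ps(dcb, dcb));

        _mm_storeu_ps(dist, err);

        int best = 0;
        for (int i = 1; i < 4; ++i)
            if (dist[i] < dist[best])
                best = i;
        return best;
    }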
Together these optimizations reduce the time to compress a 2Kx2K color
image from 6.6 sec (post #105) to 4.7 sec, including image load and mipmap
generation time.
Note that the results aren't bit-exact due to small floating-point drift
in the perceptual color distance, but PSNR is the same as before.
Using an observation similar to the previous change, we don't need to
evaluate up to 16 color distances for each of the 64 palette entries -
instead, we can precompute ahead of time the per-pixel errors introduced
by changing the selector.
This reduces the number of color_distance evaluations, which are fairly
expensive, and further improves encode times - on a 2Kx2K image on a
Core i7-8700K, the previous change reduced the time from 8.5 seconds to
7.3 seconds, and this one further reduces it to 6.6 seconds.
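Roughly how the precomputed error table can work, as a minimal sketch. The
types and names (color_rgba, color_distance, find_best_palette_entry) are
illustrative stand-ins rather than the encoder's actual API, and the stand-in
color_distance is a plain squared RGB distance, not the real perceptual
metric:

    #include <cfloat>
    #include <cstdint>

    struct color_rgba { uint8_t r, g, b, a; };

    // Stand-in for the (expensive) perceptual color distance.
    static float color_distance(const color_rgba& a, const color_rgba& b)
    {
        float dr = float(a.r) - float(b.r);
        float dg = float(a.g) - float(b.g);
        float db = float(a.b) - float(b.b);
        return dr * dr + dg * dg + db * db;
    }

    static uint32_t find_best_palette_entry(
        const color_rgba pixels[16],      // source pixels of the 4x4 block
        const color_rgba block_colors[4], // decoded colors for the 4 selector values
        const uint8_t palette[64][16],    // candidate selector blocks, values 0..3
        float& best_err)
    {
        // Precompute the error of assigning each of the 4 selector values to each
        // pixel: 16 * 4 = 64 color_distance calls total, instead of 16 per entry.
        float err[16][4];
        for (int p = 0; p < 16; ++p)
            for (int s = 0; s < 4; ++s)
                err[p][s] = color_distance(pixels[p], block_colors[s]);

        // Evaluating a palette entry is now just 16 table lookups and adds.
        uint32_t best = 0;
        best_err = FLT_MAX;
        for (uint32_t e = 0; e < 64; ++e) {
            float total = 0.0f;
            for (int p = 0; p < 16; ++p)
                total += err[p][palette[e][p]];
            if (total < best_err) { best_err = total; best = e; }
        }
        return best;
    }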
Right now, when the compression level isn't 0, we take each 4x4 block and
try to replace its selector indices with one of the 64 entries in the
palette. If we find a resulting selector block that has a low enough
error, we use that instead.
As currently implemented, we decode each block from scratch for each
palette entry, which means we decode each pixel 64 times. However, each
pixel's decoded result depends only on the endpoints (which are static)
and on that pixel's selector (which has only 4 possible values), so each
pixel needs at most 4 decodes.
This change switches the code around so that we pre-decode all 4
variants of the entire block ahead of time and then just index into that
table when evaluating matches, roughly as sketched below.
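A minimal sketch of the pre-decode idea; every name below is an illustrative
stand-in, not the encoder's real API, and the stand-in decoder just
interpolates between two colors (the real ETC1S decode applies an intensity
modifier to a base color instead):

    #include <cfloat>
    #include <cstdint>

    struct color_rgba { uint8_t r, g, b, a; };
    struct block_endpoints { color_rgba lo, hi; }; // fixed while matching

    // Stand-in decoder: maps a 2-bit selector to a color for the given endpoints.
    static color_rgba decode_selector_color(const block_endpoints& ep, uint32_t s)
    {
        auto lerp = [&](uint8_t a, uint8_t b) {
            return uint8_t(int(a) + (int(b) - int(a)) * int(s) / 3);
        };
        return { lerp(ep.lo.r, ep.hi.r), lerp(ep.lo.g, ep.hi.g), lerp(ep.lo.b, ep.hi.b), 255 };
    }

    // Stand-in error metric (the real code uses the perceptual color distance).
    static float pixel_error(const color_rgba& a, const color_rgba& b)
    {
        float dr = float(a.r) - float(b.r);
        float dg = float(a.g) - float(b.g);
        float db = float(a.b) - float(b.b);
        return dr * dr + dg * dg + db * db;
    }

    static uint32_t match_selector_palette(
        const color_rgba pixels[16],   // source pixels of the 4x4 block
        const block_endpoints& ep,     // static for the whole block
        const uint8_t palette[64][16]) // candidate selector blocks, values 0..3
    {
        // Pre-decode all 4 possible pixel colors once per block (the endpoints
        // are static and a selector has only 4 values), instead of re-decoding
        // the whole block for each of the 64 palette entries.
        color_rgba variants[4];
        for (uint32_t s = 0; s < 4; ++s)
            variants[s] = decode_selector_color(ep, s);

        uint32_t best_entry = 0;
        float best_err = FLT_MAX;
        for (uint32_t e = 0; e < 64; ++e) {
            float total = 0.0f;
            for (int p = 0; p < 16; ++p)
                total += pixel_error(pixels[p], variants[palette[e][p]]);
            if (total < best_err) { best_err = total; best_entry = e; }
        }
        return best_entry;
    }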
The change results in bit-identical images since we look at the exact same
data, but encoding is faster - on a 2Kx2K PNG on a Core i7-8700K, it
reduces the encoding time, including a full mip chain, from 8.5 seconds
to 7.3 seconds.
e917bd8... by Andrei Alexeyev <email address hidden>: Add some missing C API wrappers
5dad062... by Andrei Alexeyev <email address hidden>: fix C++ crap
43d4103... by Andrei Alexeyev <email address hidden>: Add a C wrapper for the transcoder (untested)
060e52b... by Andrei Alexeyev <email address hidden>