Merge lp:~simdgenius/inkscape/inkscape into lp:~inkscape.dev/inkscape/trunk

Proposed by Yale Zhang
Status: Needs review
Proposed branch: lp:~simdgenius/inkscape/inkscape
Merge into: lp:~inkscape.dev/inkscape/trunk
Diff against target: 4832 lines (+4736/-8)
3 files modified
src/display/SimpleImage.h (+80/-0)
src/display/gaussian_blur_templates.h (+4006/-0)
src/display/nr-filter-gaussian.cpp (+650/-8)
To merge this branch: bzr merge lp:~simdgenius/inkscape/inkscape
Reviewer: Mc (review: Abstain)
Review via email: mp+307251@code.launchpad.net

Description of the change

This vectorized version of Gaussian blur gives a speedup of about 5x for IIR filtering (more is expected) and ~20x for FIR on a modern processor supporting AVX2.

Searching the mailing lists, users have been complaining about blurs being very slow since the beginning of Inkscape. I think Jasper's use of recursive filters (IIR) was a big help, and so was OpenMP multithreading. But more is better, especially for artists who can't afford to be interrupted by slow screen refreshes.

The code is a monstrosity in terms of size, but it kind of has to be, given all the different cases ({FIR, IIR} x {int16, float, double} x {grayscale, RGBA}). Plus, it needs to support SSE2 (any x86_64 CPU has SSE2), AVX, and AVX2 through runtime dispatch. So, I worry about maintainability.
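
(For reference, the runtime dispatch works roughly like the sketch below. It uses GCC/clang's __builtin_cpu_supports; the kernel names are hypothetical, and the real branch instantiates its templates per target and selects one at startup.)

#include <cstdint>

// Hypothetical per-target kernels; in the branch these come from compiling the
// blur templates with different target options. Bodies elided here.
void gaussian_blur_sse2(uint8_t *img, int w, int h) {}
void gaussian_blur_avx (uint8_t *img, int w, int h) {}
void gaussian_blur_avx2(uint8_t *img, int w, int h) {}

using BlurKernel = void (*)(uint8_t *, int, int);

// __builtin_cpu_supports queries CPUID at runtime, so one binary can carry all
// three code paths and choose the best one available on the user's machine.
static BlurKernel pickBlurKernel()
{
    if (__builtin_cpu_supports("avx2")) return gaussian_blur_avx2;
    if (__builtin_cpu_supports("avx"))  return gaussian_blur_avx;
    return gaussian_blur_sse2;          // baseline: every x86_64 CPU has SSE2
}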

I think I can rewrite maybe 1/2 of the code to use GCC vector extensions instead of direct intrinsics. There was a suggestion from Boudewijn Rempt to use the Vc library that Krita uses, but I don't see much benefit because you can't expect to efficiently abstract all the SIMD operations, especially the ones that operate horizontally across vector lanes (e.g. dot product, horizontal adds, permutes/shuffles).
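
(For illustration, this is the kind of code that maps cleanly onto GCC/clang vector extensions; a minimal sketch with made-up names, not code from the branch.)

typedef float v4sf __attribute__((vector_size(16)));  // 4 floats; vector_size(32) would give an AVX-width type

// One step of the IIR recurrence written with portable element-wise operators.
// The compiler picks SSE or AVX encodings itself, so no per-ISA intrinsics are
// needed for code like this; only lane-crossing operations still need them.
static inline v4sf iir_step(v4sf in, v4sf y1, v4sf y2, v4sf y3,
                            v4sf b0, v4sf a1, v4sf a2, v4sf a3)
{
    return b0 * in + a1 * y1 + a2 * y2 + a3 * y3;
}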

The other concern is how SIMD remainders are handled. The code currently uses partial vector stores so that it never writes past the end of an image, so it should never crash. But it routinely does whole-vector loads that read past the end. This renders memory-checking tools like AddressSanitizer and Valgrind useless because of all the false positives. Ideally, I'd like all image rows to be padded to a multiple of 16 or 32 bytes.
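
(To make the remainder handling concrete, the partial-store technique looks roughly like this sketch for an SSE2 build; the branch's own helpers are PartialVectorMask and the StoreFloats/StoreDoubles templates, and the names below are illustrative.)

#include <emmintrin.h>
#include <cstdint>

// Write only the last n (1..3) floats of a row with a byte-masked store, so
// nothing is ever written past the end of the image row.
static void store_tail_floats(float *dst, __m128 v, int n)
{
    // Sliding a window over a constant table yields a mask whose first n lanes are set.
    static const int32_t maskTable[8] = { -1, -1, -1, -1, 0, 0, 0, 0 };
    __m128i mask = _mm_loadu_si128(reinterpret_cast<const __m128i *>(maskTable + 4 - n));
    // _mm_maskmoveu_si128 stores only the bytes whose mask high bit is set.
    _mm_maskmoveu_si128(_mm_castps_si128(v), mask, reinterpret_cast<char *>(dst));
}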

-----------------other tasks--------------------------
*further speed up IIR - the current cost is ~44 cycles/pixel for RGBA. This is much slower than a lower bound of 7 cycles/pixel = (4 multiplies + 3 adds) x 2 passes / (2 pixels/iteration for AVX), assuming 1 instruction/cycle.

*rewrite the standalone unit test/benchmark as a Google Test (gtest) test.

*document the functions with an SVG animation

lp:~simdgenius/inkscape/inkscape updated
15139. By Martin Owens

Remove the reset on the glyph tangent, it breaks text on path (bug lp:1627523)

15140. By Tavmjong Bah

Update attributes list for rename of 'mesh' to 'meshgradient'.

15141. By Martin Owens

Add a prune method to saving svg files that removes Adobe's i:pgf tag.

15142. By Martin Owens

Adjust dock size to minimum width during canvas table size allocation signal.

15143. By Martin Owens

Merge in the new eraser mode (2) which uses clip instead of cut.

15144. By Tavmjong Bah

Prevent a crash if a mesh is defined in bounding box coordinates.

15145. By Jabiertxof

Fix a bug on eraser mode when previous clip are shapes not paths

15146. By Martin Owens

Merge in jabiertxof's hover information for measure tool

15147. By Mc

xverbs branch merge by dmitry-zhulanov
Add somes "xverbs" that are like verbs but can take arguments, and yaml file parsing for batch commandline operations.

15148. By Martin Owens

Merge jabier's Measure line LPE

15149. By Jabiertxof

Fix bug:1630821 on LPE selected nodes

15150. By Jabiertxof

Fix bug:1630796 on flatten button

15151. By Jabiertxof

Fix bug:1605334 FeImage X and Y position

15152. By Jabiertxof

Fix bug:1630796 on flatten button. Attemp 2

15153. By Jabiertxof

Remove unnecesary header from last push

15154. By Jabiertxof

Fix bug:1622321 on powerstroke

15155. By Sandra Snan

[Bug #770681] KEY MAPPING: Comma and period hijacked by scaling.

15156. By Mc

merge trunk-refactoring

15157. By Tavmjong Bah

Provide simple "preview" for mesh gradients.

15158. By Liam P. White

Fix palette flickering, probably.

15159. By Mc

merge a fix and refactor

15160. By victor.westmann

[Packaging] NSIS translation update for pt_br

15161. By victor.westmann

[Bug #1627166] Brazilian Portuguese translation for 0.92.

15162. By Sveinn í Felli

[Bug #1426423] Updated Icelandic translation for 0.92.

15163. By Tavmjong Bah

Render mesh gradients that reference other mesh gradients.

15164. By marenhachmann

[Bug #1630635] Wrong tool tip for new text line height setting.

15165. By Tavmjong Bah

Better handling of mesh gradients in Paint Selector dialog.

15166. By Tavmjong Bah

Remove unused/undefined function.

15167. By Tavmjong Bah

Ensure newly created meshes have correct 'gradientUnits'.

15168. By Tavmjong Bah

Do not create unused "vector" gradient when creating mesh gradient.

15169. By Tavmjong Bah

Code cleanup: simplify initial mesh color calculation.

15170. By Jabiertxof

Fix bug:1633521 on powerstroke

15171. By Tavmjong Bah

Implement copying of objects with mesh gradients.

15172. By victor.westmann

[Bug #1627166] Brazilian Portuguese translation for 0.92.

15173. By Tavmjong Bah

Add option to scale mesh to fit in bounding box.

15174. By jazzynico

[Bug #1633999] xcf export fails if layer names contain non-ASCII characters.

15175. By Tavmjong Bah

Use geometric bounding box for fill, visual for stroke in creating mesh.

15176. By Mc

update author list in about dialog from AUTHORS file

15177. By Tavmjong Bah

Implement 'vector-effect' value 'non-scaling-stroke'. No GUI yet.

15178. By FirasH

[Bug #1574561] Italian translation update.

15179. By Jabiertxof

Fix bug:1634641 crash on delete

15180. By Mc

Fix CMake dependency order

15181. By Tavmjong Bah

Add 'vector-effect' to attributes test.

15182. By Kris

cosmetic change

15183. By Mc

Fix gradient comparison.

15184. By Mc

update translators

15185. By marenhachmann

[Bug #1635332] Update for German translation.

15186. By Jabiertxof <email address hidden>

Fix bug#1635442

15187. By Patrick Storz

CMake: inkscape.com needs the "-mconsole" linker flag to be useful

15188. By Jordi Mas

[Bug #1636086] Update Catalan translation for Inkscape 0.92.

15189. By Mc

CPPification: almost all sp_object_set_whatever and sp_selection_whatever global functions are now methods of ObjectSet*, with these additional benefits:
- They can now act on any SelectionSet, not just the current selection;
- Whenever possible, they don't need a desktop anymore and can run if called from GUI.

I hope I did not break too many things in the process.

*: So instead of callink sp_selection_move(desktop,x,y), you call myobjectset->move(x,y)

15190. By Mc

Fix test

15191. By Mc

Fix signals

15192. By Mc

Prevent image drag/drop from grouping

15193. By Mc

Fix regression in loop prevention

15194. By Jordi Mas

[Bug #1636086] Update Catalan translation for Inkscape 0.92.

15195. By Mc

allows for denser screens in zoom correction factors

15196. By FirasH

[Bug #1574561] Italian translation update.

15197. By Jabiertxof

Close the bounding box path LPE

15198. By Jabiertxof

Fix fill between many LPE to start up with current path

15199. By Jabiertxof

Update branding folder

15200. By Jabiertxof

Fix bug:1013141 crash deleting LPE

15201. By houz

fix none color in palettes with scrollbars

15202. By houz

fix prefs icon

15203. By Mc

Add some unit tests for object-set cppification

15204. By Mc

Revert two changes from r15177

15205. By Mc

Fix crash in some commandline usage

15206. By Jabiertxof

Execution of update_po_files.sh

15207. By Tavmjong Bah

Render meshes with old syntax using camelCase.

15208. By Jabiertxof

Reformat branding folder

15209. By Jabiertxof

Update branding folder, remove fonts

15210. By Jabiertxof

Fix bug:1639083 crach closing segment with shortcut LPE

15211. By Jabiertxof

Fix bug:1639098

15212. By Jabiertxof

Fix change between multiples LPE in the same item

15213. By Jabiertxof

Fix a bug on duplicate item with multiples LPE on it. previously the LPE become "clones" if more than 1 LPE on the item.
Also wee need to discuss what happends on LPE copied what are inside a group, fork them or clone, currently are cloned
This can be a feature or a bug in the same user with diferent works. My proposal is fork it and add a item in paste LPEs to allow cloned LPE on paste

15214. By Jabiertxof

Fix last commit not working, LPE are cloned on copies

15215. By Jabiertxof

Move a header place

15216. By Jabiertxof

Fix bug on apply bend LPE from pen/cil without cliboard, nothin happends previously

15217. By Jabiertxof

Minor tweak

15218. By Mc

further cppification

15219. By Jabiertxof

Fix some bugs on pen/cil dropdown shapes

15220. By Mc

merge recursive unlink clones branch

15221. By mpasteven

fix cursor on big endian systems

15222. By Jabiertxof

1639832 Blend and blur unspected results

15223. By Mc

annotate custom builds, and add correct revno into make dist tarballs

15224. By Mattia Rizzolo

reproducible builds

15225. By jazzynico

[Bug #262341] Tooltips for LPE tool modes do not show up as translated.

15226. By suv-lp

[Bug #1638472] Quadrant points of ellipse/circle fail to snap (as source or target).

15227. By gordcaswell

[Bug #1639081] recently-used.xbel remaining when run portably.

15228. By FirasH

[Bug #1574561] Italian translation update.

15229. By Tavmjong Bah

Improve mesh handling in Fill and Stroke dialog.
Create new meshes with alternating color/white pattern
(makes it more obvious a mesh has been created).

15230. By Tavmjong Bah

Enable swapping of fill and stroke when one is a mesh.

15231. By Tavmjong Bah

Click-drag selects nodes rather than creates new mesh if mesh already exists.

15232. By Mc

merge boolop branch: Move boolop functions from sp_selected_path_<op> to ObjectSet::path<op>

15233. By Mc

resizable undocked dialogs

15234. By Jordi Mas

[Bug #1636086] Update Catalan translation for Inkscape 0.92.

15235. By mathog

patch for bug 1405292, start clipping with COPY instead of OR so GDI clipping works

15236. By Mc

fix test

15237. By Mc

fix automatic dockbar resizing

15238. By Mc

Fix filter editor update

15239. By Mc

Add a make inkscape_pot to regen potfile

15240. By Mc

Fix selection toolbar icons missing on start

15241. By Mc

Fix rare crash on undo break apart

15242. By Mc

update potfile

15243. By Tavmjong Bah

Fit to bounding box: correct transform when mesh has a non-identity gradient transform.

15244. By Mc

fix build

15245. By Patrick Storz

Packaging: Merge all fixes from 0.92.x branch for NSIS and WiX installers (Windows .exe and .msi)

15246. By Patrick Storz

Tutorials: Rename image files to follow "name.lang_id.ext" scheme
(Allows NSIS installer to autodetect localized files per locale and adjust installation components accordingly)

15247. By Yuri Chornoivan

[Bug #1407331] Ukrainian translation update for 0.92.

15248. By Sylvain Chiron

Translations. French translation update.

15249. By jazzynico

[Bug #1590529] Italian Updates for inkscape docs (0.92.x)

15250. By Tavmjong Bah

Implement tweaking of mesh handle colors.

15251. By Tavmjong Bah

Split selected rows/columns in half using Insert key.

15252. By Jordi Mas

[Bug #1636086] Update Catalan translation for Inkscape 0.92.

15253. By Tavmjong Bah

Ensure getVector() and getArray() return a valid gradient pointer.

15254. By Tavmjong Bah

Do not return invalid vector gradient when switching from mesh to linear/radial gradient.

15255. By Tavmjong Bah

Fix path outline function for meshes with nrow != ncolumn.

15256. By Tavmjong Bah

Fix status bar messages for meshes and gradients.

15257. By Tavmjong Bah

Remove debug line from last commit.

15258. By Tavmjong Bah

Another fix for the status bar with mesh gradients.

15259. By Jabiertxof

Fix #1627817. Bug in knot LPE

15260. By Tavmjong Bah

Improve mouse handling for mesh:
  * Double clicking an object will create a new mesh if one does not exist,
    otherwise clicking a line should now reliably divide the row/column.
  * Click-dragging will create a new mesh if one does not exist,
    otherwise it will do a rubberband selection of corner nodes.
    With Shift will add nodes, without will replace selected nodes.

15261. By su_v

Add Shift-I shortcut for insert node.

15262. By Tavmjong Bah

Preserve selection of corner nodes for some corner operations.

15263. By Jabiertxof

Fix #1643408. Bug in pap LPE

15264. By Tavmjong Bah

Keep corner nodes selected when possible for corner operations.

15265. By gordcaswell

[Bug #1643730] Inkscape Portable language selection not maintained.

15266. By helix84

Fix a typo in inkscape-preferences.cpp.

15267. By Tavmjong Bah

Select mesh nodes by clicking on control lines.

15268. By helix84

 * [INTL:zh_TW] Traditional Chinese translation update

15269. By Lucas Vieites

[Bug #1643818] Updated es.po for 0.92.

15270. By FirasH

[Bug #1574561] Italian translation update.

15271. By Tavmjong Bah

Remove deprecated GtkWidget:wide-seperators which is ignored as of 3.20.

15272. By Tavmjong Bah

Remove deprecated GtkWidget-separator-height, ignored as of 3.20.

15273. By Tavmjong Bah

Provide a way to update a legacy document to account for the 90 to 96 dpi change.
This method relies on setting the 'viewBox'.

15274. By Jabiertxof <jtx@jtx>

Add stroke dash empty to allow render only fills and markers. Tested in FF and Chromium

15275. By scootergrisen

[Bug #1644934] Translation to danish.

15276. By Patrick Storz

Translations/Packaging: Convert Danish translation to UTF8

15277. By jazzynico

[Bug #1644886] Color profiles not loaded on Windows (partial fix).

15278. By Patrick Storz

CMake: Add ${INKSCAPE_SHARE_INSTALL}
This is set to "share/inkscape" by default, on Windows we need to be able to install directly into "share" however

15279. By Patrick Storz

CMake: Explicitly call python
At least on Windows this breaks if Python is not associated with .py files (and even if it is an arbitrary Python version that might be installed on the system is used)

15280. By Patrick Storz

Remove unneeded "#include <arpa/inet.h>" in "cairo-utils.cpp"

15281. By jazzynico

[Bug #1641111] extension Visualize Path/Measure path... fails

15282. By scootergrisen

[Bug #1471443] Updated danish translation for 0.92.

15283. By Mc

update filter list when pasting and on import

15284. By Jabiertxof

Fixes transforms bug in meassure line LPE pointed in IRC by CR and suv

15285. By Jabiertxof

Reorganize SVG Structure have clean meassure line structure

15286. By Tavmjong Bah

Give mesh corner nodes a different color from handle nodes (following node tool coloring).

15287. By FirasH

[Bug #1574561] Italian translation update.

15288. By Tavmjong Bah

Fix bug with mesh handle update when corner moved via keys.

15289. By Tavmjong Bah

Add toggles for handle visibility, editing fill, and editing stroke.

15290. By Tavmjong Bah

Ensure new mesh is immediately editable.

15291. By Alexander Brock <email address hidden>

Improve precision of offset_cubic

15292. By Mc

prevent use of string concat for compatibility with old cmake

15293. By Jabiertxof

Add triangle knot.

15294. By Jabiertxof <jtx@jtx>

Improvements and fixes for buds pointed by suv on measure line LPE

15295. By Jabiertxof <jtx@jtx>

Fix a typo

15296. By Jabiertxof

Enable node resizing in mesh tool.

15297. By Jabiertxof <jtx@jtx>

Fix names in measure line LPE

15298. By Jabiertxof

Highlight mesh handles when corner or handle selected.
Highlight mesh control lines when corner/handle hovered over.

15299. By FirasH

[Bug #1574561] Italian translation update.

15300. By Tavmjong Bah

Fix memory leak (incomplete clear).

15301. By Jabiertxof

Add dpiswitcher extension and option to scale legacy documents with it.

15302. By Jabiertxof <jtx@jtx>

Fixes for measure LPE and speed path based LPE operations

15303. By Jabiertxof <jtx@jtx>

Remove obsolete comment

15304. By Jabiertxof <jtx@jtx>

Fix measure LPE to fit future extra objects based LPE

15305. By Jabiertxof

fix bug #1644621 on show handles

15306. By Jabiertxof

fix bug #1644621 on show handles. Fix start knot on closed paths

15307. By Tavmjong Bah

Add option to save a backup when updating file for dpi change.

15308. By Jabiertxof

Fix for reopened bug #1643408

15309. By Jabiertxof

Fix for reopened bug #1643408- minor typyng

15310. By Jabiertxof <jtx@jtx>

Improve measure line to allow similar LPE

15311. By Tavmjong Bah

Correct error messages.

15312. By Tavmjong Bah

Improve working of Type (Smoothing) menu.

15313. By Jabiertxof <jtx@jtx>

'upport' changes to LPE's rotate copies and mirrot symmetry

15314. By Tavmjong Bah

Don't add fortify source flag for debug builds. Avoids tons of warnings.

15315. By Tavmjong Bah

Add button to access outer text style ('font-size', 'line-height'). These determine the minimum line spacing.

15316. By Tavmjong Bah

Correct outer text style input for non-px based files.

15317. By Tavmjong Bah

Fix a bug where initially text has no fill and but has a stroke.

15318. By Jabiertxof <jtx@jtx>

Fix headers on LPE's

15319. By Tavmjong Bah

Fix line-height when converting between different units for flowed text.

15320. By Jabiertxof <jtx@jtx>

Apply suv patch to handle containers https://bugs.launchpad.net/inkscape/+bug/1389723/comments/95

15321. By Tavmjong Bah

Add option to unset 'line-height' (as well as determine where it is set).

15322. By Tavmjong Bah

Add missing 'pt' unit to test of legacy files.

15323. By Jabiertxof

Apply su_v patch to DPISwitcher: https://launchpadlibrarian.net/297886893/0000-fix-dpiswitcher-scaling-v1.diff

15324. By Tavmjong Bah

Add test internal scaling to account for DPI change.

15325. By Tavmjong Bah

Fixes for internal document scaling and add a second test option.

15326. By Tavmjong Bah

Fix crash from last commit due to bad preference path.

15327. By Tavmjong Bah

Save state of backup button.

15328. By Tavmjong Bah

Prevent crash when iterator invalidated after converting shape to path.

15329. By Tavmjong Bah

Fix bug where conical gradient drawn in wrong arc.

15330. By Jabiertxof <jtx@jtx>

Fix a bug on transforms in mirror symmetry

15331. By Jabiertxof <jtx@jtx>

Use Geom::Reflection instead custom method an copy rotate and mirror LPE

15332. By Jabiertxof <jtx@jtx>

Some coding style fixes

15333. By Jabiertxof

Remove 'desktop' usage on measure line LPE

15334. By Jabiertxof

Add translatable strings to trunk

15335. By Jabiertxof

Add update_helperpaths not member of nodetool class to easy call from outside

15336. By Jabiertxof

Remove unneeded static var from previous commit

15337. By Jabiertxof <jtx@jtx>

Remove some ocurrences of desktop in knot functions

15338. By Jabiertxof <jtx@jtx>

Fix strings in mirror symetry and copy rotate LPE

15339. By Jabiertxof <jtx@jtx>

Remove string from tip

15340. By Jabiertxof

Fix undo incosistences in mirror LPE and in copy rotate LPE

15341. By Jabiertxof

Regenerate PO files for new translations

15342. By Yuri Chornoivan

Translation update for Ukranian

15343. By Yuri Chornoivan

[Bug #1407331] Ukrainian translation update for 0.92.

15344. By Jabiertxof <jtx@jtx>

Remove more SPDesktop from LPE's

15345. By Jabiertxof <jtx@jtx>

Add string translatable pointed by Maren

15346. By Jabiertxof <jtx@jtx>

Update po and pot

15347. By Jabiertxof

Update pot files. for extrange thing intltool generate output as untitled.pot instesad inkscape.pot not sure how to fix

15348. By Jabiertxof

Update .pot file generated with cmake and the resulting po. add info to update_po_files.sh added by Mc

15349. By Yale Zhang

q

Revision history for this message
Jabiertxof (jabiertxof) wrote :

Hi Yale, could you fix some coding style issues? I tried to do a review, but it is complex code and my skills are very limited in these things. What about having it coexist with the current code, with a preferences checkbox to enable it?
I've left some inline comments on the diff.

Revision history for this message
Jabiertxof (jabiertxof) wrote :

The diff applies cleanly, but gives compile errors: https://inkscape.org/en/paste/10710/

Revision history for this message
Yale Zhang (simdgenius) wrote :

Jabier, thanks for reviewing. The compile error is because
_mm_cvtsi64_m64() isn't supported on i686. I can fix it and replace
as much of the SSE intrinsics code as possible with GCC vector extensions
over the next few days.

For now, can you test on x86-64?

-yale
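
(For reference, one way the i686 problem could be handled; a sketch, not necessarily how the branch will fix it. GCC's mmintrin.h only declares _mm_cvtsi64_m64 when compiling for x86_64, so the 64-bit-integer-to-__m64 conversion can be replaced by a plain 8-byte load.)

#include <mmintrin.h>
#include <cstring>

// Load 8 bytes into an __m64 without _mm_cvtsi64_m64, which is unavailable on i686.
static inline __m64 load_m64(const void *p)
{
    __m64 v;
    std::memcpy(&v, p, sizeof v);   // typically compiles to a single movq on both i686 and x86_64
    return v;
}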

Revision history for this message
Jabiertxof (jabiertxof) wrote :

Hi Yale, sorry for the delay. I was able to compile it this weekend at home on 64-bit Debian.
In terms of usage I only did a quick try, because I needed to revert the changes to fix a bug. In that brief experience blurs render very fast; some portions sometimes don't render, but that's not your problem, it's usual in trunk. I confirm that at high zoom levels it is still very slow.

Thanks for your great work; ping me if you want any help or a specific test.

Revision history for this message
Yale Zhang (simdgenius) wrote :

Appreciate your testing. Anything for me to do before it can be accepted?

On my side, I still need to do 2 things:

1. rewrite as many of the SSE intrinsic functions as possible with GCC vector extensions
2. change the way vector remainders are handled (see the sketch below) - currently the code loads
past the end of arrays, but never writes past the end. Even so, this
can crash, which I've seen a few times.
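
(A possible shape for item 2, sketched under the assumption that intermediate rows can be allocated with a padded pitch, much like the branch's AlignedImage helper already does; the names below are illustrative.)

#include <cstdlib>
#include <cstddef>

// Pad each row's pitch up to the SIMD width so whole-vector loads at the end of
// a row stay inside the allocation: no out-of-bounds reads for ASan/Valgrind to
// flag, and no crash risk at page boundaries. (Windows would need _aligned_malloc.)
struct PaddedRows
{
    float *buffer = nullptr;
    size_t pitchBytes = 0;          // bytes per row, a multiple of the vector width

    void resize(size_t width, size_t height, size_t vectorBytes = 32 /* AVX */)
    {
        pitchBytes = (width * sizeof(float) + vectorBytes - 1) & ~(vectorBytes - 1);
        buffer = static_cast<float *>(aligned_alloc(vectorBytes, pitchBytes * height));
    }
    float *row(size_t y)
    {
        return reinterpret_cast<float *>(reinterpret_cast<char *>(buffer) + y * pitchBytes);
    }
};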

"some portions not render sometimes but not your problem Is usual in trunk"
Yes, I've noticed it too. It happens when the filter quality isn't set
to maximum.

Revision history for this message
Jabiertxof (jabiertxof) wrote :

Hi Yale.
Also, you need to handle 32-bit platforms, and re-read Tav's message to the list about adding more comments to your code; he is a much more advanced dev than I am.

Also, "some portions sometimes don't render, but that's not your problem, it's usual in trunk" is only my opinion; maybe I'm wrong and it could be caused by your code, so please make sure. My intention with that comment was to note that this isn't the only time blank areas appear in rendering, but your comment that it disappears at the highest quality level makes me want to recheck it properly.

Certainly I'm not the best person to review your code, and if I approved it I would be doing so in the dark, which is not good. In my opinion you need to reply to Tav's message and try to get approval from him.

Cheers and thanks for your hard work, Jabier.

Revision history for this message
Jabiertxof (jabiertxof) wrote :

Also, you could drop a line to the list to check whether 0.93 will support 32-bit OSes. I think XP is dropped, but there are others.

Revision history for this message
Mc (mc...) wrote :

I had a look at the code and asked a friend more familiar than me with those low-level calls, and it looks like it's doing sane things.

-> If it can compile on {linux,windows,mac}, {32bit,64bit} with {gcc,clang} (and maybe icc, visualstudio), and doesn't use features that will be removed in future C++ or compiler versions, I'd gladly take a crash course in whatever those arcane "_mm_packus_epi32" (I took one at random) and such actually do^^

Actually, my main fear about this merge is that if you ever leave and something ever goes wrong in that part of the code, there may be no one knowing enough about how this works to actually fix stuff...

Marking as "Abstain", but consider it an "Approve" iff the "if it can compile on all platforms/archs/compilers" point above is ok.

review: Abstain
Revision history for this message
Yale Zhang (simdgenius) wrote :

"If it can compile on {linux,windows,mac}, {32bit,64bit} with {gcc,clang} (and maybe icc, visualstudio)."
How serious are we about supporting Visual Studio? If it's to be supported (possible now since we're using CMake), then I can't rewrite the intrinsics code with GCC vector extensions, since those are only supported on GCC and clang. In theory you could make a C++ class to abstract the vectors, like the Vc library that Krita uses, but I think that's not worth the effort, and you can't expect to abstract all vector operations, especially the ones that operate horizontally across vector lanes (e.g. shuffles).

"there may be no one knowing enough about how this works to actually
fix stuff..."
Exactly. That's why I want to rewrite as many of the _mm_ intrinsics as possible with GCC vector extensions.
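
(As a concrete illustration of the point about horizontal operations; a sketch, not code from the branch. Element-wise math is trivially portable with vector extensions, but lane-crossing reductions still end up tied to a shuffle primitive or an intrinsic, and the best instruction choice differs between SSE, AVX, and AVX2.)

typedef float v4sf __attribute__((vector_size(16)));
typedef int   v4si __attribute__((vector_size(16)));

// Element-wise: portable, the compiler picks the encodings.
static inline v4sf axpy(v4sf a, v4sf x, v4sf y) { return a * x + y; }

// Lane-crossing: a horizontal sum needs explicit shuffles. GCC spells this
// __builtin_shuffle; clang uses __builtin_shufflevector instead.
static inline float hsum(v4sf v)
{
    const v4si swapPairs = {2, 3, 0, 1}, swapAdjacent = {1, 0, 3, 2};
    v4sf t = v + __builtin_shuffle(v, swapPairs);   // [a+c, b+d, a+c, b+d]
    t = t + __builtin_shuffle(t, swapAdjacent);     // every lane = a+b+c+d
    return t[0];
}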

Revision history for this message
Mc (mc...) wrote :

gcc and clang are sufficient, I think.

Unmerged revisions

15349. By Yale Zhang

q

Preview Diff

1=== added file 'src/display/SimpleImage.h'
2--- src/display/SimpleImage.h 1970-01-01 00:00:00 +0000
3+++ src/display/SimpleImage.h 2016-12-22 06:18:34 +0000
4@@ -0,0 +1,80 @@
5+#ifndef SIMPLE_IMAGE_H
6+#define SIMPLE_IMAGE_H
7+
8+#if _MSC_VER
9+ #ifdef _M_IX86
10+ typedef int ssize_t;
11+ #else
12+ typedef __int64 ssize_t;
13+ #endif
14+#else
15+ #include <stddef.h>
16+#endif
17+
18+// a minimal image representation that allows 2D indexing with [][] and that's completely reusable
19+template <typename AnyType>
20+class SimpleImage
21+{
22+public:
23+ SimpleImage()
24+ {
25+ }
26+ SimpleImage(AnyType *b, ssize_t p)
27+ {
28+ buffer = b;
29+ pitch = p;
30+ }
31+ AnyType *operator[](ssize_t y)
32+ {
33+ return (AnyType *)((uint8_t *)buffer + y * pitch);
34+ }
35+ SimpleImage<AnyType> SubImage(ssize_t x, ssize_t y)
36+ {
37+ return SimpleImage<AnyType>(&(*this)[y][x], pitch);
38+ }
39+ AnyType *buffer;
40+ ssize_t pitch;
41+};
42+
43+template <typename IntType>
44+IntType RoundDown(IntType a, IntType b)
45+{
46+ return (a / b) * b;
47+}
48+
49+template <typename IntType>
50+IntType RoundUp(IntType a, IntType b)
51+{
52+ return RoundDown(a + b - 1, b);
53+}
54+
55+#ifdef _WIN32
56+ #define aligned_alloc(a, s) _aligned_malloc(s, a)
57+ #define aligned_free(x) _aligned_free(x)
58+#else
59+ #define aligned_free(x) free(x)
60+#endif
61+
62+template <typename AnyType, ssize_t alignment>
63+class AlignedImage : public SimpleImage<AnyType>
64+{
65+public:
66+ AlignedImage()
67+ {
68+ this->buffer = NULL;
69+ }
70+ void Resize(int width, int height)
71+ {
72+ if (this->buffer != NULL)
73+ aligned_free(this->buffer);
74+ this->pitch = RoundUp(ssize_t(width * sizeof(AnyType)), alignment);
75+ this->buffer = (AnyType *)aligned_alloc(alignment, this->pitch * height);
76+ }
77+ ~AlignedImage()
78+ {
79+ if (this->buffer != NULL)
80+ aligned_free(this->buffer);
81+ }
82+};
83+
84+#endif
85
86=== added file 'src/display/gaussian_blur_templates.h'
87--- src/display/gaussian_blur_templates.h 1970-01-01 00:00:00 +0000
88+++ src/display/gaussian_blur_templates.h 2016-12-22 06:18:34 +0000
89@@ -0,0 +1,4006 @@
90+#ifdef __SSE3__
91+ // not any faster, at least on Haswell?
92+ #define _mm_loadu_pd(p) _mm_castsi128_pd(_mm_lddqu_si128((__m128i *)(p)))
93+ #define _mm_loadu_ps(p) _mm_castsi128_ps(_mm_lddqu_si128((__m128i *)(p)))
94+ #define _mm_loadu_si128(p) _mm_lddqu_si128(p)
95+
96+ #define _mm256_loadu_pd(p) _mm256_castsi256_pd(_mm256_lddqu_si256((__m256i *)(p)))
97+ #define _mm256_loadu_ps(p) _mm256_castsi256_ps(_mm256_lddqu_si256((__m256i *)(p)))
98+ #define _mm256_loadu_si128(p) _mm256_lddqu_si256(p)
99+#else
100+ #undef _mm_loadu_pd
101+ #undef _mm_loadu_ps
102+ #undef _mm_loadu_si128
103+ #undef _mm256_loadu_pd
104+ #undef _mm256_loadu_ps
105+ #undef _mm256_loadu_si128
106+#endif
107+
108+template <typename AnyType>
109+struct MyTraits
110+{
111+};
112+
113+template <>
114+struct MyTraits<float>
115+{
116+#ifdef __AVX__
117+ typedef __m256 SIMDtype;
118+#else
119+ typedef __m128 SIMDtype;
120+#endif
121+};
122+
123+template <>
124+struct MyTraits<int16_t>
125+{
126+#ifdef __AVX2__
127+ typedef __m256i SIMDtype;
128+#else
129+ typedef __m128i SIMDtype;
130+#endif
131+};
132+
133+template <>
134+struct MyTraits<double>
135+{
136+#ifdef __AVX__
137+ typedef __m256d SIMDtype;
138+#else
139+ typedef __m128d SIMDtype;
140+#endif
141+};
142+
143+#if defined(__AVX__) && defined(__GNUC__)
144+FORCE_INLINE __m256 _mm256_setr_m128(__m128 lo, __m128 hi)
145+{
146+ return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
147+}
148+
149+FORCE_INLINE __m256i _mm256_setr_m128i(__m128i lo, __m128i hi)
150+{
151+ return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
152+}
153+
154+FORCE_INLINE __m256d _mm256_setr_m128d(__m128d lo, __m128d hi)
155+{
156+ return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
157+}
158+#endif
159+
160+#ifdef __FMA__
161+//#pragma GCC push_options
162+//#pragma GCC target("fma")
163+FORCE_INLINE __m128 MultiplyAdd(__m128 a, __m128 b, __m128 c)
164+{
165+ return _mm_fmadd_ps(a, b, c);
166+}
167+
168+FORCE_INLINE __m256 MultiplyAdd(__m256 a, __m256 b, __m256 c)
169+{
170+ return _mm256_fmadd_ps(a, b, c);
171+}
172+
173+FORCE_INLINE __m256d MultiplyAdd(__m256d a, __m256d b, __m256d c)
174+{
175+ return _mm256_fmadd_pd(a, b, c);
176+}
177+//#pragma GCC pop_options
178+#endif
179+
180+#ifndef __GNUC__
181+FORCE_INLINE __m128d operator + (__m128d a, __m128d b)
182+{
183+ return _mm_add_pd(a, b);
184+}
185+
186+FORCE_INLINE __m128d operator - (__m128d a, __m128d b)
187+{
188+ return _mm_sub_pd(a, b);
189+}
190+
191+FORCE_INLINE __m128d operator * (__m128d a, __m128d b)
192+{
193+ return _mm_mul_pd(a, b);
194+}
195+
196+FORCE_INLINE __m256d operator + (__m256d a, __m256d b)
197+{
198+ return _mm256_add_pd(a, b);
199+}
200+
201+FORCE_INLINE __m256d operator - (__m256d a, __m256d b)
202+{
203+ return _mm256_sub_pd(a, b);
204+}
205+
206+FORCE_INLINE __m256d operator * (__m256d a, __m256d b)
207+{
208+ return _mm256_mul_pd(a, b);
209+}
210+
211+FORCE_INLINE __m128 operator + (__m128 a, __m128 b)
212+{
213+ return _mm_add_ps(a, b);
214+}
215+
216+FORCE_INLINE __m128 operator - (__m128 a, __m128 b)
217+{
218+ return _mm_sub_ps(a, b);
219+}
220+
221+FORCE_INLINE __m128 operator * (__m128 a, __m128 b)
222+{
223+ return _mm_mul_ps(a, b);
224+}
225+
226+FORCE_INLINE __m256 operator + (__m256 a, __m256 b)
227+{
228+ return _mm256_add_ps(a, b);
229+}
230+
231+FORCE_INLINE __m256 operator - (__m256 a, __m256 b)
232+{
233+ return _mm256_sub_ps(a, b);
234+}
235+
236+FORCE_INLINE __m256 operator * (__m256 a, __m256 b)
237+{
238+ return _mm256_mul_ps(a, b);
239+}
240+#endif
241+
242+#ifdef __AVX__
243+FORCE_INLINE float ExtractElement0(__m256 x)
244+{
245+ return _mm_cvtss_f32(_mm256_castps256_ps128(x));
246+}
247+
248+FORCE_INLINE double ExtractElement0(__m256d x)
249+{
250+ return _mm_cvtsd_f64(_mm256_castpd256_pd128(x));
251+}
252+#endif
253+
254+FORCE_INLINE float ExtractElement0(__m128 x)
255+{
256+ return _mm_cvtss_f32(x);
257+}
258+
259+FORCE_INLINE double ExtractElement0(__m128d x)
260+{
261+ return _mm_cvtsd_f64(x);
262+}
263+
264+template<int SIZE>
265+static void calcTriggsSdikaInitialization(double const M[N*N], float uold[N][SIZE], float const uplus[SIZE], float const vplus[SIZE], float const alpha, float vold[N][SIZE])
266+{
267+ __m128 v4f_alpha = _mm_set1_ps(alpha);
268+ ssize_t c;
269+ for (c = 0; c + 4 <= SIZE; c += 4)
270+ {
271+ __m128 uminp[N];
272+ for(ssize_t i=0; i<N; i++)
273+ uminp[i] = _mm_loadu_ps(&uold[i][c]) - _mm_loadu_ps(&uplus[c]);
274+
275+ __m128 v4f_vplus = _mm_loadu_ps(&vplus[c]);
276+
277+ for(ssize_t i=0; i<N; i++)
278+ {
279+ __m128 voldf = _mm_setzero_ps();
280+ for(ssize_t j=0; j<N; j++)
281+ {
282+ voldf = voldf + uminp[j] * _mm_set1_ps(M[i*N+j]);
283+ }
284+ // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled)
285+ // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient
286+ // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient.
287+ _mm_storeu_ps(&vold[i][c], voldf * v4f_alpha + v4f_vplus);
288+ }
289+ }
290+ while (c < SIZE)
291+ {
292+ double uminp[N];
293+ for(ssize_t i=0; i<N; i++) uminp[i] = uold[i][c] - uplus[c];
294+ for(ssize_t i=0; i<N; i++) {
295+ double voldf = 0;
296+ for(ssize_t j=0; j<N; j++) {
297+ voldf += uminp[j]*M[i*N+j];
298+ }
299+ // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled)
300+ // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient
301+ // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient.
302+ vold[i][c] = voldf*alpha;
303+ vold[i][c] += vplus[c];
304+ }
305+ ++c;
306+ }
307+}
308+
309+template<int SIZE>
310+static void calcTriggsSdikaInitialization(double const M[N*N], double uold[N][SIZE], double const uplus[SIZE], double const vplus[SIZE], double const alpha, double vold[N][SIZE])
311+{
312+ __m128d v2f_alpha = _mm_set1_pd(alpha);
313+ ssize_t c;
314+ for (c = 0; c <= SIZE - 2; c += 2)
315+ {
316+ __m128d uminp[N];
317+ for(ssize_t i=0; i<N; i++)
318+ uminp[i] = _mm_loadu_pd(&uold[i][c]) - _mm_loadu_pd(&uplus[c]);
319+
320+ __m128d v2f_vplus = _mm_loadu_pd(&vplus[c]);
321+
322+ for(ssize_t i=0; i<N; i++)
323+ {
324+ __m128d voldf = _mm_setzero_pd();
325+ for(ssize_t j=0; j<N; j++)
326+ {
327+ voldf = voldf + uminp[j] * _mm_load1_pd(&M[i*N+j]);
328+ }
329+ // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled)
330+ // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient
331+ // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient.
332+ _mm_storeu_pd(&vold[i][c], voldf * v2f_alpha + v2f_vplus);
333+ }
334+ }
335+ while (c < SIZE)
336+ {
337+ double uminp[N];
338+ for(ssize_t i=0; i<N; i++) uminp[i] = uold[i][c] - uplus[c];
339+ for(ssize_t i=0; i<N; i++) {
340+ double voldf = 0;
341+ for(ssize_t j=0; j<N; j++) {
342+ voldf += uminp[j]*M[i*N+j];
343+ }
344+ // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled)
345+ // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient
346+ // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient.
347+ vold[i][c] = voldf*alpha;
348+ vold[i][c] += vplus[c];
349+ }
350+ ++c;
351+ }
352+}
353+
354+FORCE_INLINE __m128i PartialVectorMask(ssize_t n)
355+{
356+ return _mm_loadu_si128((__m128i *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]);
357+}
358+
359+#ifdef __AVX__
360+FORCE_INLINE __m256i PartialVectorMask32(ssize_t n)
361+{
362+ return _mm256_loadu_si256((__m256i *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]);
363+}
364+#endif
365+
366+#if !defined(_WIN32) && !defined(_MSC_VER)
367+ // using _mm_maskmove_si64() is preferable to _mm_maskmoveu_si128(), but for some reason on Windows, it causes memory corruption
368+ // could it be due to mixing x87 and MMX?
369+ #define CAN_USE_MMX
370+#endif
371+
372+#ifdef CAN_USE_MMX
373+// return __m64 so that it can be used by _mm_movemask_si64()
374+FORCE_INLINE __m64 PartialVectorMask8(ssize_t n)
375+{
376+ return _mm_cvtsi64_m64(*(int64_t *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]);
377+}
378+#else
379+FORCE_INLINE __m128i PartialVectorMask8(ssize_t n)
380+{
381+ return _mm_loadl_epi64((__m128i *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]);
382+}
383+#endif
384+
385+#ifdef __AVX__
386+FORCE_INLINE __m256d LoadDoubles(__m256d &out, double *x)
387+{
388+ return out = _mm256_loadu_pd(x);
389+}
390+
391+FORCE_INLINE __m256d LoadDoubles(__m256d &out, float *x)
392+{
393+ return out = _mm256_cvtps_pd(_mm_loadu_ps(x));
394+}
395+
396+FORCE_INLINE __m256d LoadDoubles(__m256d &out, uint8_t *x)
397+{
398+ return out = _mm256_cvtepi32_pd(_mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(int32_t *)x)));
399+}
400+
401+FORCE_INLINE __m256d LoadDoubles(__m256d &out, uint16_t *x)
402+{
403+ return out = _mm256_cvtepi32_pd(_mm_cvtepu16_epi32(_mm_loadl_epi64((__m128i *)x)));
404+}
405+
406+FORCE_INLINE __m256 LoadFloats(__m256 &out, float *x) // seriously? compiler needs to be told to inline this when PIC on?
407+{
408+ return out = _mm256_loadu_ps(x);
409+}
410+
411+FORCE_INLINE __m256 LoadFloats(__m256 &out, uint8_t *x)
412+{
413+ __m128i temp = _mm_loadl_epi64((__m128i *)x);
414+#ifdef __AVX2__
415+ out = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(temp));
416+#else
417+ out = _mm256_cvtepi32_ps(_mm256_setr_m128i(_mm_cvtepu8_epi32(temp), _mm_cvtepu8_epi32(_mm_shuffle_epi32(temp, _MM_SHUFFLE(0, 0, 0, 1)))));
418+#endif
419+ return out;
420+}
421+
422+FORCE_INLINE __m256 LoadFloats(__m256 &out, uint16_t *x)
423+{
424+ __m128i temp = _mm_loadu_si128((__m128i *)x);
425+ __m256i i32;
426+#ifdef __AVX2__
427+ i32 = _mm256_cvtepu16_epi32(temp);
428+#else
429+ __m128i zero = _mm_setzero_si128();
430+ i32 = _mm256_setr_m128i(_mm_unpacklo_epi16(temp, zero), _mm_unpackhi_epi16(temp, zero));
431+#endif
432+ return out = _mm256_cvtepi32_ps(i32);
433+}
434+
435+template <bool partial = false> // no, this parameter isn't redundant - without it, there will be a redundant n == 4 check when partial = 0
436+FORCE_INLINE void StoreDoubles(double *out, __m256d x, ssize_t n = 4)
437+{
438+ if (partial)
439+ _mm256_maskstore_pd(out, PartialVectorMask32(n * sizeof(double)), x);
440+ else
441+ _mm256_storeu_pd(out, x);
442+}
443+
444+template <bool partial = false>
445+FORCE_INLINE void StoreDoubles(float *out, __m256d x, ssize_t n = 4)
446+{
447+ __m128 f32 = _mm256_cvtpd_ps(x);
448+ if (partial)
449+ _mm_maskstore_ps(out, PartialVectorMask(n * sizeof(float)), f32);
450+ else
451+ _mm_storeu_ps(out, f32);
452+}
453+
454+template <bool partial = false>
455+FORCE_INLINE void StoreDoubles(uint16_t *out, __m256d x, ssize_t n = 4)
456+{
457+ __m128i i32 = _mm256_cvtpd_epi32(x),
458+ u16 = _mm_packus_epi32(i32, i32);
459+ if (partial)
460+ {
461+#ifdef CAN_USE_MMX
462+ _mm_maskmove_si64(_mm_movepi64_pi64(u16), PartialVectorMask8(n * sizeof(int16_t)), (char *)out);
463+#else
464+ _mm_maskmoveu_si128(u16, PartialVectorMask8(n * sizeof(int16_t)), (char *)out);
465+#endif
466+ }
467+ else
468+ _mm_storel_epi64((__m128i *)out, u16);
469+}
470+
471+template <bool partial = false>
472+FORCE_INLINE void StoreDoubles(uint8_t *out, __m256d x, ssize_t n = 4)
473+{
474+ __m128i i32 = _mm256_cvtpd_epi32(x),
475+ u16 = _mm_packus_epi32(i32, i32),
476+ u8 = _mm_packus_epi16(u16, u16);
477+ if (partial)
478+ {
479+#ifdef CAN_USE_MMX
480+ _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out);
481+#else
482+ _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out);
483+#endif
484+ }
485+ else
486+ *(int32_t *)out = _mm_cvtsi128_si32(u8);
487+}
488+
489+FORCE_INLINE void StoreDoubles(uint8_t *out, __m256d x)
490+{
491+ __m128i vInt = _mm_cvtps_epi32(_mm256_cvtpd_ps(x));
492+ *(int32_t *)out = _mm_cvtsi128_si32(_mm_packus_epi16(_mm_packus_epi32(vInt, vInt), vInt));
493+}
494+
495+template <bool partial = false> // no, this parameter isn't redundant - without it, there will be a redundant n == 8 check when partial = 0
496+FORCE_INLINE void StoreFloats(float *out, __m256 x, ssize_t n = 8)
497+{
498+ if (partial)
499+ _mm256_maskstore_ps(out, PartialVectorMask32(n * sizeof(float)), x);
500+ else
501+ _mm256_storeu_ps(out, x);
502+}
503+
504+template <bool partial = false>
505+FORCE_INLINE void StoreFloats(uint16_t *out, __m256 x, ssize_t n = 8)
506+{
507+ __m256i i32 = _mm256_cvtps_epi32(x);
508+ __m128i u16 = _mm_packus_epi32(_mm256_castsi256_si128(i32), _mm256_extractf128_si256(i32, 1));
509+ if (partial)
510+ _mm_maskmoveu_si128(u16, PartialVectorMask(n * sizeof(int16_t)), (char *)out);
511+ else
512+ _mm_storeu_si128((__m128i *)out, u16);
513+}
514+
515+template <bool partial = false>
516+FORCE_INLINE void StoreFloats(uint8_t *out, __m256 x, ssize_t n = 8)
517+{
518+ __m256i i32 = _mm256_cvtps_epi32(x);
519+ __m128i i32Hi = _mm256_extractf128_si256(i32, 1),
520+ u16 = _mm_packus_epi32(_mm256_castsi256_si128(i32), i32Hi),
521+ u8 = _mm_packus_epi16(u16, u16);
522+ if (partial)
523+ {
524+#ifdef CAN_USE_MMX
525+ _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out);
526+#else
527+ _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out);
528+#endif
529+ }
530+ else
531+ _mm_storel_epi64((__m128i *)out, u8);
532+}
533+#endif
534+
535+#ifdef __AVX__
536+FORCE_INLINE __m256 BroadcastSIMD(__m256 &out, float x)
537+{
538+ return out = _mm256_set1_ps(x);
539+}
540+
541+FORCE_INLINE __m256d BroadcastSIMD(__m256d &out, double x)
542+{
543+ return out = _mm256_set1_pd(x);
544+}
545+
546+FORCE_INLINE __m256i BroadcastSIMD(__m256i &out, int16_t x)
547+{
548+ return out = _mm256_set1_epi16(x);
549+}
550+#endif
551+
552+FORCE_INLINE __m128 BroadcastSIMD(__m128 &out, float x)
553+{
554+ return out = _mm_set1_ps(x);
555+}
556+
557+FORCE_INLINE __m128d BroadcastSIMD(__m128d &out, double x)
558+{
559+ return out = _mm_set1_pd(x);
560+}
561+
562+FORCE_INLINE __m128i BroadcastSIMD(__m128i &out, int16_t x)
563+{
564+ return out = _mm_set1_epi16(x);
565+}
566+
567+
568+FORCE_INLINE __m128 LoadFloats(__m128 &out, float *x)
569+{
570+ return out = _mm_loadu_ps(x);
571+}
572+
573+FORCE_INLINE __m128 LoadFloats(__m128 &out, uint8_t *x)
574+{
575+ __m128i u8 = _mm_cvtsi32_si128(*(int32_t *)x),
576+ i32;
577+#ifdef __SSE4_1__
578+ i32 = _mm_cvtepu8_epi32(u8);
579+#else
580+ __m128i zero = _mm_setzero_si128();
581+ i32 = _mm_unpacklo_epi16(_mm_unpacklo_epi8(u8, zero), zero);
582+#endif
583+ return out = _mm_cvtepi32_ps(i32);
584+}
585+
586+FORCE_INLINE __m128 LoadFloats(__m128 &out, uint16_t *x)
587+{
588+ __m128i u16 = _mm_loadl_epi64((__m128i *)x),
589+ i32;
590+#ifdef __SSE4_1__
591+ i32 = _mm_cvtepu16_epi32(u16);
592+#else
593+ __m128i zero = _mm_setzero_si128();
594+ i32 = _mm_unpacklo_epi16(u16, zero);
595+#endif
596+ return out = _mm_cvtepi32_ps(i32);
597+}
598+
599+
600+template <bool partial = false> // no, this parameter isn't redundant - without it, there will be a redundant n == 4 check when partial = 0
601+FORCE_INLINE void StoreFloats(float *out, __m128 x, ssize_t n = 4)
602+{
603+ if (partial)
604+ {
605+#ifdef __AVX__
606+ _mm_maskstore_ps(out, PartialVectorMask(n * sizeof(float)), x);
607+#else
608+ _mm_maskmoveu_si128(_mm_castps_si128(x), PartialVectorMask(n * sizeof(float)), (char *)out);
609+#endif
610+ }
611+ else
612+ {
613+ _mm_storeu_ps(out, x);
614+ }
615+}
616+
617+template <bool partial = false>
618+FORCE_INLINE void StoreFloats(uint16_t *out, __m128 x, ssize_t n = 4)
619+{
620+ __m128i i32 = _mm_cvtps_epi32(x),
621+#ifdef __SSE4_1__
622+ u16 = _mm_packus_epi32(i32, i32);
623+#else
624+ u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()); // can get away with treating as int16 for now
625+#endif
626+ if (partial)
627+ {
628+#ifdef CAN_USE_MMX
629+ _mm_maskmove_si64(_mm_movepi64_pi64(u16), PartialVectorMask8(n * sizeof(int16_t)), (char *)out);
630+#else
631+ _mm_maskmoveu_si128(u16, PartialVectorMask(n * sizeof(int16_t)), (char *)out);
632+#endif
633+ }
634+ else
635+ _mm_storel_epi64((__m128i *)out, u16);
636+}
637+
638+template <bool partial = false>
639+FORCE_INLINE void StoreFloats(uint8_t *out, __m128 x, ssize_t n = 4)
640+{
641+ __m128i i32 = _mm_cvtps_epi32(x),
642+ u8 = _mm_packus_epi16(_mm_packs_epi32(i32, i32), i32); // should use packus_epi32, but that's only in SSE4
643+ if (partial)
644+ {
645+#ifdef CAN_USE_MMX
646+ _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out);
647+#else
648+ _mm_maskmoveu_si128(u8, PartialVectorMask(n), (char *)out);
649+#endif
650+ }
651+ else
652+ *(int32_t *)out = _mm_cvtsi128_si32(u8);
653+}
654+
655+
656+FORCE_INLINE __m128d LoadDoubles(__m128d &out, double *x)
657+{
658+ return out = _mm_loadu_pd(x);
659+}
660+
661+FORCE_INLINE __m128d LoadDoubles(__m128d &out, uint8_t *x)
662+{
663+ __m128i u8 = _mm_cvtsi32_si128(*(uint16_t *)x),
664+ i32;
665+#ifdef __SSE4_1__
666+ i32 = _mm_cvtepu8_epi32(u8);
667+#else
668+ __m128i zero = _mm_setzero_si128();
669+ i32 = _mm_unpacklo_epi16(_mm_unpacklo_epi8(u8, zero), zero);
670+#endif
671+ return out = _mm_cvtepi32_pd(i32);
672+}
673+
674+FORCE_INLINE __m128d LoadDoubles(__m128d &out, uint16_t *x)
675+{
676+ __m128i u16 = _mm_cvtsi32_si128(*(uint32_t *)x),
677+ i32;
678+#ifdef __SSE4_1__
679+ i32 = _mm_cvtepu16_epi32(u16);
680+#else
681+ __m128i zero = _mm_setzero_si128();
682+ i32 = _mm_unpacklo_epi16(u16, zero);
683+#endif
684+ return out = _mm_cvtepi32_pd(i32);
685+}
686+
687+template <bool partial = false>
688+FORCE_INLINE void StoreDoubles(double *out, __m128d x, ssize_t n = 2)
689+{
690+ if (partial)
691+ {
692+ #ifdef __AVX__
693+ _mm_maskstore_pd(out, PartialVectorMask(n * sizeof(double)), x);
694+ #else
695+ _mm_maskmoveu_si128(_mm_castpd_si128(x), PartialVectorMask(n * sizeof(double)), (char *)out);
696+ #endif
697+ }
698+ else
699+ {
700+ _mm_storeu_pd(out, x);
701+ }
702+}
703+
704+template <bool partial = false>
705+FORCE_INLINE void StoreDoubles(float *out, __m128d x, ssize_t n = 2)
706+{
707+ __m128 f32 = _mm_cvtpd_ps(x);
708+ if (partial)
709+ {
710+ #ifdef CAN_USE_MMX
711+ _mm_maskmove_si64(_mm_movepi64_pi64(_mm_castps_si128(f32)), PartialVectorMask8(n * sizeof(float)), (char *)out);
712+ #else
713+ _mm_maskmoveu_si128(_mm_castps_si128(f32), PartialVectorMask8(n * sizeof(float)), (char *)out);
714+ #endif
715+ }
716+ else
717+ {
718+ _mm_storel_pi((__m64 *)out, f32);
719+ }
720+}
721+
722+template <bool partial = false>
723+FORCE_INLINE void StoreDoubles(uint16_t *out, __m128d x, ssize_t n = 2)
724+{
725+ __m128i i32 = _mm_cvtpd_epi32(x),
726+#ifdef __SSE4_1__
727+ u16 = _mm_packus_epi32(i32, i32);
728+#else
729+ u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()); // can get away with using i16 for now
730+#endif
731+ if (partial)
732+ {
733+ #ifdef CAN_USE_MMX
734+ _mm_maskmove_si64(_mm_movepi64_pi64(u16), PartialVectorMask8(n * sizeof(int16_t)), (char *)out);
735+ #else
736+ _mm_maskmoveu_si128(u16, PartialVectorMask8(n * sizeof(int16_t)), (char *)out);
737+ #endif
738+ }
739+ else
740+ {
741+ *(uint32_t *)out = _mm_cvtsi128_si32(u16);
742+ }
743+}
744+
745+template <bool partial = false>
746+FORCE_INLINE void StoreDoubles(uint8_t *out, __m128d x, ssize_t n = 2)
747+{
748+ __m128i i32 = _mm_cvtpd_epi32(x),
749+#ifdef __SSE4_1__
750+ u16 = _mm_packus_epi32(i32, i32),
751+#else
752+ u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()), // can get away with using i16 for now
753+#endif
754+ u8 = _mm_packus_epi16(u16, u16);
755+
756+ if (partial)
757+ {
758+ #ifdef CAN_USE_MMX
759+ _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out);
760+ #else
761+ _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out);
762+ #endif
763+ }
764+ else
765+ {
766+ *(uint16_t *)out = _mm_cvtsi128_si32(u8);
767+ }
768+}
769+
770+#ifdef __AVX__
771+FORCE_INLINE __m256 Load4x2Floats(uint8_t *row0, uint8_t *row1)
772+{
773+ return _mm256_cvtepi32_ps(_mm256_setr_m128i(_mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(int32_t *)row0)),
774+ _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(int32_t *)row1))));
775+}
776+
777+FORCE_INLINE __m256 Load4x2Floats(uint16_t *row0, uint16_t *row1)
778+{
779+ return _mm256_cvtepi32_ps(_mm256_setr_m128i(_mm_cvtepu16_epi32(_mm_loadl_epi64((__m128i *)row0)),
780+ _mm_cvtepu16_epi32(_mm_loadl_epi64((__m128i *)row1))));
781+}
782+
783+FORCE_INLINE __m256 Load4x2Floats(float *row0, float *row1)
784+{
785+ return _mm256_setr_m128(_mm_loadu_ps(row0), _mm_loadu_ps(row1));
786+}
787+#endif
788+
789+FORCE_INLINE __m128i LoadAndScaleToInt16(__m128i &out, uint8_t *x)
790+{
791+ // convert from [0-255] to [0-16383]
792+ // leave 1 spare bit so that 2 values can be added without overflow for symmetric filters
793+ __m128i u8 = _mm_loadl_epi64((__m128i *)x),
794+ i16;
795+#ifdef __SSE4_1__
796+ i16 = _mm_cvtepu8_epi16(u8);
797+#else
798+ i16 = _mm_unpacklo_epi8(u8, _mm_setzero_si128());
799+#endif
800+ return out = _mm_slli_epi16(i16, 6);
801+}
802+
803+__m128i LoadAndScaleToInt16(__m128i &out, int16_t *x)
804+{
805+ return out = _mm_loadu_si128((__m128i *)x);
806+}
807+
808+#ifdef __AVX2__
809+
810+FORCE_INLINE __m256i LoadAndScaleToInt16(__m256i &out, uint8_t *x)
811+{
812+ // convert from [0-255] to [0-16383]
813+ // leave 1 spare bit so that 2 values can be added without overflow for symmetric filters
814+ return out = _mm256_slli_epi16(_mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i *)x)), 6);
815+}
816+
817+FORCE_INLINE __m256i LoadAndScaleToInt16(__m256i &out, int16_t *x)
818+{
819+ return out = _mm256_loadu_si256((__m256i *)x);
820+}
821+
822+#endif
823+
824+template <bool partial = false>
825+FORCE_INLINE void ScaleAndStoreInt16(uint8_t *out, __m128i x, ssize_t n = 8)
826+{
827+ __m128i i16 = _mm_srai_epi16(_mm_adds_epi16(x, _mm_set1_epi16(32)), 6),
828+ u8 = _mm_packus_epi16(i16, i16);
829+ if (partial)
830+ {
831+#ifdef CAN_USE_MMX
832+ _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out);
833+#else
834+ _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out);
835+#endif
836+ }
837+ else
838+ _mm_storel_epi64((__m128i *)out, u8);
839+}
840+
841+template <bool partial = false>
842+FORCE_INLINE void ScaleAndStoreInt16(int16_t *out, __m128i i16, ssize_t n = 8)
843+{
844+ if (partial)
845+ _mm_maskmoveu_si128(i16, PartialVectorMask(n * sizeof(int16_t)), (char *)out);
846+ else
847+ _mm_storeu_si128((__m128i *)out, i16);
848+}
849+
850+#ifdef __AVX2__
851+
852+template <bool partial = false>
853+FORCE_INLINE void ScaleAndStoreInt16(uint8_t *out, __m256i x, ssize_t n = 16)
854+{
855+ __m256i i16 = _mm256_srai_epi16(_mm256_adds_epi16(x, _mm256_set1_epi16(32)), 6);
856+ __m128i u8 = _mm256_castsi256_si128(_mm256_packus_epi16(i16, _mm256_permute2f128_si256(i16, i16, 1)));
857+ if (partial)
858+ _mm_maskmoveu_si128(u8, PartialVectorMask(n), (char *)out);
859+ else
860+ _mm_storeu_si128((__m128i *)out, u8);
861+}
862+
863+template <bool partial = false>
864+FORCE_INLINE void ScaleAndStoreInt16(int16_t *out, __m256i i16, ssize_t n = 16)
865+{
866+ if (partial)
867+ {
868+ _mm_maskmoveu_si128(_mm256_castsi256_si128(i16), PartialVectorMask(min(ssize_t(8), n) * sizeof(int16_t)), (char *)out);
869+ _mm_maskmoveu_si128(_mm256_extractf128_si256(i16, 1), PartialVectorMask(max(ssize_t(0), n - 8) * sizeof(int16_t)), (char *)&out[8]);
870+ }
871+ else
872+ _mm256_storeu_si256((__m256i *)out, i16);
873+}
874+#endif
875+
876+// selectors are doubles to avoid int-float domain transition
877+FORCE_INLINE __m128d Select(__m128d a, __m128d b, __m128d selectors)
878+{
879+#ifdef __SSE4_1__
880+ return _mm_blendv_pd(a, b, selectors);
881+#else
882+ return _mm_or_pd(_mm_andnot_pd(selectors, a), _mm_and_pd(selectors, b));
883+#endif
884+}
885+
886+// selectors are floats to avoid int-float domain transition
887+FORCE_INLINE __m128 Select(__m128 a, __m128 b, __m128 selectors)
888+{
889+#ifdef __SSE4_1__
890+ return _mm_blendv_ps(a, b, selectors);
891+#else
892+ return _mm_or_ps(_mm_andnot_ps(selectors, a), _mm_and_ps(selectors, b));
893+#endif
894+}
895+
896+// even these simple ops need to be redeclared for each SIMD architecture due to VEX and non-VEX encodings of SSE instructions
897+#ifdef __AVX__
898+FORCE_INLINE __m128 Cast256To128(__m256 v)
899+{
900+ return _mm256_castps256_ps128(v);
901+}
902+
903+FORCE_INLINE __m128d Cast256To128(__m256d v)
904+{
905+ return _mm256_castpd256_pd128(v);
906+}
907+FORCE_INLINE __m128i Cast256To128(__m256i v)
908+{
909+ return _mm256_castsi256_si128(v);
910+}
911+#endif
912+
913+FORCE_INLINE __m128 Cast256To128(__m128 v)
914+{
915+ return v;
916+}
917+
918+FORCE_INLINE __m128d Cast256To128(__m128d v)
919+{
920+ return v;
921+}
922+
923+FORCE_INLINE __m128i Cast256To128(__m128i v)
924+{
925+ return v;
926+}
927+
928+
929+// does 1D IIR convolution on multiple rows (height) of data
930+// IntermediateType must be float or double
931+template <bool transposeOut, bool isForwardPass, bool isBorder, int channels, typename OutType, typename InType, typename IntermediateType>
932+FORCE_INLINE void Convolve1DHorizontalRef(SimpleImage <OutType> out,
933+ SimpleImage <InType> in,
934+ IntermediateType *borderValues, // [y][color]
935+ ssize_t xStart, ssize_t xEnd, ssize_t width, ssize_t height,
936+ typename MyTraits<IntermediateType>::SIMDtype *vCoefficients, double M[N * N])
937+{
938+ ssize_t xStep = isForwardPass ? 1 : -1;
939+
940+ ssize_t y = 0;
941+ do
942+ {
943+ ssize_t c = 0;
944+ do
945+ {
946+ IntermediateType prevOut[N];
947+ ssize_t x = xStart;
948+ if (isBorder && !isForwardPass)
949+ {
950+ // xStart must be width - 1
951+ IntermediateType u[N + 1][1]; // u[0] = last forward filtered value, u[1] = 2nd last forward filtered value, ...
952+ for (ssize_t i = 0; i < N + 1; ++i)
953+ {
954+ u[i][0] = in[y][(xStart + i * xStep) * channels + c];
955+ }
956+ IntermediateType backwardsInitialState[N][1];
957+ calcTriggsSdikaInitialization<1>(M, u, &borderValues[y * channels + c], &borderValues[y * channels + c], ExtractElement0(vCoefficients[0]), backwardsInitialState);
958+ for (ssize_t i = 0; i < N; ++i)
959+ prevOut[i] = backwardsInitialState[i][0];
960+
961+ if (transposeOut)
962+ out[x][y * channels + c] = clip_round_cast<OutType, IntermediateType>(prevOut[0]);
963+ else
964+ out[y][x * channels + c] = clip_round_cast<OutType, IntermediateType>(prevOut[0]);
965+ x += xStep;
966+ if (x == xEnd)
967+ goto nextIteration; // do early check here so that we can still use do-while for forward pass
968+ }
969+ else if (isBorder && isForwardPass)
970+ {
971+ for (ssize_t i = 0; i < N; ++i)
972+ prevOut[i] = in[y][0 * channels + c];
973+ }
974+ else
975+ {
976+ for (ssize_t i = 0; i < N; ++i)
977+ {
978+ prevOut[i] = transposeOut ? out[xStart - (i + 1) * xStep][y * channels + c]
979+ : out[y][(xStart - (i + 1) * xStep) * channels + c];
980+ }
981+ }
982+
983+ do
984+ {
985+ IntermediateType sum = prevOut[0] * ExtractElement0(vCoefficients[1])
986+ + prevOut[1] * ExtractElement0(vCoefficients[2])
987+ + prevOut[2] * ExtractElement0(vCoefficients[3])
988+ + in[y][x * channels + c] * ExtractElement0(vCoefficients[0]); // add last for best accuracy since this terms tends to be the smallest
989+ if (transposeOut)
990+ out[x][y * channels + c] = clip_round_cast<OutType, IntermediateType>(sum);
991+ else
992+ out[y][x * channels + c] = clip_round_cast<OutType, IntermediateType>(sum);
993+ prevOut[2] = prevOut[1];
994+ prevOut[1] = prevOut[0];
995+ prevOut[0] = sum;
996+ x += xStep;
997+ } while (x != xEnd);
998+ ++c;
999+ } while (c < channels);
1000+ nextIteration:
1001+ ++y;
1002+ } while (y < height);
1003+}
1004+
1005+template <int channels, bool transposeOut, ssize_t xStep, int i0, int i1, int i2, typename OutType, typename InType>
1006+FORCE_INLINE void DoOneIIR(SimpleImage<OutType> out, SimpleImage<InType> in, __m256d &vSum, __m256d &vIn, ssize_t x, ssize_t y, __m256d vCoefficients[N + 1], __m256d prevOut[N])
1007+{
1008+ vSum = vIn * vCoefficients[0];
1009+ LoadDoubles(vIn, &in[y][(x + xStep) * channels]); // load data for next iteration early to hide latency (software pipelining)
1010+
1011+ // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this Gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
1012+#ifdef __FMA__
1013+ // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster
1014+ vSum = MultiplyAdd(prevOut[i2], vCoefficients[3], vSum) + MultiplyAdd(prevOut[i1], vCoefficients[2], prevOut[i0] * vCoefficients[1]);
1015+#else
1016+ vSum = prevOut[i0] * vCoefficients[1]
1017+ + prevOut[i1] * vCoefficients[2]
1018+ + prevOut[i2] * vCoefficients[3]
1019+ + vIn * vCoefficients[0];
1020+#endif
1021+ if (transposeOut)
1022+ StoreDoubles(&out[x][y * channels], vSum);
1023+ else
1024+ StoreDoubles(&out[y][x * channels], vSum);
1025+}
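For comparison, a sketch (not part of the patch) of the fully fused alternative that the comment above alludes to: it issues fewer instructions but chains three dependent FMAs, so its critical path is longer than the grouping used in DoOneIIR.

#ifdef __FMA__
static FORCE_INLINE __m256d IIR3SerialFMA_Sketch(__m256d vIn, __m256d p0, __m256d p1, __m256d p2,
                                                 const __m256d k[N + 1])
{
    // p0*k[1] + p1*k[2] + p2*k[3] + vIn*k[0], evaluated as one dependent chain
    return _mm256_fmadd_pd(p0, k[1],
           _mm256_fmadd_pd(p1, k[2],
           _mm256_fmadd_pd(p2, k[3], _mm256_mul_pd(vIn, k[0]))));
}
#endif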
1026+
1027+// input is always untransposed
1028+// for reverse pass, input is output from forward pass
1029+// for transposed output, in-place operation isn't possible
1030+// hack: GCC fails to compile when FORCE_INLINE is enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target specified in #pragma, creating 2 incompatible functions that can't be inlined
1031+template <bool transposeOut, bool isForwardPass, bool isBorder, int channels, typename OutType, typename InType, typename SIMD_Type>
1032+static /*FORCE_INLINE*/ void Convolve1DHorizontal(SimpleImage<OutType> out,
1033+ SimpleImage<InType> in,
1034+ double *borderValues,
1035+ ssize_t xStart, ssize_t xEnd, ssize_t width, ssize_t height,
1036+ SIMD_Type *vCoefficients, double M[N * N])
1037+{
1038+#if 0
1039+
1040+ Convolve1DHorizontalRef<transposeOut, isForwardPass, isBorder, channels>(out,
1041+ in,
1042+ borderValues,
1043+ xStart, xEnd, width, height,
1044+ vCoefficients, M);
1045+ return;
1046+
1047+#endif
1048+ const ssize_t xStep = isForwardPass ? 1 : -1;
1049+ if (channels == 4)
1050+ {
1051+#ifdef __AVX__
1052+ ssize_t y = 0;
1053+ do
1054+ {
1055+ __m256d prevOut[N];
1056+
1057+ ssize_t x = xStart;
1058+ if (isBorder && !isForwardPass)
1059+ {
1060+ // condition: xStart must be width - 1
1061+ double u[N + 1][channels]; //[x][channels]
1062+ for (ssize_t i = 0; i < N + 1; ++i)
1063+ {
1064+ __m256d temp;
1065+ _mm256_storeu_pd(u[i], LoadDoubles(temp, &in[y][(xStart + i * xStep) * channels]));
1066+ }
1067+ double backwardsInitialState[N][channels];
1068+ calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1069+ for (ssize_t i = 0; i < N; ++i)
1070+ LoadDoubles(prevOut[i], backwardsInitialState[i]);
1071+
1072+ if (transposeOut)
1073+ StoreDoubles(&out[x][y * channels], prevOut[0]);
1074+ else
1075+ StoreDoubles(&out[y][x * channels], prevOut[0]);
1076+
1077+ x += xStep;
1078+ if (x == xEnd)
1079+ goto nextIteration;
1080+ }
1081+ else if (isBorder && isForwardPass)
1082+ {
1083+ __m256d firstPixel;
1084+ LoadDoubles(firstPixel, &in[y][0 * channels]);
1085+ for (ssize_t i = 0; i < N; ++i)
1086+ prevOut[i] = firstPixel;
1087+ }
1088+ else
1089+ {
1090+ for (ssize_t i = 0; i < N; ++i)
1091+ {
1092+ if (transposeOut)
1093+ LoadDoubles(prevOut[i], &out[xStart - (i + 1) * xStep][y * channels]);
1094+ else
1095+ LoadDoubles(prevOut[i], &out[y][(xStart - (i + 1) * xStep) * channels]);
1096+ }
1097+ }
1098+
1099+#if 0 // no measurable speedup
1100+ // same as loop below, but unrolled 3 times to increase ||ism, hide latency, and reduce overhead of shifting the sliding window (prevOut)
1101+ __m256d vIn;
1102+ LoadDoubles(vIn, &in[y][xStart * channels]);
1103+ for ( ; isForwardPass ? (x < xEnd - 3) : (x > xEnd + 3); )
1104+ {
1105+ __m256d vSum;
1106+ DoOneIIR<channels, transposeOut, xStep, 0, 1, 2>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1107+ prevOut[2] = vSum;
1108+ x += xStep;
1109+
1110+ DoOneIIR<channels, transposeOut, xStep, 2, 0, 1>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1111+ prevOut[1] = vSum;
1112+ x += xStep;
1113+
1114+ DoOneIIR<channels, transposeOut, xStep, 1, 2, 0>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1115+ prevOut[0] = vSum;
1116+ x += xStep;
1117+ }
1118+#endif
1119+ while (isForwardPass ? (x < xEnd) : (x > xEnd))
1120+ {
1121+ __m256d vIn, vSum;
1122+ LoadDoubles(vIn, &in[y][x * channels]);
1123+
1124+ // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this Gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
1125+ #ifdef __FMA__
1126+ // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster
1127+ vSum = MultiplyAdd(vIn, vCoefficients[0], prevOut[2] * vCoefficients[3]) + MultiplyAdd(prevOut[1], vCoefficients[2], prevOut[0] * vCoefficients[1]);
1128+ #else
1129+ vSum = prevOut[0] * vCoefficients[1]
1130+ + prevOut[1] * vCoefficients[2]
1131+ + prevOut[2] * vCoefficients[3]
1132+ + vIn * vCoefficients[0];
1133+ #endif
1134+ if (transposeOut)
1135+ StoreDoubles(&out[x][y * channels], vSum);
1136+ else
1137+ StoreDoubles(&out[y][x * channels], vSum);
1138+
1139+ prevOut[2] = prevOut[1];
1140+ prevOut[1] = prevOut[0];
1141+ prevOut[0] = vSum;
1142+ x += xStep;
1143+ }
1144+ nextIteration:
1145+ ++y;
1146+ } while (y < height);
1147+#else
1148+ // todo: yuck, find some way to refactor (emulate __m256d perhaps?)
1149+ ssize_t y = 0;
1150+ do
1151+ {
1152+ __m128d prevOut[N][2];
1153+
1154+ ssize_t x = xStart;
1155+ if (isBorder && !isForwardPass)
1156+ {
1157+ // condition: xStart must be width - 1
1158+ double u[N + 1][channels]; //[x][channels]
1159+ for (ssize_t i = 0; i < N + 1; ++i)
1160+ {
1161+ __m128d temp;
1162+ _mm_storeu_pd(u[i], LoadDoubles(temp, &in[y][(xStart + i * xStep) * channels]));
1163+ _mm_storeu_pd(&u[i][2], LoadDoubles(temp, &in[y][(xStart + i * xStep) * channels + 2]));
1164+ }
1165+ double backwardsInitialState[N][channels];
1166+ calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1167+ for (ssize_t i = 0; i < N; ++i)
1168+ {
1169+ LoadDoubles(prevOut[i][0], backwardsInitialState[i]);
1170+ LoadDoubles(prevOut[i][1], &backwardsInitialState[i][2]);
1171+ }
1172+
1173+ if (transposeOut)
1174+ {
1175+ StoreDoubles(&out[x][y * channels], prevOut[0][0]);
1176+ StoreDoubles(&out[x][y * channels + 2], prevOut[0][1]);
1177+ }
1178+ else
1179+ {
1180+ StoreDoubles(&out[y][x * channels], prevOut[0][0]);
1181+ StoreDoubles(&out[y][x * channels + 2], prevOut[0][1]);
1182+ }
1183+
1184+ x += xStep;
1185+ if (x == xEnd)
1186+ goto nextIteration;
1187+ }
1188+ else if (isBorder && isForwardPass)
1189+ {
1190+ __m128d firstPixel[2];
1191+ LoadDoubles(firstPixel[0], &in[y][0 * channels]);
1192+ LoadDoubles(firstPixel[1], &in[y][0 * channels + 2]);
1193+ for (ssize_t i = 0; i < N; ++i)
1194+ {
1195+ prevOut[i][0] = firstPixel[0];
1196+ prevOut[i][1] = firstPixel[1];
1197+ }
1198+ }
1199+ else
1200+ {
1201+ for (ssize_t i = 0; i < N; ++i)
1202+ {
1203+ if (transposeOut)
1204+ {
1205+ LoadDoubles(prevOut[i][0], &out[xStart - (i + 1) * xStep][y * channels]);
1206+ LoadDoubles(prevOut[i][1], &out[xStart - (i + 1) * xStep][y * channels + 2]);
1207+ }
1208+ else
1209+ {
1210+ LoadDoubles(prevOut[i][0], &out[y][(xStart - (i + 1) * xStep) * channels]);
1211+ LoadDoubles(prevOut[i][1], &out[y][(xStart - (i + 1) * xStep) * channels + 2]);
1212+ }
1213+ }
1214+ }
1215+
1216+ while (isForwardPass ? (x < xEnd) : (x > xEnd))
1217+ {
1218+ __m128d vIn[2], vSum[2];
1219+ LoadDoubles(vIn[0], &in[y][x * channels]);
1220+ LoadDoubles(vIn[1], &in[y][x * channels + 2]);
1221+
1222+ // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this Gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
1223+ vSum[0] = prevOut[0][0] * vCoefficients[1][0]
1224+ + prevOut[1][0] * vCoefficients[2][0]
1225+ + prevOut[2][0] * vCoefficients[3][0]
1226+ + vIn[0] * vCoefficients[0][0];
1227+
1228+ vSum[1] = prevOut[0][1] * vCoefficients[1][1]
1229+ + prevOut[1][1] * vCoefficients[2][1]
1230+ + prevOut[2][1] * vCoefficients[3][1]
1231+ + vIn[1] * vCoefficients[0][1];
1232+ if (transposeOut)
1233+ {
1234+ StoreDoubles(&out[x][y * channels], vSum[0]);
1235+ StoreDoubles(&out[x][y * channels + 2], vSum[1]);
1236+ }
1237+ else
1238+ {
1239+ StoreDoubles(&out[y][x * channels], vSum[0]);
1240+ StoreDoubles(&out[y][x * channels + 2], vSum[1]);
1241+ }
1242+ prevOut[2][0] = prevOut[1][0];
1243+ prevOut[2][1] = prevOut[1][1];
1244+ prevOut[1][0] = prevOut[0][0];
1245+ prevOut[1][1] = prevOut[0][1];
1246+ prevOut[0][0] = vSum[0];
1247+ prevOut[0][1] = vSum[1];
1248+ x += xStep;
1249+ }
1250+ nextIteration:
1251+ ++y;
1252+ } while (y < height);
1253+#endif
1254+ }
1255+ else
1256+ {
1257+ ssize_t y = 0;
1258+ do
1259+ {
1260+ if (isForwardPass)
1261+ {
1262+ ssize_t x = xStart;
1263+ __m128d feedback[2],
1264+ k0[2];
1265+ #ifdef __AVX__
1266+ k0[0] = Cast256To128(vCoefficients[0]);
1267+ k0[1] = _mm256_extractf128_pd(vCoefficients[0], 1);
1268+ #else
1269+ k0[0] = vCoefficients[0];
1270+ k0[1] = vCoefficients[1];
1271+ #endif
1272+
1273+ if (isBorder && isForwardPass)
1274+ {
1275+ // xStart must be 0
1276+ feedback[0] = feedback[1] = _mm_set1_pd(in[y][0]);
1277+ }
1278+ else
1279+ {
1280+ LoadDoubles(feedback[0], &out[y][xStart - 3 * xStep]);
1281+ LoadDoubles(feedback[1], &out[y][xStart - 1 * xStep]);
1282+
1283+ feedback[1] = _mm_shuffle_pd(feedback[0], feedback[1], _MM_SHUFFLE2(0, 1));
1284+ feedback[0] = _mm_shuffle_pd(feedback[0], feedback[0], _MM_SHUFFLE2(0, 0));
1285+ }
1286+ // feedback = [garbage y-3 y-2 y-1]
1287+ for (; x != xEnd; x += xStep)
1288+ {
1289+ __m128d _in = _mm_set1_pd(in[y][x]),
1290+ newOutput;
1291+ #ifdef __SSE4_1__
1292+ feedback[0] = _mm_blend_pd(feedback[0], _in, 0x1);
1293+ newOutput = _mm_add_pd(_mm_dp_pd(feedback[0], k0[0], 0x31),
1294+ _mm_dp_pd(feedback[1], k0[1], 0x31));
1295+ feedback[0] = _mm_blend_pd(feedback[0], newOutput, 0x1); // insert back input
1296+ #else
1297+ __m128d FIRST_ELEMENT_MASK = _mm_castsi128_pd(_mm_set_epi64x(0, ~uint64_t(0)));
1298+ feedback[0] = Select(feedback[0], _in, FIRST_ELEMENT_MASK);
1299+
1300+ __m128d partialDP = _mm_add_pd
1301+ (
1302+ _mm_mul_pd(feedback[0], k0[0]),
1303+ _mm_mul_pd(feedback[1], k0[1])
1304+ );
1305+ newOutput = _mm_add_pd
1306+ (
1307+ partialDP,
1308+ _mm_shuffle_pd(partialDP, partialDP, _MM_SHUFFLE2(0, 1))
1309+ );
1310+ feedback[0] = Select(feedback[0], newOutput, FIRST_ELEMENT_MASK); // insert back input
1311+ #endif
1312+ out[y][x] = _mm_cvtsd_f64(newOutput);
1313+ feedback[0] = _mm_shuffle_pd(feedback[0], feedback[1], _MM_SHUFFLE2(0, 0));
1314+ feedback[1] = _mm_shuffle_pd(feedback[1], feedback[0], _MM_SHUFFLE2(0, 1));
1315+ }
1316+ }
1317+ else
1318+ {
1319+ __m128d feedback[2], k4[2];
1320+ #ifdef __AVX__
1321+ k4[0] = Cast256To128(vCoefficients[4]);
1322+ k4[1] = _mm256_extractf128_pd(vCoefficients[4], 1);
1323+ #else
1324+ k4[0] = vCoefficients[8];
1325+ k4[1] = vCoefficients[9];
1326+ #endif
1327+ ssize_t x = xStart;
1328+ if (isBorder && !isForwardPass)
1329+ {
1330+ // xStart must be width - 1
1331+ double u[N + 1][1]; //[x][y][channels]
1332+ for (ssize_t i = 0; i < N + 1; ++i)
1333+ {
1334+ u[i][0] = in[y][xStart + i * xStep];
1335+ }
1336+ #define ROUND_UP(a, b) (((a) + (b) - 1) / (b) * (b))
1337+ double backwardsInitialState[ROUND_UP(N, 2)][1]; // pad so vector loads don't go past end
1338+ calcTriggsSdikaInitialization<1>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1339+
1340+ feedback[0] = _mm_loadu_pd(&backwardsInitialState[0][0]); // unaligned load: this local array is only guaranteed 8-byte alignment
1341+ feedback[1] = _mm_loadu_pd(&backwardsInitialState[2][0]);
1342+
1343+ out[y][x] = backwardsInitialState[0][0];
1344+ x += xStep;
1345+ if (x == xEnd)
1346+ continue;
1347+ }
1348+ else
1349+ {
1350+ LoadDoubles(feedback[0], &out[y][xStart - xStep]);
1351+ LoadDoubles(feedback[1], &out[y][xStart - 3 * xStep]);
1352+ }
1353+
1354+ for (; x != xEnd; x += xStep)
1355+ {
1356+ __m128d _in = _mm_set1_pd(in[y][x]),
1357+ newOutput;
1358+ #ifdef __SSE4_1__
1359+ feedback[1] = _mm_blend_pd(feedback[1], _in, 0x2);
1360+ newOutput = _mm_add_pd(_mm_dp_pd(feedback[0], k4[0], 0x32),
1361+ _mm_dp_pd(feedback[1], k4[1], 0x32));
1362+ feedback[1] = _mm_blend_pd(feedback[1], newOutput, 0x2); // insert back input
1363+ #else
1364+ __m128d LAST_ELEMENT_MASK = _mm_castsi128_pd(_mm_set_epi64x(~uint64_t(0), 0));
1365+ feedback[1] = Select(feedback[1], _in, LAST_ELEMENT_MASK);
1366+
1367+ __m128d partialDP = _mm_add_pd
1368+ (
1369+ _mm_mul_pd(feedback[0], k4[0]),
1370+ _mm_mul_pd(feedback[1], k4[1])
1371+ );
1372+ newOutput = _mm_add_pd
1373+ (
1374+ partialDP,
1375+ _mm_shuffle_pd(partialDP, partialDP, _MM_SHUFFLE2(0, 0))
1376+ );
1377+ feedback[1] = Select(feedback[1], newOutput, LAST_ELEMENT_MASK);
1378+ #endif
1379+
1380+ __m128d temp = _mm_shuffle_pd(feedback[1], feedback[1], _MM_SHUFFLE2(0, 1));
1381+ out[y][x] = _mm_cvtsd_f64(temp);
1382+
1383+ feedback[1] = _mm_shuffle_pd(feedback[0], feedback[1], _MM_SHUFFLE2(1, 1));
1384+ feedback[0] = _mm_shuffle_pd(feedback[1], feedback[0], _MM_SHUFFLE2(0, 1));
1385+ }
1386+ }
1387+ ++y;
1388+ } while (y < height);
1389+ }
1390+}
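The pre-SSE4.1 branches above rely on a Select() helper defined earlier in this header; judging from how it is used (as a stand-in for _mm_blend_pd), it behaves like the classic SSE2 bit-select, sketched here for reference (an assumption about the helper, not part of the patch):

static FORCE_INLINE __m128d Select_Sketch(__m128d a, __m128d b, __m128d mask)
{
    // returns b where mask bits are set, a elsewhere
    return _mm_or_pd(_mm_and_pd(mask, b), _mm_andnot_pd(mask, a));
}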
1391+
1392+
1393+template <int channels, bool transposeOut, ssize_t xStep, int i0, int i1, int i2, typename OutType, typename InType>
1394+FORCE_INLINE void DoOneIIR(SimpleImage<OutType> out, SimpleImage<InType> in, __m256 &vSum, __m256 &vIn, ssize_t x, ssize_t y, __m256 vCoefficients[N + 1], __m256 prevOut[N])
1395+{
1396+ vSum = vIn * vCoefficients[0];
1397+
1398+ // load data for next iteration early to hide latency (software pipelining)
1399+ vIn = Load4x2Floats(&in[y][(x + xStep) * channels],
1400+ &in[y + 1][(x + xStep) * channels]);
1401+
1402+ // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this Gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
1403+#ifdef __FMA__
1404+ // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster
1405+ vSum = MultiplyAdd(prevOut[i2], vCoefficients[3], vSum) + MultiplyAdd(prevOut[i1], vCoefficients[2], prevOut[i0] * vCoefficients[1]);
1406+#else
1407+ vSum = prevOut[i0] * vCoefficients[1]
1408+ + prevOut[i1] * vCoefficients[2]
1409+ + prevOut[i2] * vCoefficients[3]
1410+ + vIn * vCoefficients[0];
1411+#endif
1412+ if (transposeOut)
1413+ {
1414+ StoreFloats(&out[x][y * channels], _mm256_castps256_ps128(vSum));
1415+ StoreFloats(&out[x][(y + 1) * channels], _mm256_extractf128_ps(vSum, 1));
1416+ }
1417+ else
1418+ {
1419+ StoreFloats(&out[y][x * channels], _mm256_castps256_ps128(vSum));
1420+ StoreFloats(&out[y + 1][x * channels], _mm256_extractf128_ps(vSum, 1));
1421+ }
1422+}
1423+
1424+// hack: GCC fails to compile when FORCE_INLINE is enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target specified in #pragma, creating 2 incompatible functions that can't be inlined
1425+template <bool transposeOut, bool isForwardPass, bool isBorder, int channels, typename OutType, typename InType, typename SIMD_Type>
1426+static /*FORCE_INLINE*/void Convolve1DHorizontal(SimpleImage<OutType> out,
1427+ SimpleImage<InType> in,
1428+ float *borderValues,
1429+ ssize_t xStart, ssize_t xEnd, ssize_t width, ssize_t height,
1430+ SIMD_Type *vCoefficients, double M[N * N])
1431+{
1432+#if 0
1433+ MyTraits<float>::SIMDtype coefficients2[4];
1434+
1435+ if (channels == 1)
1436+ {
1437+ coefficients2[0] = _mm256_set1_ps(((float *)vCoefficients)[0]);
1438+ coefficients2[1] = _mm256_set1_ps(((float *)vCoefficients)[3]);
1439+ coefficients2[2] = _mm256_set1_ps(((float *)vCoefficients)[2]);
1440+ coefficients2[3] = _mm256_set1_ps(((float *)vCoefficients)[1]);
1441+ vCoefficients = coefficients2;
1442+ }
1443+ Convolve1DHorizontalRef<transposeOut, isForwardPass, isBorder, channels>(out,
1444+ in,
1445+ borderValues,
1446+ xStart, xEnd, width, height,
1447+ vCoefficients, M);
1448+ return;
1449+#endif
1450+ const ssize_t xStep = isForwardPass ? 1 : -1;
1451+
1452+ if (channels == 4)
1453+ {
1454+ ssize_t y = 0;
1455+#ifdef __AVX__
1456+ for (; y <= height - 2; y += 2) // AVX code processes 2 rows at a time
1457+ {
1458+ __m256 prevOut[N];
1459+
1460+ ssize_t x = xStart;
1461+
1462+ if (isBorder && !isForwardPass)
1463+ {
1464+ float u[N + 1][2 * channels]; //[x][y][channels]
1465+ for (ssize_t i = 0; i < N + 1; ++i)
1466+ {
1467+ __m128 temp;
1468+ _mm_storeu_ps(&u[i][0], LoadFloats(temp, &in[y][(xStart + i * xStep) * channels]));
1469+ _mm_storeu_ps(&u[i][channels], LoadFloats(temp, &in[y + 1][(xStart + i * xStep) * channels]));
1470+ }
1471+ float backwardsInitialState[N][2 * channels];
1472+ calcTriggsSdikaInitialization<2 * channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1473+ for (ssize_t i = 0; i < N; ++i)
1474+ LoadFloats(prevOut[i], backwardsInitialState[i]);
1475+
1476+ if (transposeOut)
1477+ {
1478+ StoreFloats(&out[x][y * channels], _mm256_castps256_ps128(prevOut[0]));
1479+ StoreFloats(&out[x][(y + 1) * channels], _mm256_extractf128_ps(prevOut[0], 1));
1480+ }
1481+ else
1482+ {
1483+ StoreFloats(&out[y][x * channels], _mm256_castps256_ps128(prevOut[0]));
1484+ StoreFloats(&out[y + 1][x * channels], _mm256_extractf128_ps(prevOut[0], 1));
1485+ }
1486+ x += xStep;
1487+ if (x == xEnd)
1488+ continue;
1489+ }
1490+ else if (isBorder && isForwardPass)
1491+ {
1492+ // xStart must be 0
1493+ __m256 firstPixel = Load4x2Floats(&in[y][0 * channels], &in[y + 1][0 * channels]);
1494+ for (ssize_t i = 0; i < N; ++i)
1495+ prevOut[i] = firstPixel;
1496+ }
1497+ else
1498+ {
1499+ for (ssize_t i = 0; i < N; ++i)
1500+ {
1501+ prevOut[i] = transposeOut ? Load4x2Floats(&out[xStart - (i + 1) * xStep][y * channels],
1502+ &out[xStart - (i + 1) * xStep][(y + 1) * channels])
1503+ : Load4x2Floats(&out[y][(xStart - (i + 1) * xStep) * channels],
1504+ &out[y + 1][(xStart - (i + 1) * xStep) * channels]);
1505+ }
1506+ }
1507+
1508+#if 0 // 2x slower than no unrolling - too many register spills?
1509+ // same as loop below, but unrolled 3 times to increase ||ism, hide latency, and reduce overhead of shifting the sliding window (prevOut)
1510+ __m256 vIn = Load4x2Floats(&in[y][xStart * channels],
1511+ &in[y + 1][xStart * channels]);
1512+
1513+ for (; isForwardPass ? (x < xEnd - 3) : (x > xEnd + 3); )
1514+ {
1515+ __m256 vSum;
1516+ DoOneIIR<channels, transposeOut, xStep, 0, 1, 2>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1517+ prevOut[2] = vSum;
1518+ x += xStep;
1519+
1520+ DoOneIIR<channels, transposeOut, xStep, 2, 0, 1>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1521+ prevOut[1] = vSum;
1522+ x += xStep;
1523+
1524+ DoOneIIR<channels, transposeOut, xStep, 1, 2, 0>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1525+ prevOut[0] = vSum;
1526+ x += xStep;
1527+ }
1528+#endif
1529+ for (; x != xEnd; x += xStep)
1530+ {
1531+ __m256 vIn = Load4x2Floats(&in[y][x * channels],
1532+ &in[y + 1][x * channels]),
1533+ vSum;
1534+
1535+ // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this Gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
1536+#ifdef __FMA__
1537+ // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster
1538+ vSum = MultiplyAdd(vIn, vCoefficients[0], prevOut[2] * vCoefficients[3]) + MultiplyAdd(prevOut[1], vCoefficients[2], prevOut[0] * vCoefficients[1]);
1539+#else
1540+ vSum = prevOut[0] * vCoefficients[1]
1541+ + prevOut[1] * vCoefficients[2]
1542+ + prevOut[2] * vCoefficients[3]
1543+ + vIn * vCoefficients[0];
1544+#endif
1545+
1546+ if (transposeOut)
1547+ {
1548+ StoreFloats(&out[x][y * channels], _mm256_castps256_ps128(vSum));
1549+ StoreFloats(&out[x][(y + 1) * channels], _mm256_extractf128_ps(vSum, 1));
1550+ }
1551+ else
1552+ {
1553+ StoreFloats(&out[y][x * channels], _mm256_castps256_ps128(vSum));
1554+ StoreFloats(&out[y + 1][x * channels], _mm256_extractf128_ps(vSum, 1));
1555+ }
1556+ prevOut[2] = prevOut[1];
1557+ prevOut[1] = prevOut[0];
1558+ prevOut[0] = vSum;
1559+ }
1560+ }
1561+#endif
1562+ for (; y < height; ++y)
1563+ {
1564+ __m128 prevOut[N];
1565+ ssize_t x = xStart;
1566+
1567+ if (isBorder && !isForwardPass)
1568+ {
1569+ float u[N + 1][channels]; //[x][channels]
1570+ for (ssize_t i = 0; i < N + 1; ++i)
1571+ {
1572+ __m128 temp;
1573+ _mm_storeu_ps(u[i], LoadFloats(temp, &in[y][(xStart + i * xStep) * channels]));
1574+ }
1575+ float backwardsInitialState[N][channels];
1576+ calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1577+ for (ssize_t i = 0; i < N; ++i)
1578+ LoadFloats(prevOut[i], backwardsInitialState[i]);
1579+
1580+ if (transposeOut)
1581+ StoreFloats(&out[x][y * channels], prevOut[0]);
1582+ else
1583+ StoreFloats(&out[y][x * channels], prevOut[0]);
1584+ x += xStep;
1585+ if (x == xEnd)
1586+ continue;
1587+ }
1588+ else if (isBorder && isForwardPass)
1589+ {
1590+ // xStart must be 0
1591+ __m128 firstPixel;
1592+ LoadFloats(firstPixel, &in[y][0 * channels]);
1593+ for (ssize_t i = 0; i < N; ++i)
1594+ prevOut[i] = firstPixel;
1595+ }
1596+ else
1597+ {
1598+ for (ssize_t i = 0; i < N; ++i)
1599+ {
1600+ if (transposeOut)
1601+ LoadFloats(prevOut[i], &out[xStart - (i + 1) * xStep][y * channels]);
1602+ else
1603+ LoadFloats(prevOut[i], &out[y][(xStart - (i + 1) * xStep) * channels]);
1604+ }
1605+ }
1606+
1607+ do
1608+ {
1609+ __m128 vIn, vSum;
1610+ LoadFloats(vIn, &in[y][x * channels]);
1611+ // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this Gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
1612+#ifdef __FMA__
1613+ // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster
1614+ vSum = MultiplyAdd(vIn, Cast256To128(vCoefficients[0]), prevOut[2] * Cast256To128(vCoefficients[3])) + MultiplyAdd(prevOut[1], Cast256To128(vCoefficients[2]), prevOut[0] * Cast256To128(vCoefficients[1]));
1615+#else
1616+ vSum = prevOut[0] * Cast256To128(vCoefficients[1])
1617+ + prevOut[1] * Cast256To128(vCoefficients[2])
1618+ + prevOut[2] * Cast256To128(vCoefficients[3])
1619+ + vIn * Cast256To128(vCoefficients[0]);
1620+#endif
1621+ if (transposeOut)
1622+ {
1623+ StoreFloats(&out[x][y * channels], vSum);
1624+ }
1625+ else
1626+ {
1627+ StoreFloats(&out[y][x * channels], vSum);
1628+ }
1629+ prevOut[2] = prevOut[1];
1630+ prevOut[1] = prevOut[0];
1631+ prevOut[0] = vSum;
1632+ x += xStep;
1633+ } while (x != xEnd);
1634+ }
1635+ }
1636+ else
1637+ {
1638+ //static_assert(!transposeOut, "transpose not supported");
1639+ ssize_t y = 0;
1640+
1641+ const ssize_t Y_BLOCK_SIZE = 8;
1642+#ifdef __AVX__
1643+ for (; y <= height - Y_BLOCK_SIZE; y += Y_BLOCK_SIZE)
1644+ {
1645+ if (isForwardPass)
1646+ {
1647+ ssize_t x = xStart;
1648+ __m256 feedback[4],
1649+ k0 = vCoefficients[0],
1650+ k1 = vCoefficients[1],
1651+ k2 = vCoefficients[2],
1652+ k3 = vCoefficients[3];
1653+
1654+ if (isBorder && isForwardPass)
1655+ {
1656+ // xStart must be 0
1657+ for (ssize_t i = 0; i < 4; ++i)
1658+ {
1659+ feedback[i] = _mm256_setr_m128(_mm_set1_ps(in[y + i * 2][0]),
1660+ _mm_set1_ps(in[y + i * 2 + 1][0]));
1661+ }
1662+ }
1663+ else
1664+ {
1665+ for (ssize_t i = 0; i < 4; ++i)
1666+ {
1667+ feedback[i] = Load4x2Floats(&out[y + i * 2][xStart - 3 * xStep],
1668+ &out[y + i * 2 + 1][xStart - 3 * xStep]);
1669+ feedback[i] = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(2, 1, 0, 0));
1670+ }
1671+ }
1672+ // feedback[i] = [garbage y-3 y-2 y-1]
1673+ for (; x <= xEnd - 4; x += 4)
1674+ {
1675+ __m256 _in[4];
1676+ for (ssize_t i = 0; i < 4; ++i)
1677+ _in[i] = Load4x2Floats(&in[y + i * 2][x], &in[y + i * 2 + 1][x]);
1678+
1679+ for (int i = 0; i < 4; ++i)
1680+ {
1681+ feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x11);
1682+ feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k0, 0xf1), 0x11); // insert back input
1683+ }
1684+
1685+ for (int i = 0; i < 4; ++i)
1686+ {
1687+ feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x22);
1688+ feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k1, 0xf2), 0x22); // insert back input
1689+ }
1690+
1691+ for (int i = 0; i < 4; ++i)
1692+ {
1693+ feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x44);
1694+ feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k2, 0xf4), 0x44); // insert back input
1695+ }
1696+
1697+ for (ssize_t i = 0; i < 4; ++i)
1698+ {
1699+ feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x88);
1700+ feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k3, 0xf8), 0x88); // insert back input
1701+
1702+ _mm_storeu_ps((float *)&out[y + i * 2][x], _mm256_castps256_ps128(feedback[i]));
1703+ _mm_storeu_ps((float *)&out[y + i * 2 + 1][x], _mm256_extractf128_ps(feedback[i], 1));
1704+ }
1705+ }
1706+ for (; x != xEnd; x += xStep)
1707+ {
1708+ // todo: make these scalar loads to avoid buffer overflow
1709+ __m256 _in[4];
1710+ for (ssize_t i = 0; i < 4; ++i)
1711+ {
1712+ _in[i] = Load4x2Floats(&in[y + i * 2][x],
1713+ &in[y + i * 2 + 1][x]);
1714+ }
1715+
1716+ for (int i = 0; i < 4; ++i)
1717+ {
1718+ feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x11);
1719+ feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k0, 0xf1), 0x11); // insert back input
1720+ }
1721+
1722+ for (ssize_t i = 0; i < 4; ++i)
1723+ {
1724+ out[y + i * 2][x] = _mm_cvtss_f32(_mm256_castps256_ps128(feedback[i]));
1725+ out[y + i * 2 + 1][x] = _mm_cvtss_f32(_mm256_extractf128_ps(feedback[i], 1));
1726+ }
1727+
1728+ for (int i = 0; i < 4; ++i)
1729+ feedback[i] = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(0, 3, 2, 0));
1730+ }
1731+ }
1732+ else
1733+ {
1734+ __m256 feedback[4],
1735+ k4 = vCoefficients[4],
1736+ k5 = vCoefficients[5],
1737+ k6 = vCoefficients[6],
1738+ k7 = vCoefficients[7];
1739+ ssize_t x = xStart;
1740+ if (isBorder && !isForwardPass)
1741+ {
1742+ // xStart must be width - 1
1743+ float u[N + 1][8 * channels]; //[x][y][channels]
1744+ for (ssize_t i = 0; i < N + 1; ++i)
1745+ {
1746+ for (ssize_t _y = 0; _y < 8; ++_y)
1747+ u[i][_y] = in[y + _y][xStart + i * xStep];
1748+ }
1749+ float backwardsInitialState[N][8 * channels];
1750+ calcTriggsSdikaInitialization<8 * channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1751+
1752+ for (ssize_t i = 0; i < 4; ++i)
1753+ {
1754+ float temp[2][N + 1];
1755+ for (ssize_t j = 0; j < N; ++j)
1756+ {
1757+ temp[0][j] = backwardsInitialState[j][i * 2];
1758+ temp[1][j] = backwardsInitialState[j][i * 2 + 1];
1759+ }
1760+ feedback[i] = Load4x2Floats(temp[0], temp[1]);
1761+ }
1762+
1763+ for (ssize_t _y = 0; _y < Y_BLOCK_SIZE; ++_y)
1764+ out[y + _y][x] = backwardsInitialState[0][_y];
1765+
1766+ x += xStep;
1767+ if (x == xEnd)
1768+ continue;
1769+ }
1770+ else
1771+ {
1772+ for (ssize_t i = 0; i < 4; ++i)
1773+ {
1774+ feedback[i] = Load4x2Floats(&out[y + i * 2][xStart - xStep],
1775+ &out[y + i * 2 + 1][xStart - xStep]);
1776+ }
1777+ }
1778+ for (; x - 4 >= xEnd; x -= 4)
1779+ {
1780+ __m256 _in[4];
1781+ for (ssize_t i = 0; i < 4; ++i)
1782+ {
1783+ _in[i] = Load4x2Floats(&in[y + i * 2][x - 3],
1784+ &in[y + i * 2 + 1][x - 3]);
1785+ }
1786+
1787+ for (int i = 0; i < 4; ++i)
1788+ {
1789+ feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x88);
1790+ feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k4, 0xf8), 0x88); // insert back input
1791+ }
1792+
1793+ for (int i = 0; i < 4; ++i)
1794+ {
1795+ feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x44);
1796+ feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k5, 0xf4), 0x44); // insert back input
1797+ }
1798+
1799+ for (int i = 0; i < 4; ++i)
1800+ {
1801+ feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x22);
1802+ feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k6, 0xf2), 0x22); // insert back input
1803+ }
1804+
1805+ for (ssize_t i = 0; i < 4; ++i)
1806+ {
1807+ feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x11);
1808+ feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k7, 0xf1), 0x11); // insert back input
1809+
1810+ StoreFloats(&out[y + i * 2][x - 3], _mm256_castps256_ps128(feedback[i]));
1811+ StoreFloats(&out[y + i * 2 + 1][x - 3], _mm256_extractf128_ps(feedback[i], 1));
1812+ }
1813+ }
1814+
1815+ for ( ; x != xEnd; x += xStep)
1816+ {
1817+ // todo: make these scalar loads to avoid buffer overflow
1818+ __m256 _in[4];
1819+ for (ssize_t i = 0; i < 4; ++i)
1820+ {
1821+ _in[i] = Load4x2Floats(&in[y + i * 2][x - 3],
1822+ &in[y + i * 2 + 1][x - 3]);
1823+ }
1824+
1825+ for (int i = 0; i < 4; ++i)
1826+ {
1827+ feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x88);
1828+ feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k4, 0xf8), 0x88); // insert back input
1829+ }
1830+
1831+ for (ssize_t i = 0; i < 4; ++i)
1832+ {
1833+ __m256 temp = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(0, 0, 0, 3));
1834+ out[y + i * 2][x] = _mm_cvtss_f32(_mm256_castps256_ps128(temp));
1835+ out[y + i * 2 + 1][x] = _mm_cvtss_f32(_mm256_extractf128_ps(temp, 1));
1836+ }
1837+
1838+
1839+ for (int i = 0; i < 4; ++i)
1840+ feedback[i] = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(2, 1, 0, 3));
1841+ }
1842+ }
1843+ }
1844+#endif
1845+ for (; y < height; ++y)
1846+ {
1847+ if (isForwardPass)
1848+ {
1849+ ssize_t x = xStart;
1850+ __m128 feedback0,
1851+ k0 = Cast256To128(vCoefficients[0]);
1852+
1853+ if (isBorder && isForwardPass)
1854+ {
1855+ // xStart must be 0
1856+ feedback0 = _mm_set1_ps(in[y][0]);
1857+ }
1858+ else
1859+ {
1860+ LoadFloats(feedback0, &out[y][xStart - 3 * xStep]);
1861+ feedback0 = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(2, 1, 0, 0));
1862+ }
1863+ // feedback0 = [garbage y-3 y-2 y-1]
1864+ for (; x != xEnd; x += xStep)
1865+ {
1866+ __m128 _in0 = _mm_set1_ps(in[y][x]);
1867+
1868+ #ifdef __SSE4_1__
1869+ feedback0 = _mm_blend_ps(feedback0, _in0, 0x1);
1870+ feedback0 = _mm_blend_ps(feedback0, _mm_dp_ps(feedback0, k0, 0xf1), 0x1); // insert back input
1871+ #else
1872+ const __m128 FIRST_ELEMENT_MASK = _mm_castsi128_ps(_mm_set_epi32(0, 0, 0, ~0));
1873+ feedback0 = Select(feedback0, _in0, FIRST_ELEMENT_MASK);
1874+
1875+ __m128 partialDP = _mm_mul_ps(feedback0, k0);
1876+ partialDP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(0, 0, 3, 2)));
1877+ __m128 DP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(0, 0, 0, 1)));
1878+
1879+ feedback0 = Select(feedback0, DP, FIRST_ELEMENT_MASK); // insert back input
1880+ #endif
1881+ out[y][x] = _mm_cvtss_f32(feedback0);
1882+ feedback0 = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(0, 3, 2, 0));
1883+ }
1884+ }
1885+ else
1886+ {
1887+ __m128 feedback0,
1888+ k4 = Cast256To128(vCoefficients[4]);
1889+
1890+ ssize_t x = xStart;
1891+ if (isBorder && !isForwardPass)
1892+ {
1893+ // xStart must be width - 1
1894+ float u[N + 1][channels]; //[x][y][channels]
1895+ for (ssize_t i = 0; i < N + 1; ++i)
1896+ {
1897+ u[i][0] = in[y][xStart + i * xStep];
1898+ }
1899+ float backwardsInitialState[N][channels];
1900+ calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1901+
1902+ float temp[N + 1];
1903+ for (ssize_t i = 0; i < N; ++i)
1904+ {
1905+ temp[i] = backwardsInitialState[i][0];
1906+ }
1907+ LoadFloats(feedback0, temp);
1908+
1909+ out[y][x] = backwardsInitialState[0][0];
1910+ x += xStep;
1911+ if (x == xEnd)
1912+ continue;
1913+ }
1914+ else
1915+ {
1916+ LoadFloats(feedback0, &out[y][xStart - xStep]);
1917+ }
1918+
1919+ for (; x != xEnd; x += xStep)
1920+ {
1921+ __m128 _in0 = _mm_set1_ps(in[y][x]);
1922+
1923+ #ifdef __SSE4_1__
1924+ feedback0 = _mm_blend_ps(feedback0, _in0, 0x8);
1925+ feedback0 = _mm_blend_ps(feedback0, _mm_dp_ps(feedback0, k4, 0xf8), 0x8); // insert back input
1926+ #else
1927+ const __m128 LAST_ELEMENT_MASK = _mm_castsi128_ps(_mm_set_epi32(~0, 0, 0, 0));
1928+ feedback0 = Select(feedback0, _in0, LAST_ELEMENT_MASK);
1929+
1930+ __m128 partialDP = _mm_mul_ps(feedback0, k4);
1931+ partialDP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(1, 0, 0, 0)));
1932+ __m128 DP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(2, 0, 0, 0)));
1933+
1934+ feedback0 = Select(feedback0, DP, LAST_ELEMENT_MASK); // insert back input
1935+ #endif
1936+ __m128 temp = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(0, 0, 0, 3));
1937+ out[y][x] = _mm_cvtss_f32(temp);
1938+ feedback0 = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(2, 1, 0, 3));
1939+ }
1940+ }
1941+ }
1942+ }
1943+}
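The grayscale paths above lean heavily on _mm256_dp_ps/_mm_dp_ps together with the rotated coefficient vectors prepared in ConvolveHorizontal. As a reminder of what one such step computes, here is a scalar-equivalent sketch of _mm_dp_ps(v, k, 0xf1) (illustration only): all four lane products are summed, and with that immediate the result is written to lane 0.

static FORCE_INLINE float DotProduct4_Sketch(__m128 v, __m128 k)
{
    __m128 p = _mm_mul_ps(v, k);
    p = _mm_add_ps(p, _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 0, 3, 2)));   // lanes: [a+c, b+d, c+a, d+b]
    p = _mm_add_ps(p, _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1)));   // every lane = a+b+c+d
    return _mm_cvtss_f32(p);
}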
1944+
1945+
1946+// does 1D IIR convolution on multiple columns (width) of data
1947+// IntermediateType must be float or double
1948+template <bool isForwardPass, bool isBorder, typename OutType, typename InType, typename IntermediateType>
1949+FORCE_INLINE void Convolve1DVerticalRef(SimpleImage<OutType> out,
1950+ SimpleImage<InType> in,
1951+ IntermediateType *borderValues, // [y][color]
1952+ ssize_t yStart, ssize_t yEnd, ssize_t width, ssize_t height,
1953+ typename MyTraits<IntermediateType>::SIMDtype *vCoefficients, double M[N * N])
1954+{
1955+ ssize_t yStep = isForwardPass ? 1 : -1;
1956+
1957+ ssize_t x = 0;
1958+ do
1959+ {
1960+ IntermediateType prevOut[N];
1961+ ssize_t y = yStart;
1962+ if (isBorder && !isForwardPass)
1963+ {
1964+ IntermediateType u[N + 1][1]; // u[0] = last forward filtered value, u[1] = 2nd last forward filtered value, ...
1965+ for (ssize_t i = 0; i < N + 1; ++i)
1966+ {
1967+ u[i][0] = in[yStart + i * yStep][x];
1968+ }
1969+ IntermediateType backwardsInitialState[N][1];
1970+ calcTriggsSdikaInitialization<1>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1971+ for (ssize_t i = 0; i < N; ++i)
1972+ prevOut[i] = backwardsInitialState[i][0];
1973+
1974+ out[y][x] = clip_round_cast<OutType, IntermediateType>(prevOut[0]);
1975+ y += yStep;
1976+ if (y == yEnd)
1977+ goto nextIteration;
1978+ }
1979+ else if (isBorder && isForwardPass)
1980+ {
1981+ for (ssize_t i = 0; i < N; ++i)
1982+ prevOut[i] = in[0][x];
1983+ }
1984+ else
1985+ {
1986+ for (ssize_t i = 0; i < N; ++i)
1987+ prevOut[i] = out[yStart - (i + 1) * yStep][x];
1988+ }
1989+
1990+ do
1991+ {
1992+ IntermediateType sum = prevOut[0] * ExtractElement0(vCoefficients[1])
1993+ + prevOut[1] * ExtractElement0(vCoefficients[2])
1994+ + prevOut[2] * ExtractElement0(vCoefficients[3])
1995+ + in[y][x] * ExtractElement0(vCoefficients[0]); // add last for best accuracy since this term tends to be the smallest
1996+
1997+
1998+ out[y][x] = clip_round_cast<OutType, IntermediateType>(sum);
1999+ prevOut[2] = prevOut[1];
2000+ prevOut[1] = prevOut[0];
2001+ prevOut[0] = sum;
2002+ y += yStep;
2003+ } while (y != yEnd);
2004+ nextIteration:
2005+ ++x;
2006+ } while (x < width);
2007+}
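Unlike the horizontal pass, the vertical pass vectorizes naturally across adjacent columns, because every column runs the same recurrence independently; that is what the AVX/SSE code below does 8 (or 4) columns at a time. A one-step sketch (illustration only, border handling omitted):

static FORCE_INLINE __m128 VerticalIIRStep_Sketch(__m128 vIn, __m128 p0, __m128 p1, __m128 p2,
                                                  const __m128 k[N + 1])
{
    // out[y][x..x+3] = k0*in[y] + k1*out[y-1] + k2*out[y-2] + k3*out[y-3], computed per column
    return _mm_add_ps(_mm_mul_ps(vIn, k[0]),
           _mm_add_ps(_mm_mul_ps(p0, k[1]),
           _mm_add_ps(_mm_mul_ps(p1, k[2]), _mm_mul_ps(p2, k[3]))));
}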
2008+
2009+
2010+
2011+// input is always untransposed
2012+// for reverse pass, input is output from forward pass
2013+// for transposed output, in-place operation isn't possible
2014+// hack: GCC fails to compile when FORCE_INLINE is enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target specified in #pragma, creating 2 incompatible functions that can't be inlined
2015+template <bool isForwardPass, bool isBorder, typename OutType, typename InType, typename SIMD_Type>
2016+static /*FORCE_INLINE*/ void Convolve1DVertical(SimpleImage<OutType> out,
2017+ SimpleImage<InType> in,
2018+ float *borderValues,
2019+ ssize_t yStart, ssize_t yEnd, ssize_t width, ssize_t height,
2020+ SIMD_Type *vCoefficients, double M[N * N])
2021+{
2022+#if 0
2023+ Convolve1DVerticalRef<isForwardPass, isBorder>(out,
2024+ in,
2025+ borderValues,
2026+ yStart, yEnd, width, height,
2027+ vCoefficients, M);
2028+ return;
2029+#endif
2030+ const ssize_t yStep = isForwardPass ? 1 : -1;
2031+
2032+ const int SIMD_WIDTH = 8;
2033+ ssize_t x = 0;
2034+#ifdef __AVX__
2035+ for ( ; x <= width - SIMD_WIDTH; x += SIMD_WIDTH)
2036+ {
2037+ __m256 prevOut[N];
2038+ ssize_t y = yStart;
2039+ if (isBorder && !isForwardPass)
2040+ {
2041+ float u[N + 1][SIMD_WIDTH]; //[x][channels]
2042+ for (ssize_t i = 0; i < N + 1; ++i)
2043+ {
2044+ __m256 temp;
2045+ _mm256_storeu_ps(u[i], LoadFloats(temp, &in[yStart + i * yStep][x]));
2046+ }
2047+ float backwardsInitialState[N][SIMD_WIDTH];
2048+ calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState);
2049+ for (ssize_t i = 0; i < N; ++i)
2050+ LoadFloats(prevOut[i], backwardsInitialState[i]);
2051+
2052+ StoreFloats(&out[y][x], prevOut[0]);
2053+
2054+ y += yStep;
2055+ if (y == yEnd)
2056+ continue;
2057+ }
2058+ else if (isBorder && isForwardPass)
2059+ {
2060+ // yStart must be 0
2061+ __m256 firstPixel;
2062+ LoadFloats(firstPixel, &in[0][x]);
2063+ for (ssize_t i = 0; i < N; ++i)
2064+ prevOut[i] = firstPixel;
2065+ }
2066+ else
2067+ {
2068+ for (ssize_t i = 0; i < N; ++i)
2069+ {
2070+ LoadFloats(prevOut[i], &out[yStart - (i + 1) * yStep][x]);
2071+ }
2072+ }
2073+
2074+ do
2075+ {
2076+ __m256 vIn;
2077+ LoadFloats(vIn, &in[y][x]);
2078+ __m256 vSum = vIn * vCoefficients[0];
2079+
2080+ vSum = prevOut[0] * vCoefficients[1]
2081+ + prevOut[1] * vCoefficients[2]
2082+ + prevOut[2] * vCoefficients[3]
2083+ + vSum;
2084+
2085+ StoreFloats(&out[y][x], vSum);
2086+
2087+ prevOut[2] = prevOut[1];
2088+ prevOut[1] = prevOut[0];
2089+ prevOut[0] = vSum;
2090+ y += yStep;
2091+ } while (isForwardPass ? (y < yEnd) : (y > yEnd));
2092+ }
2093+#endif
2094+ {
2095+ const ssize_t SIMD_WIDTH = 4;
2096+ for (; x < width; x += SIMD_WIDTH)
2097+ {
2098+ __m128 prevOut[N];
2099+ ssize_t y = yStart;
2100+ if (isBorder && !isForwardPass)
2101+ {
2102+ float u[N + 1][SIMD_WIDTH]; //[x][channels]
2103+ for (ssize_t i = 0; i < N + 1; ++i)
2104+ {
2105+ __m128 temp;
2106+ _mm_storeu_ps(u[i], LoadFloats(temp, &in[yStart + i * yStep][x]));
2107+ }
2108+ float backwardsInitialState[N][SIMD_WIDTH];
2109+ calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState);
2110+ for (ssize_t i = 0; i < N; ++i)
2111+ LoadFloats(prevOut[i], backwardsInitialState[i]);
2112+
2113+ StoreFloats<true>(&out[y][x], prevOut[0], min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores
2114+
2115+ y += yStep;
2116+ if (y == yEnd)
2117+ continue;
2118+ }
2119+ else if (isBorder && isForwardPass)
2120+ {
2121+ // yStart must be 0
2122+ __m128 firstPixel;
2123+ LoadFloats(firstPixel, &in[0][x]);
2124+ for (ssize_t i = 0; i < N; ++i)
2125+ prevOut[i] = firstPixel;
2126+ }
2127+ else
2128+ {
2129+ for (ssize_t i = 0; i < N; ++i)
2130+ {
2131+ LoadFloats(prevOut[i], &out[yStart - (i + 1) * yStep][x]);
2132+ }
2133+ }
2134+
2135+ do
2136+ {
2137+ __m128 vIn;
2138+ LoadFloats(vIn, &in[y][x]);
2139+ __m128 vSum = vIn * Cast256To128(vCoefficients[0]);
2140+
2141+ vSum = prevOut[0] * Cast256To128(vCoefficients[1])
2142+ + prevOut[1] * Cast256To128(vCoefficients[2])
2143+ + prevOut[2] * Cast256To128(vCoefficients[3])
2144+ + vSum;
2145+
2146+ StoreFloats<true>(&out[y][x], vSum, min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores
2147+
2148+ prevOut[2] = prevOut[1];
2149+ prevOut[1] = prevOut[0];
2150+ prevOut[0] = vSum;
2151+ y += yStep;
2152+ } while (isForwardPass ? (y < yEnd) : (y > yEnd));
2153+ }
2154+ }
2155+}
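The tail of the column loop uses StoreFloats<true>(dst, v, count) so it never writes past the end of a row. A sketch of what such a partial store boils down to (an assumption about the helper, not part of the patch): a scalar copy of the first `count` lanes.

static FORCE_INLINE void PartialStoreFloats_Sketch(float *dst, __m128 v, ssize_t count)
{
    float temp[4];
    _mm_storeu_ps(temp, v);
    for (ssize_t i = 0; i < count; ++i)   // only the first `count` lanes are written
        dst[i] = temp[i];
}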
2156+
2157+// input is always untransposed
2158+// for reverse pass, input is output from forward pass
2159+// for transposed output, in-place operation isn't possible
2160+// hack: GCC fails to compile when FORCE_INLINE is enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target specified in #pragma, creating 2 incompatible functions that can't be inlined
2161+template <bool isForwardPass, bool isBorder, typename OutType, typename InType, typename SIMD_Type>
2162+static /*FORCE_INLINE*/ void Convolve1DVertical(SimpleImage<OutType> out,
2163+ SimpleImage<InType> in,
2164+ double *borderValues,
2165+ ssize_t yStart, ssize_t yEnd, ssize_t width, ssize_t height,
2166+ SIMD_Type *vCoefficients, double M[N * N])
2167+{
2168+#if 0
2169+ Convolve1DVerticalRef<isForwardPass, isBorder>(out,
2170+ in,
2171+ borderValues,
2172+ yStart, yEnd, width, height,
2173+ vCoefficients, M);
2174+ return;
2175+#endif
2176+
2177+ const ssize_t yStep = isForwardPass ? 1 : -1,
2178+ SIMD_WIDTH = 4;
2179+ ssize_t x = 0;
2180+#ifdef __AVX__
2181+ for ( ; x <= width - SIMD_WIDTH; x += SIMD_WIDTH)
2182+ {
2183+ __m256d prevOut[N];
2184+ ssize_t y = yStart;
2185+
2186+ if (isBorder && !isForwardPass)
2187+ {
2188+ // condition: yStart must be height - 1
2189+ double u[N + 1][SIMD_WIDTH]; //[x][channels]
2190+ for (ssize_t i = 0; i < N + 1; ++i)
2191+ {
2192+ __m256d temp;
2193+ _mm256_storeu_pd(u[i], LoadDoubles(temp, &in[yStart + i * yStep][x]));
2194+ }
2195+ double backwardsInitialState[N][SIMD_WIDTH];
2196+ calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState);
2197+ for (ssize_t i = 0; i < N; ++i)
2198+ LoadDoubles(prevOut[i], backwardsInitialState[i]);
2199+
2200+ StoreDoubles(&out[y][x], prevOut[0]);
2201+
2202+ y += yStep;
2203+ if (y == yEnd)
2204+ continue;
2205+ }
2206+ else if (isBorder && isForwardPass)
2207+ {
2208+ // condition: yStart must be 0
2209+ __m256d firstPixel;
2210+ LoadDoubles(firstPixel, &in[0][x]);
2211+ for (ssize_t i = 0; i < N; ++i)
2212+ prevOut[i] = firstPixel;
2213+ }
2214+ else
2215+ {
2216+ for (ssize_t i = 0; i < N; ++i)
2217+ LoadDoubles(prevOut[i], &out[yStart - (i + 1) * yStep][x]);
2218+ }
2219+
2220+ do
2221+ {
2222+ __m256d vIn;
2223+ LoadDoubles(vIn, &in[y][x]);
2224+ __m256d vSum = vIn * vCoefficients[0];
2225+
2226+ vSum = prevOut[0] * vCoefficients[1]
2227+ + prevOut[1] * vCoefficients[2]
2228+ + prevOut[2] * vCoefficients[3]
2229+ + vSum;
2230+
2231+ StoreDoubles(&out[y][x], vSum);
2232+
2233+ prevOut[2] = prevOut[1];
2234+ prevOut[1] = prevOut[0];
2235+ prevOut[0] = vSum;
2236+ y += yStep;
2237+ } while (y != yEnd);
2238+ }
2239+#endif
2240+ {
2241+ const ssize_t SIMD_WIDTH = 2;
2242+ for (; x < width; x += SIMD_WIDTH)
2243+ {
2244+ __m128d prevOut[N];
2245+ ssize_t y = yStart;
2246+
2247+ if (isBorder && !isForwardPass)
2248+ {
2249+ // condition: yStart must be height - 1
2250+ double u[N + 1][SIMD_WIDTH]; //[x][channels]
2251+ for (ssize_t i = 0; i < N + 1; ++i)
2252+ {
2253+ __m128d temp;
2254+ _mm_storeu_pd(u[i], LoadDoubles(temp, &in[yStart + i * yStep][x]));
2255+ }
2256+ double backwardsInitialState[N][SIMD_WIDTH];
2257+ calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState);
2258+ for (ssize_t i = 0; i < N; ++i)
2259+ LoadDoubles(prevOut[i], backwardsInitialState[i]);
2260+
2261+ StoreDoubles<true>(&out[y][x], prevOut[0], min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores
2262+
2263+ y += yStep;
2264+ if (y == yEnd)
2265+ continue;
2266+ }
2267+ else if (isBorder && isForwardPass)
2268+ {
2269+ // condition: yStart must be 0
2270+ __m128d firstPixel;
2271+ LoadDoubles(firstPixel, &in[0][x]);
2272+ for (ssize_t i = 0; i < N; ++i)
2273+ prevOut[i] = firstPixel;
2274+ }
2275+ else
2276+ {
2277+ for (ssize_t i = 0; i < N; ++i)
2278+ LoadDoubles(prevOut[i], &out[yStart - (i + 1) * yStep][x]);
2279+ }
2280+
2281+ do
2282+ {
2283+ __m128d vIn;
2284+ LoadDoubles(vIn, &in[y][x]);
2285+ __m128d vSum = vIn * Cast256To128(vCoefficients[0]);
2286+
2287+ vSum = prevOut[0] * Cast256To128(vCoefficients[1])
2288+ + prevOut[1] * Cast256To128(vCoefficients[2])
2289+ + prevOut[2] * Cast256To128(vCoefficients[3])
2290+ + vSum;
2291+
2292+ StoreDoubles<true>(&out[y][x], vSum, min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores
2293+
2294+ prevOut[2] = prevOut[1];
2295+ prevOut[1] = prevOut[0];
2296+ prevOut[0] = vSum;
2297+ y += yStep;
2298+ } while (y != yEnd);
2299+ }
2300+ }
2301+}
2302+
2303+// hack: GCC fails to compile when FORCE_INLINE is enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target specified in #pragma, creating 2 incompatible functions that can't be inlined
2304+template <bool transposeOut, int channels>
2305+static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint8_t> out, SimpleImage<float> in, ssize_t width, ssize_t height)
2306+{
2307+ ssize_t y = 0;
2308+ do
2309+ {
2310+ ssize_t x = 0;
2311+ if (channels == 4)
2312+ {
2313+ #ifdef __AVX2__
2314+ for (; x <= width - 2; x += 2)
2315+ {
2316+ __m256i vInt = _mm256_cvtps_epi32(_mm256_loadu_ps(&in[y][x * channels]));
2317+ __m256i vInt2 = _mm256_permute2x128_si256(vInt, vInt, 1);
2318+
2319+ __m128i u16 = _mm256_castsi256_si128(_mm256_packus_epi32(vInt, vInt2));
2320+ __m128i vRGBA = _mm_packus_epi16(u16, u16);
2321+
2322+ if (transposeOut)
2323+ {
2324+ *(uint32_t *)&out[x][y * channels] = _mm_extract_epi32(vRGBA, 0);
2325+ *(uint32_t *)&out[x + 1][y * channels] = _mm_extract_epi32(vRGBA, 1);
2326+ }
2327+ else
2328+ {
2329+ _mm_storel_epi64((__m128i *)&out[y][x * channels], vRGBA);
2330+ }
2331+ }
2332+ #endif
2333+ while (x < width)
2334+ {
2335+ __m128 data = _mm_loadu_ps(&in[y][x * channels]);
2336+ if (transposeOut)
2337+ StoreFloats(&out[x][y * channels], data);
2338+ else
2339+ StoreFloats(&out[y][x * channels], data);
2340+ ++x;
2341+ }
2342+ }
2343+ else if (channels == 1)
2344+ {
2345+#ifdef __AVX__
2346+ for (; x <= width - 8; x += 8)
2347+ {
2348+ StoreFloats(&out[y][x], _mm256_loadu_ps(&in[y][x]));
2349+ }
2350+ if (x < width)
2351+ StoreFloats<true>(&out[y][x], _mm256_loadu_ps(&in[y][x]), width - x);
2352+#else
2353+ for (; x <= width - 4; x += 4)
2354+ {
2355+ StoreFloats(&out[y][x], _mm_loadu_ps(&in[y][x]));
2356+ }
2357+ if (x < width)
2358+ StoreFloats<true>(&out[y][x], _mm_loadu_ps(&in[y][x]), width - x);
2359+#endif
2360+ }
2361+ ++y;
2362+ } while (y < height);
2363+}
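For reference, the scalar conversion that the packed path above implements, assuming <cmath> is available (illustration only, not part of the patch): round to nearest like _mm256_cvtps_epi32, then saturate to the uint8_t range like the packus instructions.

static inline uint8_t RoundClampToU8_Sketch(float v)
{
    int i = (int)lrintf(v);                              // round to nearest
    return (uint8_t)(i < 0 ? 0 : (i > 255 ? 255 : i));   // saturate to [0, 255]
}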
2364+
2365+// hack: GCC fails to compile when FORCE_INLINE is enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target specified in #pragma, creating 2 incompatible functions that can't be inlined
2366+template <bool transposeOut, int channels>
2367+static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint16_t> out, SimpleImage<float> in, ssize_t width, ssize_t height)
2368+{
2369+ ssize_t y = 0;
2370+ do
2371+ {
2372+ ssize_t x = 0;
2373+ if (channels == 4)
2374+ {
2375+ #ifdef __AVX2__
2376+ for (; x <= width - 2; x += 2)
2377+ {
2378+ __m256i vInt = _mm256_cvtps_epi32(_mm256_loadu_ps(&in[y][x * channels]));
2379+ __m256i vInt2 = _mm256_permute2x128_si256(vInt, vInt, 1);
2380+
2381+ __m128i vRGBA = _mm256_castsi256_si128(_mm256_packus_epi32(vInt, vInt2));
2382+
2383+ if (transposeOut)
2384+ {
2385+ _mm_storel_epi64((__m128i *)&out[x][y * channels], vRGBA);
2386+ _mm_storel_epi64((__m128i *)&out[x + 1][y * channels], _mm_shuffle_epi32(vRGBA, _MM_SHUFFLE(0, 0, 3, 2)));
2387+ }
2388+ else
2389+ {
2390+ _mm_storeu_si128((__m128i *)&out[y][x * channels], vRGBA);
2391+ }
2392+ }
2393+ #endif
2394+ while (x < width)
2395+ {
2396+ __m128 data = _mm_loadu_ps(&in[y][x * channels]);
2397+ if (transposeOut)
2398+ StoreFloats(&out[x][y * channels], data);
2399+ else
2400+ StoreFloats(&out[y][x * channels], data);
2401+ ++x;
2402+ }
2403+ }
2404+ else if (channels == 1)
2405+ {
2406+ #ifdef __AVX__
2407+ for (; x <= width - 8; x += 8)
2408+ {
2409+ StoreFloats(&out[y][x], _mm256_loadu_ps(&in[y][x]));
2410+ }
2411+ if (x < width)
2412+ StoreFloats<true>(&out[y][x], _mm256_loadu_ps(&in[y][x]), width - x);
2413+ #else
2414+ for (; x <= width - 4; x += 4)
2415+ {
2416+ StoreFloats(&out[y][x], _mm_loadu_ps(&in[y][x]));
2417+ }
2418+ if (x < width)
2419+ StoreFloats<true>(&out[y][x], _mm_loadu_ps(&in[y][x]), width - x);
2420+ #endif
2421+ }
2422+ ++y;
2423+ } while (y < height);
2424+}
2425+
2426+// hack: GCC fails to compile when FORCE_INLINE is enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target specified in #pragma, creating 2 incompatible functions that can't be inlined
2427+template <bool transposeOut, int channels>
2428+static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint16_t> out, SimpleImage<double> in, ssize_t width, ssize_t height)
2429+{
2430+ ssize_t y = 0;
2431+ do
2432+ {
2433+ ssize_t x = 0;
2434+ if (channels == 4)
2435+ {
2436+ for ( ; x < width; x++)
2437+ {
2438+ #ifdef __AVX__
2439+ __m128i i32 = _mm256_cvtpd_epi32(_mm256_loadu_pd(&in[y][x * channels]));
2440+ #else
2441+ __m128d in0 = _mm_load_pd(&in[y][x * channels]),
2442+ in1 = _mm_load_pd(&in[y][x * channels + 2]);
2443+ __m128i i32 = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(_mm_cvtpd_epi32(in0)), _mm_castsi128_ps(_mm_cvtpd_epi32(in1)), _MM_SHUFFLE(1, 0, 1, 0)));
2444+ #endif
2445+ #ifdef __SSE4_1__
2446+ __m128i vRGBA = _mm_packus_epi32(i32, i32);
2447+ #else
2448+ __m128i vRGBA = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()); // hack: can get away with i16 for now
2449+ #endif
2450+
2451+ if (transposeOut)
2452+ {
2453+ _mm_storel_epi64((__m128i *)&out[x][y * channels], vRGBA);
2454+ }
2455+ else
2456+ {
2457+ _mm_storel_epi64((__m128i *)&out[y][x * channels], vRGBA);
2458+ }
2459+ }
2460+ }
2461+ else if (channels == 1)
2462+ {
2463+ #ifdef __AVX__
2464+ for (; x <= width - 4; x += 4)
2465+ {
2466+ StoreDoubles(&out[y][x], _mm256_loadu_pd(&in[y][x]));
2467+ }
2468+ if (x < width)
2469+ {
2470+ StoreDoubles<true>(&out[y][x], _mm256_loadu_pd(&in[y][x]), width - x);
2471+ }
2472+ #else
2473+ for (; x <= width - 2; x += 2)
2474+ {
2475+ StoreDoubles(&out[y][x], _mm_loadu_pd(&in[y][x]));
2476+ }
2477+ if (x < width)
2478+ {
2479+ StoreDoubles<true>(&out[y][x], _mm_loadu_pd(&in[y][x]), width - x);
2480+ }
2481+ #endif
2482+ }
2483+ ++y;
2484+ } while (y < height);
2485+}
2486+
2487+template <bool transposeOut, int channels>
2488+FORCE_INLINE void Copy2D(SimpleImage<float> out, SimpleImage<double> in, ssize_t width, ssize_t height)
2489+{
2490+ ssize_t y = 0;
2491+ do
2492+ {
2493+ if (channels == 4)
2494+ {
2495+ ssize_t x = 0;
2496+ do
2497+ {
2498+ #ifdef __AVX__
2499+ __m128 v4f_data = _mm256_cvtpd_ps(_mm256_loadu_pd(&in[y][x * channels]));
2500+ #else
2501+ __m128 v4f_data = _mm_shuffle_ps(_mm_cvtpd_ps(_mm_loadu_pd(&in[y][x * channels])),
2502+ _mm_cvtpd_ps(_mm_loadu_pd(&in[y][x * channels + 2])), _MM_SHUFFLE(1, 0, 1, 0));
2503+ #endif
2504+ if (transposeOut)
2505+ _mm_store_ps(&out[x][y * channels], v4f_data);
2506+ else
2507+ _mm_store_ps(&out[y][x * channels], v4f_data);
2508+ ++x;
2509+ } while (x < width);
2510+ }
2511+ else
2512+ {
2513+ // 1 channel
2514+ ssize_t x;
2515+ #ifdef __AVX__
2516+ for (x = 0; x <= width - 4; x += 4)
2517+ StoreDoubles(&out[y][x], _mm256_loadu_pd(&in[y][x]));
2518+ if (x < width)
2519+ StoreDoubles<true>(&out[y][x], _mm256_loadu_pd(&in[y][x]), width - x);
2520+ #else
2521+ for (x = 0; x <= width - 2; x += 2)
2522+ StoreDoubles(&out[y][x], _mm_loadu_pd(&in[y][x]));
2523+ if (x < width)
2524+ StoreDoubles<true>(&out[y][x], _mm_loadu_pd(&in[y][x]), width - x);
2525+ #endif
2526+ }
2527+ ++y;
2528+ } while (y < height);
2529+}
2530+
2531+// hack: GCC fails to compile when FORCE_INLINE is enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target specified in #pragma, creating 2 incompatible functions that can't be inlined
2532+template <bool transposeOut, int channels>
2533+static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint8_t> out, SimpleImage <double> in, ssize_t width, ssize_t height)
2534+{
2535+ ssize_t y = 0;
2536+ do
2537+ {
2538+ if (channels == 4)
2539+ {
2540+ ssize_t x = 0;
2541+ do
2542+ {
2543+ #ifdef __AVX__
2544+ __m256d _in = _mm256_load_pd(&in[y][x * channels]);
2545+ __m128i i32 = _mm256_cvtpd_epi32(_in),
2546+ #else
2547+ __m128d in0 = _mm_load_pd(&in[y][x * channels]),
2548+ in1 = _mm_load_pd(&in[y][x * channels + 2]);
2549+ __m128i i32 = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(_mm_cvtpd_epi32(in0)), _mm_castsi128_ps(_mm_cvtpd_epi32(in1)), _MM_SHUFFLE(1, 0, 1, 0))),
2550+ #endif
2551+ #ifdef __SSE4_1__
2552+ u16 = _mm_packus_epi32(i32, i32),
2553+ #else
2554+ u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()),
2555+ #endif
2556+ u8 = _mm_packus_epi16(u16, u16);
2557+ if (transposeOut)
2558+ *(int32_t *)&out[x][y * channels] = _mm_cvtsi128_si32(u8);
2559+ else
2560+ *(int32_t *)&out[y][x * channels] = _mm_cvtsi128_si32(u8);
2561+ ++x;
2562+ } while (x < width);
2563+ }
2564+ else
2565+ {
2566+ // 1 channel
2567+ ssize_t x;
2568+ #ifdef __AVX__
2569+ for (x = 0; x <= width - 4; x += 4)
2570+ StoreDoubles(&out[y][x], _mm256_load_pd(&in[y][x * channels]));
2571+ if (x < width)
2572+ StoreDoubles<true>(&out[y][x], _mm256_load_pd(&in[y][x * channels]), width - x);
2573+ #else
2574+ for (x = 0; x <= width - 2; x += 2)
2575+ StoreDoubles(&out[y][x], _mm_load_pd(&in[y][x * channels]));
2576+ if (x < width)
2577+ StoreDoubles<true>(&out[y][x], _mm_load_pd(&in[y][x * channels]), width - x);
2578+ #endif
2579+ }
2580+ ++y;
2581+ } while (y < height);
2582+}
2583+
2584+#if 0 // keep this generic scalar fallback disabled so that any type combination without a vectorized Copy2D overload fails to compile instead of silently running scalar code
2585+template <bool transposeOut, int channels, typename OutType, typename InType>
2586+FORCE_INLINE void Copy2D(SimpleImage<OutType> out, SimpleImage <InType> in, ssize_t width, ssize_t height)
2587+{
2588+ ssize_t y = 0;
2589+ do
2590+ {
2591+ ssize_t x = 0;
2592+ do
2593+ {
2594+ ssize_t c = 0;
2595+ do
2596+ {
2597+ if (transposeOut)
2598+ out[x][y * channels + c] = clip_round_cast<OutType, InType>(in[y][x * channels + c]);
2599+ else
2600+ out[y][x * channels + c] = clip_round_cast<OutType, InType>(in[y][x * channels + c]);
2601+ ++c;
2602+ } while (c < channels);
2603+ ++x;
2604+ } while (x < width);
2605+ ++y;
2606+ } while (y < height);
2607+}
2608+#endif
2609+
2610+template <typename AnyType>
2611+void StoreFloatsTransposed(SimpleImage<AnyType> out, __m128 x)
2612+{
2613+ out[0][0] = _mm_cvtss_f32(x);
2614+ out[1][0] = _mm_cvtss_f32(_mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 0, 0, 1)));
2615+ out[2][0] = _mm_cvtss_f32(_mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 0, 0, 2)));
2616+ out[3][0] = _mm_cvtss_f32(_mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 0, 0, 3)));
2617+}
2618+
2619+// input & output are color interleaved
2620+template <bool transposeOut, int channels, typename IntermediateType, typename OutType, typename InType>
2621+void ConvolveHorizontal(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput = false)
2622+{
2623+ const bool convertOutput = typeid(OutType) != typeid(IntermediateType);
2624+
2625+ typedef typename MyTraits<IntermediateType>::SIMDtype SIMDtype;
2626+
2627+ double bf[N];
2628+ double M[N*N]; // matrix used for initialization procedure (has to be double)
2629+ double b[N + 1];
2630+
2631+ calcFilter(sigmaX, bf);
2632+
2633+ for (size_t i = 0; i<N; i++)
2634+ bf[i] = -bf[i];
2635+
2636+ b[0] = 1; // b[0] == alpha (scaling coefficient)
2637+
2638+ for (size_t i = 0; i<N; i++)
2639+ {
2640+ b[i + 1] = bf[i];
2641+ b[0] -= b[i + 1];
2642+ }
2643+
2644+ calcTriggsSdikaM(bf, M); // Compute initialization matrix
2645+
2646+ SIMDtype *vCoefficients;
2647+
2648+ if (channels == 4)
2649+ {
2650+ vCoefficients = (SIMDtype *)ALIGNED_ALLOCA(sizeof(SIMDtype) * (N + 1), sizeof(SIMDtype));
2651+ for (ssize_t i = 0; i <= N; ++i)
2652+ BroadcastSIMD(vCoefficients[i], (IntermediateType)b[i]);
2653+ }
2654+ else
2655+ {
2656+ if (typeid(IntermediateType) == typeid(double))
2657+ {
2658+ #ifdef __AVX2__
2659+ __m256d *_vCoefficients = (__m256d *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 2 * (N + 1), sizeof(SIMDtype));
2660+ vCoefficients = (SIMDtype *)_vCoefficients;
2661+
2662+ __m256d temp = _mm256_loadu_pd(b);
2663+ _vCoefficients[0] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(1, 2, 3, 0));
2664+ _vCoefficients[1] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(2, 3, 0, 1));
2665+ _vCoefficients[2] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(3, 0, 1, 2));
2666+ _vCoefficients[3] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(0, 1, 2, 3));
2667+
2668+ // permutations for backward pass
2669+ _vCoefficients[4] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(0, 3, 2, 1));
2670+ _vCoefficients[5] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(1, 0, 3, 2));
2671+ _vCoefficients[6] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(2, 1, 0, 3));
2672+ _vCoefficients[7] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(3, 2, 1, 0));
2673+ #else
2674+ double *coefficients = (double *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 4 * (N + 1), sizeof(SIMDtype));
2675+ vCoefficients = (SIMDtype *)coefficients;
2676+
2677+ coefficients[0] = b[0];
2678+ coefficients[1] = b[3];
2679+ coefficients[2] = b[2];
2680+ coefficients[3] = b[1];
2681+
2682+ coefficients[4] = b[1];
2683+ coefficients[5] = b[0];
2684+ coefficients[6] = b[3];
2685+ coefficients[7] = b[2];
2686+
2687+ coefficients[8] = b[2];
2688+ coefficients[9] = b[1];
2689+ coefficients[10] = b[0];
2690+ coefficients[11] = b[3];
2691+
2692+ coefficients[12] = b[3];
2693+ coefficients[13] = b[2];
2694+ coefficients[14] = b[1];
2695+ coefficients[15] = b[0];
2696+
2697+ // permutations for backward pass
2698+ coefficients[16] = b[1];
2699+ coefficients[17] = b[2];
2700+ coefficients[18] = b[3];
2701+ coefficients[19] = b[0];
2702+
2703+ coefficients[20] = b[0];
2704+ coefficients[21] = b[1];
2705+ coefficients[22] = b[2];
2706+ coefficients[23] = b[3];
2707+
2708+ coefficients[24] = b[3];
2709+ coefficients[25] = b[0];
2710+ coefficients[26] = b[1];
2711+ coefficients[27] = b[2];
2712+
2713+ coefficients[28] = b[2];
2714+ coefficients[29] = b[3];
2715+ coefficients[30] = b[0];
2716+ coefficients[31] = b[1];
2717+ #endif
2718+ }
2719+ else
2720+ {
2721+ #ifdef __AVX__
2722+ __m256 *_vCoefficients = (__m256 *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 2 * (N + 1), sizeof(SIMDtype));
2723+ vCoefficients = (SIMDtype *)_vCoefficients;
2724+
2725+ __m256 temp = _mm256_castps128_ps256(_mm256_cvtpd_ps(_mm256_loadu_pd(b)));
2726+ temp = _mm256_permute2f128_ps(temp, temp, 0);
2727+ _vCoefficients[0] = _mm256_permute_ps(temp, _MM_SHUFFLE(1, 2, 3, 0));
2728+ _vCoefficients[1] = _mm256_permute_ps(temp, _MM_SHUFFLE(2, 3, 0, 1));
2729+ _vCoefficients[2] = _mm256_permute_ps(temp, _MM_SHUFFLE(3, 0, 1, 2));
2730+ _vCoefficients[3] = _mm256_permute_ps(temp, _MM_SHUFFLE(0, 1, 2, 3));
2731+
2732+ // permutations for backward pass
2733+ _vCoefficients[4] = _mm256_permute_ps(temp, _MM_SHUFFLE(0, 3, 2, 1));
2734+ _vCoefficients[5] = _mm256_permute_ps(temp, _MM_SHUFFLE(1, 0, 3, 2));
2735+ _vCoefficients[6] = _mm256_permute_ps(temp, _MM_SHUFFLE(2, 1, 0, 3));
2736+ _vCoefficients[7] = _mm256_permute_ps(temp, _MM_SHUFFLE(3, 2, 1, 0));
2737+ #else
2738+ __m128 *_vCoefficients = (__m128 *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 2 * (N + 1), sizeof(SIMDtype));
2739+ vCoefficients = (SIMDtype *)_vCoefficients;
2740+
2741+ __m128 temp = _mm_shuffle_ps
2742+ (
2743+ _mm_cvtpd_ps(_mm_loadu_pd(b)),
2744+ _mm_cvtpd_ps(_mm_loadu_pd(&b[2])),
2745+ _MM_SHUFFLE(1, 0, 1, 0)
2746+ );
2747+ _vCoefficients[0] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(1, 2, 3, 0));
2748+ _vCoefficients[1] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(2, 3, 0, 1));
2749+ _vCoefficients[2] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(3, 0, 1, 2));
2750+ _vCoefficients[3] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(0, 1, 2, 3));
2751+
2752+ // permutations for backward pass
2753+ _vCoefficients[4] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(0, 3, 2, 1));
2754+ _vCoefficients[5] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(1, 0, 3, 2));
2755+ _vCoefficients[6] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(2, 1, 0, 3));
2756+ _vCoefficients[7] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(3, 2, 1, 0));
2757+ #endif
2758+ }
2759+ }
2760+ const ssize_t Y_BLOCK_SIZE = 8;
2761+
2762+ // X_BLOCK_SIZE * channels * sizeof(InType) had better be SIMD aligned
2763+ const ssize_t X_BLOCK_SIZE = transposeOut ? 8
2764+ : INT32_MAX / 2;
2765+
2766+ #pragma omp parallel
2767+ {
2768+ AlignedImage<IntermediateType, sizeof(SIMDtype)> forwardFilteredTemp; // TODO: can directly output to output buffer if !transposeOut & output type is float
2769+ SimpleImage<IntermediateType> forwardFiltered;
2770+ if (!convertOutput && !transposeOut)
2771+ {
2772+ forwardFiltered = SimpleImage<IntermediateType>((IntermediateType *)out.buffer, out.pitch);
2773+ }
2774+ /*
2775+ else if (canOverwriteInput && typeid(InType) == typeid(IntermediateType))
2776+ {
2777+ forwardFiltered = SimpleImage<IntermediateType>((IntermediateType *)in.buffer, in.pitch);
2778+ }*/
2779+ else
2780+ {
2781+ forwardFilteredTemp.Resize(width * channels, Y_BLOCK_SIZE);
2782+ forwardFiltered = forwardFilteredTemp;
2783+ }
2784+
2785+ IntermediateType *borderValues = (IntermediateType *)alloca(channels * Y_BLOCK_SIZE * sizeof(IntermediateType));
2786+
2787+ #pragma omp for
2788+ for (ssize_t y0 = 0; y0 < height; y0 += Y_BLOCK_SIZE)
2789+ {
2790+ ssize_t x = 0;
2791+ ssize_t yBlockSize = min(height - y0, Y_BLOCK_SIZE);
2792+
2793+ ssize_t i = 0;
2794+ do
2795+ {
2796+ ssize_t color = 0;
2797+ do
2798+ {
2799+ borderValues[i * channels + color] = in[y0 + i][(width - 1) * channels + color];
2800+ ++color;
2801+ } while (color < channels);
2802+ ++i;
2803+ } while (i < yBlockSize);
2804+
2805+ ssize_t xBlockSize = min(max(X_BLOCK_SIZE, ssize_t(N)), width); // try to process at least X_BLOCK_SIZE or else later, data won't be SIMD aligned
2806+ // convolve pixels[0:FILTER_SIZE - 1]
2807+ Convolve1DHorizontal<false, true, true, channels>(forwardFiltered,
2808+ in.SubImage(0, y0),
2809+ (IntermediateType *)NULL,
2810+ x, x + xBlockSize, width, yBlockSize,
2811+ vCoefficients,
2812+ M);
2813+
2814+ x += xBlockSize;
2815+ while (x < width)
2816+ {
2817+ xBlockSize = min(width - x, X_BLOCK_SIZE);
2818+
2819+ Convolve1DHorizontal<false, true, false, channels>(forwardFiltered,
2820+ in.SubImage(0, y0),
2821+ (IntermediateType *)NULL,
2822+ x, x + xBlockSize, width, yBlockSize,
2823+ vCoefficients,
2824+ M);
2825+ x += xBlockSize;
2826+ }
2827+
2828+ //--------------- backward pass--------------------------
2829+ SimpleImage <IntermediateType> floatOut;
2830+ // if output type is fixed point, we still compute an intermediate result as float for better precision
2831+ if (convertOutput)
2832+ {
2833+ floatOut = forwardFiltered;
2834+ }
2835+ else
2836+ {
2837+ floatOut = SimpleImage<IntermediateType>((IntermediateType *)&out[transposeOut ? 0 : y0][(transposeOut ? y0 : 0) * channels], out.pitch);
2838+ }
2839+ x = width - 1;
2840+
2841+ ssize_t lastAligned = RoundDown(width, X_BLOCK_SIZE);
2842+            // todo: check whether this is really vector aligned
2843+ xBlockSize = min(max(width - lastAligned, ssize_t(N)), width); // try to process more than N pixels so that later, data is SIMD aligned
2844+
2845+ // in-place operation (use forwardFiltered as both input & output) is possible due to internal register buffering
2846+ if (transposeOut && !convertOutput)
2847+ Convolve1DHorizontal<true, false, true, channels>(floatOut,
2848+ forwardFiltered,
2849+ borderValues,
2850+ x, x - xBlockSize, width, yBlockSize,
2851+ vCoefficients,
2852+ M);
2853+ else
2854+ Convolve1DHorizontal<false, false, true, channels>(floatOut,
2855+ forwardFiltered,
2856+ borderValues,
2857+ x, x - xBlockSize, width, yBlockSize,
2858+ vCoefficients,
2859+ M);
2860+
2861+ if (convertOutput)
2862+ {
2863+ ssize_t outCornerX = x + 1 - xBlockSize;
2864+ Copy2D<transposeOut, channels>(out.SubImage((transposeOut ? y0 : outCornerX) * channels, transposeOut ? outCornerX : y0), floatOut.SubImage(outCornerX * channels, 0), xBlockSize, yBlockSize);
2865+ }
2866+ x -= xBlockSize;
2867+ while (x >= 0)
2868+ {
2869+ xBlockSize = min(X_BLOCK_SIZE, x + 1);
2870+
2871+ if (transposeOut && !convertOutput)
2872+ Convolve1DHorizontal<true, false, false, channels>(floatOut,
2873+ forwardFiltered,
2874+ borderValues,
2875+ x, x - xBlockSize, width, yBlockSize,
2876+ vCoefficients,
2877+ M);
2878+ else
2879+ Convolve1DHorizontal<false, false, false, channels>(floatOut,
2880+ forwardFiltered,
2881+ borderValues,
2882+ x, x - xBlockSize, width, yBlockSize,
2883+ vCoefficients,
2884+ M);
2885+
2886+ if (convertOutput)
2887+ {
2888+ ssize_t outCornerX = x + 1 - xBlockSize;
2889+ Copy2D<transposeOut, channels>(out.SubImage((transposeOut ? y0 : outCornerX) * channels, transposeOut ? outCornerX : y0), floatOut.SubImage(outCornerX * channels, 0), xBlockSize, yBlockSize);
2890+ }
2891+ x -= xBlockSize;
2892+ }
2893+ }
2894+ }
2895+}
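For readers new to the recursive formulation: each row is filtered with a short feedback recurrence, once left-to-right and once right-to-left, which is why the function above is structured as a forward pass followed by a backward pass. A minimal scalar sketch of that recurrence, using the b[] coefficients computed above (the Triggs-Sdika boundary matrix M, the blocking, and the transposed output are all omitted; the helper name is hypothetical):

    // order-3 recursive (IIR) Gaussian pass, scalar sketch only
    static void RecursiveBlurRowScalar(float *row, ssize_t width, const double b[4])
    {
        // forward (causal) pass: y[n] = b0*x[n] + b1*y[n-1] + b2*y[n-2] + b3*y[n-3]
        for (ssize_t n = 3; n < width; ++n)
            row[n] = b[0] * row[n] + b[1] * row[n - 1] + b[2] * row[n - 2] + b[3] * row[n - 3];
        // backward (anti-causal) pass: same recurrence, run right to left
        for (ssize_t n = width - 4; n >= 0; --n)
            row[n] = b[0] * row[n] + b[1] * row[n + 1] + b[2] * row[n + 2] + b[3] * row[n + 3];
    }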
2896+
2897+template <bool transposeOut, int channels, typename OutType, typename InType>
2898+void ConvolveHorizontal(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput = false)
2899+{
2900+ if (sigmaX > MAX_SIZE_FOR_SINGLE_PRECISION)
2901+ ConvolveHorizontal<transposeOut, channels, double, OutType, InType>(out, in, width, height, sigmaX, canOverwriteInput);
2902+ else
2903+ ConvolveHorizontal<transposeOut, channels, float, OutType, InType>(out, in, width, height, sigmaX, canOverwriteInput);
2904+}
2905+
2906+// handles blocking
2907+// input & output are color interleaved
2908+template <typename IntermediateType, typename OutType, typename InType>
2909+void ConvolveVertical(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, double sigmaY)
2910+{
2911+ const bool convertOutput = typeid(OutType) != typeid(IntermediateType);
2912+
2913+ const ssize_t Y_BLOCK_SIZE = 8,
2914+                  X_BLOCK_SIZE = 40; // must be a multiple of the SIMD width or throughput suffers badly
2915+
2916+ typedef typename MyTraits<IntermediateType>::SIMDtype SIMDtype;
2917+
2918+ double bf[N];
2919+ double M[N*N]; // matrix used for initialization procedure (has to be double)
2920+ double b[N + 1];
2921+
2922+ calcFilter(sigmaY, bf);
2923+
2924+ for (size_t i = 0; i<N; i++)
2925+ bf[i] = -bf[i];
2926+
2927+ b[0] = 1; // b[0] == alpha (scaling coefficient)
2928+
2929+ for (size_t i = 0; i<N; i++)
2930+ {
2931+ b[i + 1] = bf[i];
2932+ b[0] -= b[i + 1];
2933+ }
2934+ b[3] = 1 - (b[0] + b[1] + b[2]);
2935+ // Compute initialization matrix
2936+ calcTriggsSdikaM(bf, M);
2937+
2938+ SIMDtype *vCoefficients = (SIMDtype *)ALIGNED_ALLOCA(sizeof(SIMDtype) * (N + 1), sizeof(SIMDtype));
2939+
2940+ for (ssize_t i = 0; i <= N; ++i)
2941+ {
2942+ BroadcastSIMD(vCoefficients[i], (IntermediateType)b[i]);
2943+ }
2944+
2945+ #pragma omp parallel
2946+ {
2947+ AlignedImage<IntermediateType, sizeof(SIMDtype)> forwardFiltered; // TODO: can directly output to output buffer if output type is float
2948+ forwardFiltered.Resize(X_BLOCK_SIZE, height);
2949+
2950+ IntermediateType *borderValues = (IntermediateType *)alloca(X_BLOCK_SIZE * sizeof(IntermediateType));
2951+
2952+ #pragma omp for
2953+ for (ssize_t x0 = 0; x0 < width; x0 += X_BLOCK_SIZE)
2954+ {
2955+ ssize_t y = 0;
2956+ ssize_t xBlockSize = min(width - x0, X_BLOCK_SIZE);
2957+
2958+ ssize_t i = 0;
2959+ do
2960+ {
2961+ borderValues[i] = in[height - 1][x0 + i];
2962+ ++i;
2963+ } while (i < xBlockSize);
2964+
2965+ ssize_t yBlockSize = min(ssize_t(N), height);
2966+ // convolve pixels[0:filterSize - 1]
2967+ Convolve1DVertical<true, true>(forwardFiltered,
2968+ in.SubImage(x0, 0),
2969+ (IntermediateType *)NULL,
2970+ y, y + yBlockSize, xBlockSize, height,
2971+ vCoefficients,
2972+ M);
2973+
2974+ y += yBlockSize;
2975+ while (y < height)
2976+ {
2977+ yBlockSize = min(height - y, Y_BLOCK_SIZE);
2978+
2979+ Convolve1DVertical<true, false>(forwardFiltered,
2980+ in.SubImage(x0, 0),
2981+ (IntermediateType *)NULL,
2982+ y, y + yBlockSize, xBlockSize, height,
2983+ vCoefficients,
2984+ M);
2985+ y += yBlockSize;
2986+ }
2987+
2988+ //--------------- backward pass--------------------------
2989+ SimpleImage<IntermediateType> floatOut;
2990+ // if output type is fixed point, we still compute an intermediate result as float for better precision
2991+ if (convertOutput)
2992+ {
2993+ floatOut = forwardFiltered;
2994+ }
2995+ else
2996+ {
2997+ floatOut = SimpleImage<IntermediateType>((IntermediateType *)&out[0][x0], out.pitch);
2998+ }
2999+ y = height - 1;
3000+ yBlockSize = min(ssize_t(N), height);
3001+
3002+ // in-place operation (use forwardFiltered as both input & output) is possible due to internal register buffering
3003+ Convolve1DVertical<false, true>(floatOut,
3004+ forwardFiltered,
3005+ borderValues,
3006+ y, y - yBlockSize, xBlockSize, height,
3007+ vCoefficients,
3008+ M);
3009+
3010+ if (convertOutput)
3011+ {
3012+ ssize_t outCornerY = y + 1 - yBlockSize;
3013+ Copy2D<false, 1>(out.SubImage(x0, outCornerY), floatOut.SubImage(0, outCornerY), xBlockSize, yBlockSize);
3014+ }
3015+ y -= yBlockSize;
3016+ while (y >= 0)
3017+ {
3018+ yBlockSize = min(Y_BLOCK_SIZE, y + 1);
3019+
3020+ Convolve1DVertical<false, false>(floatOut,
3021+ forwardFiltered,
3022+ borderValues,
3023+                                          y, y - yBlockSize, xBlockSize, height,
3024+ vCoefficients,
3025+ M);
3026+
3027+ if (convertOutput)
3028+ {
3029+ ssize_t outCornerY = y + 1 - yBlockSize;
3030+ Copy2D<false, 1>(out.SubImage(x0, outCornerY), floatOut.SubImage(0, outCornerY), xBlockSize, yBlockSize);
3031+ }
3032+ y -= yBlockSize;
3033+ }
3034+ }
3035+ }
3036+}
3037+
3038+template <typename OutType, typename InType>
3039+void ConvolveVertical(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaY)
3040+{
3041+ if (sigmaY > MAX_SIZE_FOR_SINGLE_PRECISION)
3042+ ConvolveVertical<double>(out, in, width, height, sigmaY);
3043+ else
3044+ ConvolveVertical<float>(out, in, width, height, sigmaY);
3045+}
3046+
3047+// 2D
3048+template <int channels, typename OutType, typename InType>
3049+void Convolve(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY)
3050+{
3051+ using namespace std::chrono;
3052+ typedef uint16_t HorizontalFilteredType;
3053+ AlignedImage<HorizontalFilteredType, sizeof(__m256)> horizontalFiltered;
3054+
3055+ const bool DO_TIMING = false;
3056+ high_resolution_clock::time_point t0;
3057+ if (DO_TIMING)
3058+ t0 = high_resolution_clock::now();
3059+
3060+    const bool TRANSPOSE = channels != 1; // make both the 1st and 2nd passes transpose their output
3061+
3062+ if (TRANSPOSE)
3063+ {
3064+ horizontalFiltered.Resize(height * channels, width);
3065+ }
3066+ else
3067+ {
3068+ horizontalFiltered.Resize(width * channels, height);
3069+ }
3070+ ConvolveHorizontal<TRANSPOSE, channels>(horizontalFiltered, in, width, height, sigmaX);
3071+
3072+#if 0
3073+ // save intermediate image
3074+ float scale;
3075+ if (typeid(InType) == typeid(uint8_t))
3076+ scale = 1.0f;
3077+ else if (typeid(InType) == typeid(uint16_t))
3078+ scale = 1.0f;
3079+ else
3080+ scale = 1.0f;
3081+    SaveImage("horizontal_filtered.png", horizontalFiltered, TRANSPOSE ? height : width, TRANSPOSE ? width : height, channels, scale);
3082+#endif
3083+
3084+ if (DO_TIMING)
3085+ cout << "Thoriz=" << duration_cast<milliseconds>(high_resolution_clock::now() - t0).count() << " ms" << endl;
3086+
3087+ //---------------------------------------------------
3088+
3089+ if (DO_TIMING)
3090+ t0 = high_resolution_clock::now();
3091+ if (TRANSPOSE)
3092+ {
3093+ ConvolveHorizontal<true, channels>(out, horizontalFiltered, height, width, sigmaY, true);
3094+ }
3095+ else
3096+ {
3097+ ConvolveVertical(out, horizontalFiltered, width * channels, height, sigmaY);
3098+ }
3099+ if (DO_TIMING)
3100+ cout << "Tvert=" << duration_cast<milliseconds>(high_resolution_clock::now() - t0).count() << " ms" << endl;
3101+}
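The RGBA path relies on the transpose trick: instead of a genuinely vertical second pass, the first pass writes its result transposed so that the second pass is again a horizontal, cache-friendly sweep, and the second transposed store restores the original orientation. A rough scalar illustration of that strategy (BlurRowsTransposed is a hypothetical helper, shown here with a simple symmetric FIR kernel rather than the IIR recurrence):

    #include <algorithm>   // std::min / std::max

    template <typename T>
    void BlurRowsTransposed(T *dst, const T *src, ssize_t w, ssize_t h,
                            const float *kernel, int halfSize)
    {
        for (ssize_t y = 0; y < h; ++y)
            for (ssize_t x = 0; x < w; ++x)
            {
                float sum = kernel[0] * src[y * w + x];
                for (int i = 1; i <= halfSize; ++i)
                    sum += kernel[i] * (src[y * w + std::max<ssize_t>(0, x - i)] +
                                        src[y * w + std::min<ssize_t>(w - 1, x + i)]);
                dst[x * h + y] = (T)sum;   // transposed store: dst rows are h elements wide (sketch only, no rounding/clamping)
            }
    }
    // Calling this twice - the second time with w and h swapped - performs the separable
    // 2D blur and leaves the result in its original orientation.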
3102+
3103+#ifndef __SSSE3__
3104+ #define DO_FIR_IN_FLOAT // without mulhrs_epi16, int16 not competitive
3105+#else
3106+ #undef DO_FIR_IN_FLOAT
3107+#endif
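The int16 path below leans on pmulhrsw (_mm_mulhrs_epi16): tap weights are stored as Q15 fixed point (scaled by 32768, see ConvolveHorizontalFIR further down) and 8-bit pixels are pre-shifted left by 6 bits, so a single rounded high multiply keeps the intermediate entirely within 16-bit lanes. A scalar sketch of that arithmetic, assuming the usual pmulhrsw semantics:

    static inline int16_t MulHRS(int16_t a, int16_t b)   // per-lane behaviour of _mm_mulhrs_epi16
    {
        return (int16_t)(((int32_t)a * b + (1 << 14)) >> 15);
    }
    // Example: pixel 200 is stored as 200 << 6 = 12800; a tap weight of 0.25 is stored
    // as 0.25 * 32768 = 8192; MulHRS(12800, 8192) == 3200 == (200 * 0.25) << 6.
    // The store helper presumably undoes the << 6 when writing 8-bit output
    // (the debug SaveImage call in ConvolveFIR uses scale = 1.0f / 64 for the same reason).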
3108+
3109+#ifdef DO_FIR_IN_FLOAT
3110+
3111+// in-place (out = in) operation not allowed
3112+template <int channels, bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type>
3113+void ConvolveHorizontalFIR(SimpleImage<OutType> out, SimpleImage<InType> in,
3114+ ssize_t width, ssize_t height,
3115+ ssize_t xStart, ssize_t xEnd,
3116+ SIMD_Type *vFilter, int filterSize) // filterSize assumed to be odd
3117+{
3118+ if (channels == 4)
3119+ {
3120+ ssize_t y = 0;
3121+ do
3122+ {
3123+ ssize_t x = xStart;
3124+ #ifdef __AVX__
3125+ const ssize_t SIMD_WIDTH = 8,
3126+ PIXELS_PER_ITERATION = SIMD_WIDTH / channels;
3127+ __m256 vSum,
3128+ leftBorderValue, rightBorderValue;
3129+ if (onBorder)
3130+ {
3131+ __m128 temp;
3132+ LoadFloats(temp, &in[y][0 * channels]);
3133+ leftBorderValue = _mm256_setr_m128(temp, temp);
3134+
3135+ LoadFloats(temp, &in[y][(width - 1) * channels]);
3136+ rightBorderValue = _mm256_setr_m128(temp, temp);
3137+ }
3138+ goto middle;
3139+
3140+ do
3141+ {
3142+ StoreFloats(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum);
3143+ middle:
3144+ if (symmetric)
3145+ {
3146+ __m256 vIn;
3147+ vSum = vFilter[0] * LoadFloats(vIn, &in[y][x * channels]);
3148+ for (ssize_t i = 1; i <= filterSize; ++i)
3149+ {
3150+ __m256 filter = vFilter[i];
3151+ ssize_t srcX = x - i;
3152+ if (onBorder)
3153+ srcX = max(-(PIXELS_PER_ITERATION - 1), srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3154+
3155+ __m256 leftNeighbor, rightNeighbor;
3156+ LoadFloats(leftNeighbor, &in[y][srcX * channels]);
3157+ srcX = x + i;
3158+ if (onBorder)
3159+ srcX = min(width - 1, srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3160+
3161+ LoadFloats(rightNeighbor, &in[y][srcX * channels]);
3162+ if (onBorder)
3163+ {
3164+ __m256i notPastEnd = PartialVectorMask32(min(PIXELS_PER_ITERATION, max(ssize_t(0), width - i - x)) * channels * sizeof(float)),
3165+ beforeBeginning = PartialVectorMask32(min(PIXELS_PER_ITERATION, max(ssize_t(0), i - x)) * channels * sizeof(float));
3166+ leftNeighbor = _mm256_blendv_ps(leftNeighbor, leftBorderValue, _mm256_castsi256_ps(beforeBeginning));
3167+ rightNeighbor = _mm256_blendv_ps(rightBorderValue, rightNeighbor, _mm256_castsi256_ps(notPastEnd));
3168+ }
3169+ vSum = vSum + filter * (leftNeighbor + rightNeighbor);
3170+ }
3171+ }
3172+ else
3173+ {
3174+ vSum = _mm256_setzero_ps();
3175+ ssize_t i = 0;
3176+ // the smaller & simpler machine code for do-while probably outweighs the cost of the extra add
3177+ do
3178+ {
3179+ ssize_t srcX = x - filterSize / 2 + i;
3180+ // todo: border not handled
3181+ __m256 vIn;
3182+ LoadFloats(vIn, &in[y][srcX * channels]);
3183+ vSum = vSum + vFilter[i] * vIn;
3184+ ++i;
3185+ } while (i < filterSize);
3186+ }
3187+ x += PIXELS_PER_ITERATION;
3188+ } while (x < xEnd);
3189+ StoreFloats<true>(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum, (xEnd - (x - PIXELS_PER_ITERATION)) * channels);
3190+ #else
3191+ do
3192+ {
3193+ __m128 vSum;
3194+ if (symmetric)
3195+ {
3196+ __m128 vIn;
3197+ LoadFloats(vIn, &in[y][x * channels]);
3198+ vSum = vFilter[0] * vIn;
3199+ for (ssize_t i = 1; i <= filterSize; ++i)
3200+ {
3201+ ssize_t srcX = x - i;
3202+ if (onBorder)
3203+ srcX = max(ssize_t(0), srcX);
3204+
3205+ __m128 leftNeighbor, rightNeighbor;
3206+ LoadFloats(leftNeighbor, &in[y][srcX * channels]);
3207+
3208+ srcX = x + i;
3209+ if (onBorder)
3210+ srcX = min(width - 1, srcX);
3211+ LoadFloats(rightNeighbor, &in[y][srcX * channels]);
3212+
3213+ vSum = vSum + vFilter[i] * (leftNeighbor + rightNeighbor);
3214+ }
3215+ }
3216+ else
3217+ {
3218+ vSum = _mm_setzero_ps();
3219+ ssize_t i = 0;
3220+ do
3221+ {
3222+ ssize_t srcX = x - filterSize / 2 + i;
3223+ if (onBorder)
3224+ srcX = min(width - 1, max(ssize_t(0), srcX));
3225+
3226+ __m128 vIn;
3227+ LoadFloats(vIn, &in[y][srcX * channels]);
3228+ vSum = vSum + vFilter[i] * vIn;
3229+ ++i;
3230+ } while (i < filterSize);
3231+ }
3232+ StoreFloats(&out[y][x * channels], vSum);
3233+ ++x;
3234+ } while (x < xEnd);
3235+
3236+ #endif
3237+ ++y;
3238+ } while (y < height);
3239+ }
3240+ else
3241+ {
3242+ // 1 channel
3243+ // todo: can merge with 4 channel?
3244+ ssize_t y = 0;
3245+ do
3246+ {
3247+ ssize_t x = xStart;
3248+ #ifdef __AVX__
3249+ const ssize_t SIMD_WIDTH = 8;
3250+
3251+ __m256 leftBorderValue, rightBorderValue;
3252+ if (onBorder)
3253+ {
3254+ leftBorderValue = _mm256_set1_ps(in[y][0 * channels]);
3255+ rightBorderValue = _mm256_set1_ps(in[y][(width - 1) * channels]);
3256+ }
3257+ __m256 vSum;
3258+        // warning: basic block reordering here trashed performance, hence the goto into the loop
3259+ goto middle2;
3260+ do
3261+ {
3262+ // write out values from previous iteration
3263+ StoreFloats(&out[y][(x - SIMD_WIDTH) * channels], vSum);
3264+ middle2:
3265+ if (symmetric)
3266+ {
3267+ __m256 vIn;
3268+ vSum = vFilter[0] * LoadFloats(vIn, &in[y][x * channels]);
3269+ for (ssize_t i = 1; i <= filterSize; ++i)
3270+ {
3271+ __m256 filter = vFilter[i];
3272+
3273+ ssize_t srcX = x - i;
3274+ if (onBorder)
3275+ srcX = max(-(SIMD_WIDTH - 1), srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3276+
3277+ __m256 leftNeighbor, rightNeighbor;
3278+ LoadFloats(leftNeighbor, &in[y][srcX * channels]);
3279+ srcX = x + i;
3280+ if (onBorder)
3281+ srcX = min(width - 1, srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3282+ LoadFloats(rightNeighbor, &in[y][srcX * channels]);
3283+
3284+ if (onBorder)
3285+ {
3286+ __m256i notPastEnd = PartialVectorMask32(min(SIMD_WIDTH, max(ssize_t(0), width - i - x)) * sizeof(float)),
3287+ beforeBeginning = PartialVectorMask32(min(SIMD_WIDTH, max(ssize_t(0), i - x)) * sizeof(float));
3288+ leftNeighbor = _mm256_blendv_ps(leftNeighbor, leftBorderValue, _mm256_castsi256_ps(beforeBeginning));
3289+ rightNeighbor = _mm256_blendv_ps(rightBorderValue, rightNeighbor, _mm256_castsi256_ps(notPastEnd));
3290+ }
3291+ vSum = vSum + filter * (leftNeighbor + rightNeighbor);
3292+ }
3293+ }
3294+ else
3295+ {
3296+ vSum = _mm256_setzero_ps();
3297+ ssize_t i = 0;
3298+ // the smaller & simpler machine code for do-while probably outweighs the cost of the extra add
3299+ do
3300+ {
3301+ ssize_t srcX = x - filterSize / 2 + i;
3302+ // todo: border not handled
3303+ __m256 vIn;
3304+ LoadFloats(vIn, &in[y][srcX * channels]);
3305+ vSum = vSum + vFilter[i] * vIn;
3306+ ++i;
3307+ } while (i < filterSize);
3308+ }
3309+ x += SIMD_WIDTH;
3310+ } while (x < xEnd);
3311+ StoreFloats<true>(&out[y][(x - SIMD_WIDTH) * channels], vSum, xEnd - (x - SIMD_WIDTH));
3312+ #else
3313+ // SSE only
3314+ const ssize_t SIMD_WIDTH = 4;
3315+
3316+ __m128 leftBorderValue, rightBorderValue;
3317+ if (onBorder)
3318+ {
3319+ leftBorderValue = _mm_set1_ps(in[y][0 * channels]);
3320+ rightBorderValue = _mm_set1_ps(in[y][(width - 1) * channels]);
3321+ }
3322+ __m128 vSum;
3323+        // warning: basic block reordering here trashed performance, hence the goto into the loop
3324+ goto middle2;
3325+ do
3326+ {
3327+ // write out values from previous iteration
3328+ StoreFloats(&out[y][(x - SIMD_WIDTH) * channels], vSum);
3329+ middle2:
3330+ if (symmetric)
3331+ {
3332+ __m128 vIn;
3333+ vSum = vFilter[0] * LoadFloats(vIn, &in[y][x * channels]);
3334+ for (ssize_t i = 1; i <= filterSize; ++i)
3335+ {
3336+ __m128 filter = vFilter[i];
3337+
3338+ ssize_t srcX = x - i;
3339+ if (onBorder)
3340+ srcX = max(-(SIMD_WIDTH - 1), srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3341+
3342+ __m128 leftNeighbor, rightNeighbor;
3343+ LoadFloats(leftNeighbor, &in[y][srcX * channels]);
3344+ srcX = x + i;
3345+ if (onBorder)
3346+ srcX = min(width - 1, srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3347+ LoadFloats(rightNeighbor, &in[y][srcX * channels]);
3348+
3349+ if (onBorder)
3350+ {
3351+ __m128i notPastEnd = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), width - i - x)) * sizeof(float)),
3352+ beforeBeginning = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), i - x)) * sizeof(float));
3353+ leftNeighbor = Select(leftNeighbor, leftBorderValue, _mm_castsi128_ps(beforeBeginning));
3354+ rightNeighbor = Select(rightBorderValue, rightNeighbor, _mm_castsi128_ps(notPastEnd));
3355+ }
3356+ vSum = vSum + filter * (leftNeighbor + rightNeighbor);
3357+ }
3358+ }
3359+ else
3360+ {
3361+ vSum = _mm_setzero_ps();
3362+ ssize_t i = 0;
3363+ // the smaller & simpler machine code for do-while probably outweighs the cost of the extra add
3364+ do
3365+ {
3366+ ssize_t srcX = x - filterSize / 2 + i;
3367+ // todo: border not handled
3368+ __m128 vIn;
3369+ LoadFloats(vIn, &in[y][srcX * channels]);
3370+ vSum = vSum + vFilter[i] * vIn;
3371+ ++i;
3372+ } while (i < filterSize);
3373+ }
3374+ x += SIMD_WIDTH;
3375+ } while (x < xEnd);
3376+ StoreFloats<true>(&out[y][(x - SIMD_WIDTH) * channels], vSum, xEnd - (x - SIMD_WIDTH));
3377+ #endif
3378+ ++y;
3379+ } while (y < height);
3380+ }
3381+}
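Border handling above never branches per lane: out-of-range taps are clamped to the first or last pixel of the row, and the SIMD code gets the same result by loading a full vector regardless and then blending the replicated border pixel over the out-of-range lanes with a PartialVectorMask. The scalar equivalent of that edge policy, as a hedged sketch:

    static inline float EdgeClampedTap(const float *row, ssize_t x, ssize_t width)
    {
        if (x < 0)      return row[0];          // lanes covered by the 'beforeBeginning' mask
        if (x >= width) return row[width - 1];  // lanes outside the 'notPastEnd' mask
        return row[x];
    }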
3382+
3383+#else
3384+
3385+// DO_FIR_IN_INT16
3386+
3387+// in-place (out = in) operation not allowed
3388+template <int channels, bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type>
3389+void ConvolveHorizontalFIR(SimpleImage<OutType> out, SimpleImage<InType> in,
3390+ ssize_t width, ssize_t height,
3391+ ssize_t xStart, ssize_t xEnd,
3392+ SIMD_Type *vFilter, int filterSize)
3393+{
3394+ if (channels == 4)
3395+ {
3396+ ssize_t y = 0;
3397+ do
3398+ {
3399+ #ifdef __AVX2__
3400+ int16_t *convertedIn;
3401+ if (typeid(InType) == typeid(int16_t))
3402+ {
3403+ convertedIn = (int16_t *)&in[y][0];
3404+ }
3405+ else
3406+ {
3407+ convertedIn = (int16_t *)ALIGNED_ALLOCA(RoundUp(width * channels, ssize_t(sizeof(__m256i) / sizeof(int16_t))) * sizeof(int16_t), sizeof(__m256i));
3408+ for (ssize_t x = 0; x < width * channels; x += 16)
3409+ {
3410+ __m128i u8 = _mm_loadu_si128((__m128i *)&in[y][x]);
3411+ __m256i i16 = _mm256_slli_epi16(_mm256_cvtepu8_epi16(u8), 6);
3412+ _mm256_store_si256((__m256i *)&convertedIn[x], i16);
3413+ }
3414+ }
3415+ ssize_t x = xStart;
3416+ const ssize_t SIMD_WIDTH = 16,
3417+ PIXELS_PER_ITERATION = SIMD_WIDTH / channels;
3418+ __m256i vSum;
3419+ __m256i leftBorderValue, rightBorderValue;
3420+ if (onBorder)
3421+ {
3422+ __m128i temp = _mm_set1_epi64x(*(int64_t *)&convertedIn[0 * channels]);
3423+ leftBorderValue = _mm256_setr_m128i(temp, temp);
3424+ temp = _mm_set1_epi64x(*(int64_t *)&convertedIn[(width - 1) * channels]);
3425+ rightBorderValue = _mm256_setr_m128i(temp, temp);
3426+ }
3427+ goto middle2;
3428+ do
3429+ {
3430+ ScaleAndStoreInt16(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum);
3431+ middle2:
3432+ if (symmetric)
3433+ {
3434+ __m256i center = _mm256_loadu_si256((__m256i *)&convertedIn[x * channels]);
3435+ vSum = _mm256_mulhrs_epi16(vFilter[0], center);
3436+ for (ssize_t i = 1; i <= filterSize; ++i)
3437+ {
3438+ __m256i filter = vFilter[i];
3439+
3440+ ssize_t srcX = x - i;
3441+ if (onBorder)
3442+ srcX = max(-(PIXELS_PER_ITERATION - 1), srcX); // todo: use this for now until LoadAndScaleToInt16() supports partial loads
3443+
3444+ __m256i leftNeighbor, rightNeighbor;
3445+ leftNeighbor = _mm256_loadu_si256((__m256i *)&convertedIn[srcX * channels]);
3446+
3447+ srcX = x + i;
3448+ if (onBorder)
3449+ srcX = min(width - 1, srcX);
3450+ rightNeighbor = _mm256_loadu_si256((__m256i *)&convertedIn[srcX * channels]);
3451+
3452+ if (onBorder)
3453+ {
3454+ __m256i leftMask = PartialVectorMask32(min(PIXELS_PER_ITERATION, max(ssize_t(0), i - x)) * channels * sizeof(int16_t)),
3455+ rightMask = PartialVectorMask32(min(PIXELS_PER_ITERATION, width - (x + i)) * channels * sizeof(int16_t));
3456+ leftNeighbor = _mm256_blendv_epi8(leftNeighbor, leftBorderValue, leftMask);
3457+ rightNeighbor = _mm256_blendv_epi8(rightBorderValue, rightNeighbor, rightMask);
3458+ }
3459+ vSum = _mm256_adds_epi16(vSum, _mm256_mulhrs_epi16(filter, _mm256_adds_epi16(leftNeighbor, rightNeighbor)));
3460+ }
3461+ }
3462+ else
3463+ {
3464+ throw 0;
3465+ }
3466+ x += PIXELS_PER_ITERATION;
3467+ } while (x < xEnd);
3468+ ScaleAndStoreInt16<true>(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum, (xEnd - (x - PIXELS_PER_ITERATION)) * channels);
3469+ #else
3470+ // SSSE3 only
3471+ int16_t *convertedIn;
3472+ if (typeid(InType) == typeid(int16_t))
3473+ {
3474+ convertedIn = (int16_t *)&in[y][0];
3475+ }
3476+ else
3477+ {
3478+ convertedIn = (int16_t *)ALIGNED_ALLOCA(RoundUp(width * channels, ssize_t(sizeof(__m128i) / sizeof(int16_t))) * sizeof(int16_t), sizeof(__m128i));
3479+ for (ssize_t x = 0; x < width * channels; x += 8)
3480+ {
3481+ __m128i u8 = _mm_loadl_epi64((__m128i *)&in[y][x]);
3482+ __m128i i16 = _mm_slli_epi16(_mm_cvtepu8_epi16(u8), 6);
3483+ _mm_store_si128((__m128i *)&convertedIn[x], i16);
3484+ }
3485+ }
3486+ ssize_t x = xStart;
3487+ const ssize_t SIMD_WIDTH = 8,
3488+ PIXELS_PER_ITERATION = SIMD_WIDTH / channels;
3489+ __m128i vSum;
3490+ __m128i leftBorderValue, rightBorderValue;
3491+ if (onBorder)
3492+ {
3493+ leftBorderValue = _mm_set1_epi64x(*(int64_t *)&convertedIn[0 * channels]);
3494+ rightBorderValue = _mm_set1_epi64x(*(int64_t *)&convertedIn[(width - 1) * channels]);
3495+ }
3496+ goto middle3;
3497+ do
3498+ {
3499+ ScaleAndStoreInt16(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum);
3500+ middle3:
3501+ if (symmetric)
3502+ {
3503+ __m128i center;
3504+ vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), LoadAndScaleToInt16(center, &convertedIn[x * channels]));
3505+ for (ssize_t i = 1; i <= filterSize; ++i)
3506+ {
3507+ __m128i filter = Cast256To128(vFilter[i]);
3508+
3509+ ssize_t srcX = x - i;
3510+ if (onBorder)
3511+ srcX = max(-(PIXELS_PER_ITERATION - 1), srcX); // todo: use this for now until LoadAndScaleToInt16() supports partial loads
3512+
3513+ __m128i leftNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]);
3514+
3515+ srcX = x + i;
3516+ if (onBorder)
3517+ srcX = min(width - 1, srcX);
3518+
3519+ __m128i rightNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]);
3520+
3521+ if (onBorder)
3522+ {
3523+ __m128i leftMask = PartialVectorMask(min(PIXELS_PER_ITERATION, max(ssize_t(0), i - x)) * channels * sizeof(int16_t)),
3524+ rightMask = PartialVectorMask(min(PIXELS_PER_ITERATION, width - (x + i)) * channels * sizeof(int16_t));
3525+ leftNeighbor = _mm_blendv_epi8(leftNeighbor, leftBorderValue, leftMask);
3526+ rightNeighbor = _mm_blendv_epi8(rightBorderValue, rightNeighbor, rightMask);
3527+ }
3528+ vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(filter, _mm_adds_epi16(leftNeighbor, rightNeighbor)));
3529+ }
3530+ }
3531+ else
3532+ {
3533+ throw 0;
3534+ }
3535+ x += PIXELS_PER_ITERATION;
3536+ } while (x < xEnd);
3537+ ScaleAndStoreInt16<true>(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum, (xEnd - (x - PIXELS_PER_ITERATION)) * channels);
3538+ #endif
3539+
3540+ ++y;
3541+ } while (y < height);
3542+ }
3543+ else
3544+ {
3545+ #ifdef __GNUC__
3546+ const static void *labels[] =
3547+ {
3548+ NULL,
3549+ &&remainder1,
3550+ &&remainder2,
3551+ &&remainder3,
3552+ &&remainder4,
3553+
3554+ &&remainder5,
3555+ &&remainder6,
3556+ &&remainder7,
3557+ &&remainder8
3558+ };
3559+ #endif
3560+ // 1 channel
3561+ // todo: can merge with 4 channel?
3562+ ssize_t y = 0;
3563+ do
3564+ {
3565+ int16_t *convertedIn;
3566+ if (typeid(InType) == typeid(int16_t))
3567+ {
3568+ convertedIn = (int16_t *)&in[y][0];
3569+ }
3570+ else
3571+ {
3572+ convertedIn = (int16_t *)ALIGNED_ALLOCA(RoundUp(width, ssize_t(sizeof(__m256i) / sizeof(int16_t))) * sizeof(int16_t), sizeof(__m256i));
3573+ #ifdef __AVX2__
3574+ for (ssize_t x = 0; x < width; x += 16)
3575+ {
3576+ __m128i u8 = _mm_loadu_si128((__m128i *)&in[y][x]);
3577+ __m256i i16 = _mm256_slli_epi16(_mm256_cvtepu8_epi16(u8), 6);
3578+ _mm256_store_si256((__m256i *)&convertedIn[x], i16);
3579+ }
3580+ #else
3581+ for (ssize_t x = 0; x < width; x += 8)
3582+ {
3583+ __m128i u8 = _mm_loadl_epi64((__m128i *)&in[y][x]);
3584+ __m128i i16 = _mm_slli_epi16(_mm_cvtepu8_epi16(u8), 6);
3585+ _mm_store_si128((__m128i *)&convertedIn[x], i16);
3586+ }
3587+ #endif
3588+ }
3589+ ssize_t x = xStart;
3590+ const ssize_t SIMD_WIDTH = 8;
3591+ __m128i vSum;
3592+ __m128i leftBorderValue, rightBorderValue;
3593+ if (onBorder)
3594+ {
3595+ leftBorderValue = _mm_set1_epi16(convertedIn[0 * channels]);
3596+ rightBorderValue = _mm_set1_epi16(convertedIn[(width - 1) * channels]);
3597+ }
3598+ goto middle;
3599+ do
3600+ {
3601+ ScaleAndStoreInt16(&out[y][x - SIMD_WIDTH], vSum);
3602+ middle:
3603+ if (symmetric)
3604+ {
3605+ #ifdef __GNUC__
3606+ // up to 1.2x faster by using palignr instead of unaligned loads!
3607+ // the greatest difficulty with using palignr is that the offset must be a compile time constant
3608+            // solution: Duff's device (a computed goto into the unrolled tap loop below)
3609+ __m128i leftHalf[2],
3610+ rightHalf[2],
3611+ center = _mm_loadu_si128((__m128i *)&convertedIn[x]);
3612+
3613+ if (onBorder)
3614+ {
3615+ __m128i mask = PartialVectorMask(min(SIMD_WIDTH, width - x) * sizeof(int16_t));
3616+ center = _mm_blendv_epi8(rightBorderValue, center, mask);
3617+ }
3618+ rightHalf[0] = leftHalf[1] = center;
3619+
3620+ vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), rightHalf[0]);
3621+
3622+ ssize_t base = 0;
3623+ while (base < filterSize)
3624+ {
3625+ leftHalf[0] = rightHalf[0];
3626+ rightHalf[1] = leftHalf[1];
3627+ rightHalf[0] = _mm_loadu_si128((__m128i *)&convertedIn[x + base + 8]);
3628+ leftHalf[1] = _mm_loadu_si128((__m128i *)&convertedIn[x - base - 8]);
3629+
3630+ if (onBorder)
3631+ {
3632+ __m128i leftMask = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), (base + 8) - x)) * sizeof(int16_t)),
3633+ rightMask = PartialVectorMask(min(SIMD_WIDTH, width - (x + base + 8)) * sizeof(int16_t));
3634+ leftHalf[1] = _mm_blendv_epi8(leftHalf[1], leftBorderValue, leftMask);
3635+ rightHalf[0] = _mm_blendv_epi8(rightBorderValue, rightHalf[0], rightMask);
3636+ }
3637+
3638+ goto *labels[min(ssize_t(8), filterSize - base)];
3639+ __m128i v, v2;
3640+ remainder8:
3641+ v = rightHalf[0]; // same as palignr(right, left, 16)
3642+ v2 = leftHalf[1];
3643+ vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 8]), _mm_adds_epi16(v, v2)));
3644+ remainder7:
3645+ v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 14);
3646+ v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 2);
3647+ vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 7]), _mm_adds_epi16(v, v2)));
3648+ remainder6:
3649+ v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 12);
3650+ v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 4);
3651+ vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 6]), _mm_adds_epi16(v, v2)));
3652+ remainder5:
3653+ v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 10);
3654+ v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 6);
3655+ vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 5]), _mm_adds_epi16(v, v2)));
3656+ remainder4:
3657+ v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 8);
3658+ v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 8);
3659+ vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 4]), _mm_adds_epi16(v, v2)));
3660+ remainder3:
3661+ v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 6);
3662+ v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 10);
3663+ vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 3]), _mm_adds_epi16(v, v2)));
3664+ remainder2:
3665+ v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 4);
3666+ v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 12);
3667+ vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 2]), _mm_adds_epi16(v, v2)));
3668+ remainder1:
3669+ v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 2);
3670+ v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 14);
3671+ vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 1]), _mm_adds_epi16(v, v2)));
3672+ base += 8;
3673+ }
3674+ #else
3675+ __m128i center;
3676+ vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), LoadAndScaleToInt16(center, &convertedIn[x * channels]));
3677+ for (ssize_t i = 1; i <= filterSize; ++i)
3678+ {
3679+ __m128i filter = Cast256To128(vFilter[i]);
3680+
3681+ ssize_t srcX = x - i;
3682+ if (onBorder)
3683+ srcX = max(-(SIMD_WIDTH - 1), srcX); // todo: use this for now until LoadAndScaleToInt16() supports partial loads
3684+
3685+ __m128i leftNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]);
3686+
3687+ srcX = x + i;
3688+ if (onBorder)
3689+ srcX = min(width - 1, srcX);
3690+
3691+ __m128i rightNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]);
3692+
3693+ if (onBorder)
3694+ {
3695+ __m128i leftMask = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), i - x)) * sizeof(int16_t)),
3696+ rightMask = PartialVectorMask(min(SIMD_WIDTH, width - (x + i)) * sizeof(int16_t));
3697+ leftNeighbor = _mm_blendv_epi8(leftNeighbor, leftBorderValue, leftMask);
3698+ rightNeighbor = _mm_blendv_epi8(rightBorderValue, rightNeighbor, rightMask);
3699+ }
3700+ vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(filter, _mm_adds_epi16(leftNeighbor, rightNeighbor)));
3701+ }
3702+ #endif
3703+ }
3704+ else
3705+ {
3706+ throw 0;
3707+ }
3708+ x += SIMD_WIDTH;
3709+ } while (x < xEnd);
3710+ ScaleAndStoreInt16<true>(&out[y][x - SIMD_WIDTH], vSum, xEnd - (x - SIMD_WIDTH));
3711+ ++y;
3712+ } while (y < height);
3713+ }
3714+}
3715+#endif
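The GCC-only branch above replaces per-tap unaligned loads with _mm_alignr_epi8 (palignr), which concatenates two registers and shifts by a byte count that must be a compile-time constant; that constraint is what forces the 8-way unrolled tap loop entered through a computed goto (Duff's device). A sketch of the building block, with a hypothetical helper name:

    #include <immintrin.h>

    // Returns data[x+shift .. x+shift+7] for a compile-time shift in [0, 8] without
    // another unaligned reload; each remainderN label above is one such instantiation.
    template <int shift>
    static inline __m128i ShiftedWindow(const int16_t *data, ssize_t x)
    {
        __m128i lo = _mm_loadu_si128((const __m128i *)&data[x]);      // x   .. x+7
        __m128i hi = _mm_loadu_si128((const __m128i *)&data[x + 8]);  // x+8 .. x+15
        return _mm_alignr_epi8(hi, lo, shift * (int)sizeof(int16_t)); // shift expressed in bytes
    }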
3716+
3717+// handles blocking
3718+// in-place (out = in) operation not allowed
3719+template <int channels, typename OutType, typename InType>
3720+void ConvolveHorizontalFIR(SimpleImage<OutType> out,
3721+ SimpleImage<InType> in,
3722+ ssize_t width, ssize_t height, float sigmaX)
3723+{
3724+#ifdef DO_FIR_IN_FLOAT
3725+ typedef MyTraits<float>::SIMDtype SIMDtype;
3726+#else
3727+ typedef MyTraits<int16_t>::SIMDtype SIMDtype;
3728+#endif
3729+
3730+ ssize_t halfFilterSize = _effect_area_scr(sigmaX);
3731+ float *filter = (float *)alloca((halfFilterSize + 1) * sizeof(float));
3732+ _make_kernel(filter, sigmaX);
3733+
3734+ SIMDtype *vFilter = (SIMDtype *)ALIGNED_ALLOCA((halfFilterSize + 1) * sizeof(SIMDtype), sizeof(SIMDtype));
3735+
3736+ for (ssize_t i = 0; i <= halfFilterSize; ++i)
3737+ {
3738+ #ifdef DO_FIR_IN_FLOAT
3739+ BroadcastSIMD(vFilter[i], filter[i]);
3740+ #else
3741+ BroadcastSIMD(vFilter[i], clip_round_cast<int16_t>(filter[i] * 32768));
3742+ #endif
3743+ }
3744+
3745+ const ssize_t IDEAL_Y_BLOCK_SIZE = 1; // pointless for now, but might be needed in the future when SIMD code processes 2 rows at a time
3746+
3747+ #pragma omp parallel
3748+ {
3749+ #pragma omp for
3750+ for (ssize_t y = 0; y < height; y += IDEAL_Y_BLOCK_SIZE)
3751+ {
3752+ ssize_t yBlockSize = min(height - y, IDEAL_Y_BLOCK_SIZE);
3753+
3754+ ssize_t nonBorderStart = min(width, RoundUp(halfFilterSize, ssize_t(sizeof(__m256) / channels / sizeof(InType)))); // so that data for non-border region is vector aligned
3755+ if (nonBorderStart < width - halfFilterSize)
3756+ ConvolveHorizontalFIR<channels, true, false>(out.SubImage(0, y),
3757+ in.SubImage(0, y),
3758+ width, yBlockSize,
3759+ nonBorderStart, width - halfFilterSize,
3760+ vFilter, halfFilterSize);
3761+ ssize_t xStart = 0,
3762+ xEnd = nonBorderStart;
3763+ processEnd:
3764+ ConvolveHorizontalFIR<channels, true, true>(out.SubImage(0, y),
3765+ in.SubImage(0, y),
3766+ width, yBlockSize,
3767+ xStart, xEnd,
3768+ vFilter, halfFilterSize);
3769+ if (xStart == 0)
3770+ {
3771+                // keep an inline-happy compiler from inlining a second call to ConvolveHorizontalFIR()
3772+ xStart = max(nonBorderStart, width - halfFilterSize); // don't refilter anything in case the 2 border regions overlap
3773+ xEnd = width;
3774+ goto processEnd;
3775+ }
3776+ }
3777+ } // omp parallel
3778+}
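The wrapper above splits each row into three ranges so that the hot interior loop carries no border checks; the goto exists only to stop the compiler from emitting a third inlined copy of the border kernel. A sketch of the split, under the assumption that RoundUp rounds up to a multiple as elsewhere in this branch:

    struct RowSplit { ssize_t leftEnd, interiorEnd; };

    static RowSplit SplitRow(ssize_t width, ssize_t halfFilterSize, ssize_t pixelsPerVector)
    {
        RowSplit s;
        // left border region ends on a vector-aligned pixel so the interior loads stay aligned
        s.leftEnd = std::min(width, RoundUp(halfFilterSize, pixelsPerVector));
        // interior ends where the right border region (the last halfFilterSize pixels) begins
        s.interiorEnd = std::max(s.leftEnd, width - halfFilterSize);
        return s;
    }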
3779+
3780+#ifdef DO_FIR_IN_FLOAT
3781+// in-place (out = in) operation not allowed
3782+template <bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type>
3783+void ConvolveVerticalFIR(SimpleImage<OutType> out, SimpleImage<InType> in,
3784+ ssize_t width, ssize_t height,
3785+ ssize_t yStart, ssize_t yEnd,
3786+ SIMD_Type *vFilter, int filterSize)
3787+{
3788+ ssize_t y = yStart;
3789+ do
3790+ {
3791+ ssize_t x = 0;
3792+#ifdef __AVX__
3793+ const ssize_t SIMD_WIDTH = 8;
3794+ __m256 vSum;
3795+ goto middle;
3796+ do
3797+ {
3798+ StoreFloats(&out[y][x - SIMD_WIDTH], vSum); // write out data from previous iteration
3799+ middle:
3800+ if (symmetric)
3801+ {
3802+ __m256 vIn;
3803+ LoadFloats(vIn, &in[y][x]);
3804+ vSum = vFilter[0] * vIn;
3805+ for (ssize_t i = 1; i <= filterSize; ++i)
3806+ {
3807+ ssize_t srcY = y - i;
3808+ if (onBorder)
3809+ srcY = max(ssize_t(0), srcY);
3810+ __m256 bottom, top;
3811+ LoadFloats(bottom, &in[srcY][x]);
3812+
3813+ srcY = y + i;
3814+ if (onBorder)
3815+ srcY = min(height - 1, srcY);
3816+ LoadFloats(top, &in[srcY][x]);
3817+
3818+ vSum = vSum + vFilter[i] * (bottom + top);
3819+ }
3820+ }
3821+ else
3822+ {
3823+ // wouldn't be surprised if the smaller & simpler do-while code outweighs the cost of the extra add
3824+ vSum = _mm256_setzero_ps();
3825+ ssize_t i = 0;
3826+ do
3827+ {
3828+ ssize_t srcY = y - filterSize / 2 + i;
3829+ if (onBorder)
3830+ srcY = min(height - 1, max(ssize_t(0), srcY));
3831+ __m256 vIn;
3832+ LoadFloats(vIn, &in[srcY][x]);
3833+ vSum = vSum + vFilter[i] * vIn;
3834+ ++i;
3835+ } while (i < filterSize);
3836+ }
3837+ x += SIMD_WIDTH;
3838+ } while (x < width);
3839+ StoreFloats<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH));
3840+#else
3841+ // for SSE only
3842+ const ssize_t SIMD_WIDTH = 4;
3843+ __m128 vSum;
3844+ goto middle;
3845+ do
3846+ {
3847+ StoreFloats(&out[y][x - SIMD_WIDTH], vSum); // write out data from previous iteration
3848+ middle:
3849+ if (symmetric)
3850+ {
3851+ __m128 vIn;
3852+ LoadFloats(vIn, &in[y][x]);
3853+ vSum = Cast256To128(vFilter[0]) * vIn;
3854+ for (ssize_t i = 1; i <= filterSize; ++i)
3855+ {
3856+ ssize_t srcY = y - i;
3857+ if (onBorder)
3858+ srcY = max(ssize_t(0), srcY);
3859+ __m128 bottom, top;
3860+ LoadFloats(bottom, &in[srcY][x]);
3861+
3862+ srcY = y + i;
3863+ if (onBorder)
3864+ srcY = min(height - 1, srcY);
3865+ LoadFloats(top, &in[srcY][x]);
3866+
3867+ vSum = vSum + Cast256To128(vFilter[i]) * (bottom + top);
3868+ }
3869+ }
3870+ else
3871+ {
3872+ vSum = _mm_setzero_ps();
3873+ ssize_t i = 0;
3874+ do
3875+ {
3876+ ssize_t srcY = y - filterSize / 2 + i;
3877+ if (onBorder)
3878+ srcY = min(height - 1, max(ssize_t(0), srcY));
3879+ __m128 _vFilter = Cast256To128(vFilter[i]);
3880+
3881+ __m128 vIn;
3882+ LoadFloats(vIn, &in[srcY][x]);
3883+ vSum = vSum + _vFilter * vIn;
3884+ ++i;
3885+ } while (i < filterSize);
3886+ }
3887+ x += SIMD_WIDTH;
3888+ } while (x < width);
3889+ StoreFloats<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH));
3890+#endif
3891+ ++y;
3892+ } while (y < yEnd);
3893+}
3894+
3895+#else // DO_FIR_IN_FLOAT
3896+
3897+// in-place (out = in) operation not allowed
3898+template <bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type>
3899+void ConvolveVerticalFIR(SimpleImage<OutType> out, SimpleImage<InType> in,
3900+ ssize_t width, ssize_t height,
3901+ ssize_t yStart, ssize_t yEnd,
3902+ SIMD_Type *vFilter, int filterSize)
3903+{
3904+ ssize_t y = yStart;
3905+ do
3906+ {
3907+ ssize_t x = 0;
3908+ #ifdef __AVX2__
3909+ const ssize_t SIMD_WIDTH = 16;
3910+ __m256i vSum;
3911+ goto middle;
3912+ do
3913+ {
3914+ ScaleAndStoreInt16(&out[y][x - SIMD_WIDTH], vSum); // store data from previous iteration
3915+ middle:
3916+ if (symmetric)
3917+ {
3918+ __m256i center;
3919+ vSum = _mm256_mulhrs_epi16(vFilter[0], LoadAndScaleToInt16(center, &in[y][x]));
3920+ for (ssize_t i = 1; i <= filterSize; ++i)
3921+ {
3922+ __m256i filter = vFilter[i];
3923+ ssize_t srcY = y + i;
3924+ if (onBorder)
3925+ srcY = min(srcY, height - 1);
3926+ __m256i topNeighbor;
3927+ LoadAndScaleToInt16(topNeighbor, &in[srcY][x]);
3928+
3929+ srcY = y - i;
3930+ if (onBorder)
3931+ srcY = max(srcY, ssize_t(0));
3932+ __m256i bottomNeighbor;
3933+ LoadAndScaleToInt16(bottomNeighbor, &in[srcY][x]);
3934+ vSum = _mm256_adds_epi16
3935+ (
3936+ vSum,
3937+ _mm256_mulhrs_epi16
3938+ (
3939+ filter,
3940+ _mm256_adds_epi16(bottomNeighbor, topNeighbor)
3941+ )
3942+ );
3943+ }
3944+ }
3945+ else
3946+ {
3947+ throw 0;
3948+ }
3949+ x += SIMD_WIDTH;
3950+ } while (x < width);
3951+ ScaleAndStoreInt16<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH));
3952+ #else
3953+ const ssize_t SIMD_WIDTH = 8;
3954+ __m128i vSum;
3955+ goto middle;
3956+ do
3957+ {
3958+ ScaleAndStoreInt16(&out[y][x - SIMD_WIDTH], vSum); // store data from previous iteration
3959+ middle:
3960+ if (symmetric)
3961+ {
3962+ __m128i center;
3963+ vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), LoadAndScaleToInt16(center, &in[y][x]));
3964+ for (ssize_t i = 1; i <= filterSize; ++i)
3965+ {
3966+ __m128i filter = Cast256To128(vFilter[i]);
3967+ ssize_t srcY = y + i;
3968+ if (onBorder)
3969+ srcY = min(srcY, height - 1);
3970+ __m128i topNeighbor;
3971+ LoadAndScaleToInt16(topNeighbor, &in[srcY][x]);
3972+
3973+ srcY = y - i;
3974+ if (onBorder)
3975+ srcY = max(srcY, ssize_t(0));
3976+ __m128i bottomNeighbor;
3977+ LoadAndScaleToInt16(bottomNeighbor, &in[srcY][x]);
3978+ vSum = _mm_adds_epi16
3979+ (
3980+ vSum,
3981+ _mm_mulhrs_epi16
3982+ (
3983+ filter,
3984+ _mm_adds_epi16(bottomNeighbor, topNeighbor)
3985+ )
3986+ );
3987+ }
3988+ }
3989+ else
3990+ {
3991+ throw 0;
3992+ }
3993+ x += SIMD_WIDTH;
3994+ } while (x < width);
3995+ ScaleAndStoreInt16<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH));
3996+ #endif
3997+ ++y;
3998+ } while (y < yEnd);
3999+}
4000+#endif
4001+
4002+// in-place (out = in) operation not allowed
4003+template <typename OutType, typename InType>
4004+void ConvolveVerticalFIR(SimpleImage<OutType> out,
4005+ SimpleImage<InType> in,
4006+ ssize_t width, ssize_t height,
4007+ float sigmaY)
4008+{
4009+#ifdef DO_FIR_IN_FLOAT
4010+ typedef MyTraits<float>::SIMDtype SIMDtype;
4011+#else
4012+ typedef MyTraits<int16_t>::SIMDtype SIMDtype;
4013+#endif
4014+ int halfFilterSize = _effect_area_scr(sigmaY);
4015+
4016+ float *filter = (float *)alloca((halfFilterSize + 1) * sizeof(float));
4017+
4018+ _make_kernel(filter, sigmaY);
4019+
4020+ SIMDtype *vFilter = (SIMDtype *)ALIGNED_ALLOCA((halfFilterSize + 1) * sizeof(SIMDtype), sizeof(SIMDtype));
4021+
4022+ for (ssize_t i = 0; i <= halfFilterSize; ++i)
4023+ {
4024+ #ifdef DO_FIR_IN_FLOAT
4025+ BroadcastSIMD(vFilter[i], filter[i]);
4026+ #else
4027+ BroadcastSIMD(vFilter[i], clip_round_cast<int16_t>(filter[i] * 32768));
4028+ #endif
4029+ }
4030+
4031+ const ssize_t IDEAL_Y_BLOCK_SIZE = 2; // currently, no advantage to making > 1
4032+
4033+ #pragma omp parallel
4034+ {
4035+ #pragma omp for
4036+ for (ssize_t y = 0; y < height; y += IDEAL_Y_BLOCK_SIZE)
4037+ {
4038+ ssize_t yBlockSize = min(height - y, IDEAL_Y_BLOCK_SIZE);
4039+ bool onBorder = y < halfFilterSize || y + IDEAL_Y_BLOCK_SIZE + halfFilterSize > height;
4040+ if (onBorder)
4041+ {
4042+ ConvolveVerticalFIR<true, true>(out, in,
4043+ width, height,
4044+ y, y + yBlockSize,
4045+ vFilter, halfFilterSize);
4046+
4047+ }
4048+ else
4049+ {
4050+ ConvolveVerticalFIR<true, false>(out, in,
4051+ width, height,
4052+ y, y + yBlockSize,
4053+ vFilter, halfFilterSize);
4054+ }
4055+ }
4056+ } // omp parallel
4057+}
4058+
4059+template <int channels, typename OutType, typename InType>
4060+void ConvolveFIR(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY)
4061+{
4062+ using namespace std::chrono;
4063+#ifdef DO_FIR_IN_FLOAT
4064+ AlignedImage<float, sizeof(__m256)> horizontalFiltered;
4065+#else
4066+ AlignedImage<int16_t, sizeof(__m256)> horizontalFiltered;
4067+#endif
4068+ horizontalFiltered.Resize(width * channels, height);
4069+
4070+ const bool DO_TIMING = false;
4071+
4072+ high_resolution_clock::time_point t0;
4073+ if (DO_TIMING)
4074+ t0 = high_resolution_clock::now();
4075+
4076+ ConvolveHorizontalFIR<channels>(horizontalFiltered, in, width, height, sigmaX);
4077+
4078+ if (DO_TIMING)
4079+ {
4080+ auto t1 = high_resolution_clock::now();
4081+ cout << "T_horiz=" << duration_cast<milliseconds>(t1 - t0).count() << " ms" << endl;
4082+ t0 = t1;
4083+ }
4084+
4085+ // todo: use sliding window to reduce cache pollution
4086+ float scale = 1.0f;
4087+#ifndef DO_FIR_IN_FLOAT
4088+ scale = 1.0f / 64;
4089+#endif
4090+
4091+ //SaveImage("horizontal_filtered.png", horizontalFiltered, width, height, channels, scale);
4092+ ConvolveVerticalFIR(out, horizontalFiltered, width * channels, height, sigmaY);
4093+ if (DO_TIMING)
4094+ cout << "T_v=" << duration_cast<milliseconds>(high_resolution_clock::now() - t0).count() << " ms" << endl;
4095+}
4096
4097=== modified file 'src/display/nr-filter-gaussian.cpp'
4098--- src/display/nr-filter-gaussian.cpp 2014-03-27 01:33:44 +0000
4099+++ src/display/nr-filter-gaussian.cpp 2016-12-22 06:18:34 +0000
4100@@ -12,13 +12,13 @@
4101 */
4102
4103 #include "config.h" // Needed for HAVE_OPENMP
4104-
4105 #include <algorithm>
4106 #include <cmath>
4107 #include <complex>
4108 #include <cstdlib>
4109 #include <glib.h>
4110 #include <limits>
4111+#include <typeinfo>
4112 #if HAVE_OPENMP
4113 #include <omp.h>
4114 #endif //HAVE_OPENMP
4115@@ -32,11 +32,18 @@
4116 #include <2geom/affine.h>
4117 #include "util/fixed_point.h"
4118 #include "preferences.h"
4119+#include <fstream>
4120+#include <iomanip>
4121+#include <cpuid.h>
4122+#include <chrono>
4123+#include "SimpleImage.h"
4124
4125 #ifndef INK_UNUSED
4126 #define INK_UNUSED(x) ((void)(x))
4127 #endif
4128
4129+using namespace std;
4130+
4131 // IIR filtering method based on:
4132 // L.J. van Vliet, I.T. Young, and P.W. Verbeek, Recursive Gaussian Derivative Filters,
4133 // in: A.K. Jain, S. Venkatesh, B.C. Lovell (eds.),
4134@@ -54,10 +61,12 @@
4135 // filters are used).
4136 static size_t const N = 3;
4137
4138+#if __cplusplus < 201103
4139 template<typename InIt, typename OutIt, typename Size>
4140 inline void copy_n(InIt beg_in, Size N, OutIt beg_out) {
4141 std::copy(beg_in, beg_in+N, beg_out);
4142 }
4143+#endif
4144
4145 // Type used for IIR filter coefficients (can be 10.21 signed fixed point, see Anisotropic Gaussian Filtering Using Fixed Point Arithmetic, Christoph H. Lampert & Oliver Wirjadi, 2006)
4146 typedef double IIRValue;
4147@@ -123,6 +132,11 @@
4148
4149 FilterGaussian::FilterGaussian()
4150 {
4151+ extern void (*GaussianBlurIIR_Y8)(SimpleImage<uint8_t>, SimpleImage<uint8_t>, ssize_t, ssize_t, float, float);
4152+ void InitializeSIMDFunctions();
4153+ if (GaussianBlurIIR_Y8 == NULL)
4154+ InitializeSIMDFunctions();
4155+
4156 _deviation_x = _deviation_y = 0.0;
4157 }
4158
4159@@ -142,8 +156,8 @@
4160 return (int)std::ceil(std::fabs(deviation) * 3.0);
4161 }
4162
4163-static void
4164-_make_kernel(FIRValue *const kernel, double const deviation)
4165+template <typename FIRValue>
4166+static void _make_kernel(FIRValue *const kernel, double const deviation)
4167 {
4168 int const scr_len = _effect_area_scr(deviation);
4169 g_assert(scr_len >= 0);
4170@@ -546,6 +560,559 @@
4171 };
4172 }
4173
4174+#ifdef _MSC_VER
4175+ #define FORCE_INLINE __forceinline
4176+    #define ALIGN(x) __declspec(align(x))
4177+#else
4178+ #define FORCE_INLINE inline __attribute__((always_inline))
4179+    #define ALIGN(x) __attribute__((aligned(x)))
4180+#endif
4181+
4182+void (*GaussianBlurIIR_Y8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY);
4183+void (*GaussianBlurIIR_R8G8B8A8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY);
4184+
4185+void (*GaussianBlurFIR_Y8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY);
4186+void (*GaussianBlurFIR_R8G8B8A8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY);
4187+
4188+void(*GaussianBlurHorizontalIIR_Y8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput);
4189+void(*GaussianBlurHorizontalIIR_R8G8B8A8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput);
4190+void(*GaussianBlurVerticalIIR)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaY); // works for grayscale & RGBA
4191+
4192+template <typename AnyType>
4193+void SaveImage(const char *path, SimpleImage<AnyType> in, int width, int height, int channels, float scale = 1.0f)
4194+{
4195+ cairo_surface_t *scaled = cairo_image_surface_create(channels == 1 ? CAIRO_FORMAT_A8 : CAIRO_FORMAT_ARGB32, width, height);
4196+
4197+ uint8_t *_scaled = cairo_image_surface_get_data(scaled);
4198+ ssize_t scaledPitch = cairo_image_surface_get_stride(scaled);
4199+ for (int y = 0; y < height; ++y)
4200+ {
4201+ for (int x = 0; x < width * channels; ++x)
4202+ {
4203+ _scaled[y * scaledPitch + x] = min(in[y][x] * scale, 255.0f);
4204+ }
4205+ }
4206+ cairo_surface_mark_dirty(scaled);
4207+ if (cairo_surface_write_to_png(scaled, path) != CAIRO_STATUS_SUCCESS)
4208+ throw 0;
4209+}
4210+
4211+#if defined(__x86_64__) || defined(__i386__) || defined (_M_X64) || defined(_M_IX86) // if x86 processor
4212+
4213+#include <immintrin.h>
4214+
4215+#ifndef __GNUC__
4216+ #define __SSE__
4217+ #define __SSE2__
4218+#endif
4219+
4220+const float MAX_SIZE_FOR_SINGLE_PRECISION = 30.0f; // switch to double when sigma exceeds this threshold; otherwise round-off error grows so large that the output barely changes and visible Mach bands appear
4221+
4222+const size_t GUARANTEED_ALIGNMENT = 16;
4223+
4224+#define ALIGNED_ALLOCA(size, alignment) RoundUp((size_t)alloca(size + (((alignment) / GUARANTEED_ALIGNMENT) - 1) * GUARANTEED_ALIGNMENT), alignment)
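ALIGNED_ALLOCA over-allocates just enough slack to round the alloca() result up to the requested power-of-two alignment, given that alloca() itself is assumed to return GUARANTEED_ALIGNMENT-aligned storage. RoundUp is defined in gaussian_blur_templates.h; a sketch of the rounding it is expected to perform:

    // assumed behaviour of RoundUp for a power-of-two 'alignment'
    static inline size_t RoundUpSketch(size_t value, size_t alignment)
    {
        return (value + alignment - 1) & ~(alignment - 1);
    }
    // e.g. ALIGNED_ALLOCA(100, 32) -> alloca(100 + 16), pointer rounded up to the next
    // multiple of 32; at most 16 bytes of the extra slack are consumed by the rounding.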
4225+
4226+const ALIGN(32) uint8_t PARTIAL_VECTOR_MASK[64] =
4227+{
4228+ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
4229+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
4230+};
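PARTIAL_VECTOR_MASK is laid out as 32 bytes of 0xff followed by 32 bytes of 0 so that a mask with exactly n leading 0xff bytes can be produced by a single unaligned load. The actual PartialVectorMask/PartialVectorMask32 helpers live in gaussian_blur_templates.h, so the following is only a plausible reading of how the table is indexed:

    static inline __m128i PartialMaskSketch(ssize_t nBytes)   // 0 <= nBytes <= 16
    {
        // bytes [32 - n, 48 - n): the first n are 0xff, the remaining 16 - n are 0
        return _mm_loadu_si128((const __m128i *)&PARTIAL_VECTOR_MASK[32 - nBytes]);
    }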
4231+
4232+#ifdef _MSC_VER
4233+#define __FMA__
4234+#define __AVX2__
4235+#define __AVX__
4236+#define __SSE4_1__
4237+#define __SSSE3__
4238+#include "gaussian_blur_templates.h"
4239+
4240+#else
4241+
4242+#pragma GCC push_options
4243+
4244+namespace AVX2
4245+{
4246+#pragma GCC target("fma,avx2")
4247+#define __FMA__
4248+#define __AVX2__
4249+#define __AVX__
4250+#define __SSE4_1__
4251+#define __SSSE3__
4252+#include "gaussian_blur_templates.h"
4253+}
4254+
4255+namespace AVX
4256+{
4257+#pragma GCC target("avx")
4258+#undef __AVX2__
4259+#undef __FMA__
4260+
4261+#include "gaussian_blur_templates.h"
4262+}
4263+
4264+namespace SSE2
4265+{
4266+//#ifdef __x86_64__
4267+// #pragma GCC target("default") // base x86_64 ISA already has SSE2
4268+//#else
4269+ #pragma GCC target("sse2")
4270+//#endif
4271+#undef __AVX__
4272+#undef __SSE4_1__
4273+#undef __SSSE3__
4274+#include "gaussian_blur_templates.h"
4275+}
4276+
4277+#pragma GCC pop_options
4278+
4279+#endif
4280+
4281+void InitializeSIMDFunctions()
4282+{
4283+#ifdef __GNUC__
4284+ enum SIMDarch { SSE2, AVX, AVX2 };
4285+ const char *SIMDarchNames[] = { "SSE2", "AVX", "AVX2" };
4286+ SIMDarch arch;
4287+ if (getenv("FORCE_SIMD") != NULL)
4288+ {
4289+ arch = (SIMDarch)atoi(getenv("FORCE_SIMD"));
4290+ }
4291+ else
4292+ {
4293+ unsigned cpuInfo[4];
4294+ // version in cpuid.h is buggy - forgets to clear ecx
4295+ #define __cpuid(level, a, b, c, d) \
4296+ __asm__("xor %%ecx, %%ecx\n" \
4297+ "cpuid\n" \
4298+ : "=a"(a), "=b"(b), "=c"(c), "=d"(d) \
4299+ : "0"(level))
4300+ int maxLevel = __get_cpuid_max(0, 0);
4301+
4302+ cpuInfo[1] = 0;
4303+ if (maxLevel >= 7)
4304+ __cpuid(7, cpuInfo[0], cpuInfo[1], cpuInfo[2], cpuInfo[3]);
4305+
4306+ if (cpuInfo[1] & bit_AVX2)
4307+ {
4308+ arch = AVX2;
4309+ }
4310+ else
4311+ {
4312+ __cpuid(1, cpuInfo[0], cpuInfo[1], cpuInfo[2], cpuInfo[3]);
4313+ if (cpuInfo[2] & bit_AVX)
4314+ arch = AVX;
4315+ else if (cpuInfo[3] & bit_SSE2)
4316+ arch = SSE2;
4317+ }
4318+ }
4319+ cout << "using " << SIMDarchNames[arch] << " functions" << endl;
4320+ switch (arch)
4321+ {
4322+ case SSE2:
4323+ GaussianBlurIIR_Y8 = SSE2::Convolve<1>;
4324+ GaussianBlurIIR_R8G8B8A8 = SSE2::Convolve<4>;
4325+ GaussianBlurFIR_Y8 = SSE2::ConvolveFIR<1>;
4326+ GaussianBlurFIR_R8G8B8A8 = SSE2::ConvolveFIR<4>;
4327+ GaussianBlurHorizontalIIR_Y8 = SSE2::ConvolveHorizontal<false, 1>;
4328+ GaussianBlurHorizontalIIR_R8G8B8A8 = SSE2::ConvolveHorizontal<false, 4>;
4329+ GaussianBlurVerticalIIR = SSE2::ConvolveVertical;
4330+ break;
4331+ case AVX:
4332+ GaussianBlurIIR_Y8 = AVX::Convolve<1>;
4333+ GaussianBlurIIR_R8G8B8A8 = AVX::Convolve<4>;
4334+ GaussianBlurFIR_Y8 = AVX::ConvolveFIR<1>;
4335+ GaussianBlurFIR_R8G8B8A8 = AVX::ConvolveFIR<4>;
4336+ GaussianBlurHorizontalIIR_Y8 = AVX::ConvolveHorizontal<false, 1>;
4337+ GaussianBlurHorizontalIIR_R8G8B8A8 = AVX::ConvolveHorizontal<false, 4>;
4338+ GaussianBlurVerticalIIR = AVX::ConvolveVertical;
4339+ break;
4340+ case AVX2:
4341+ GaussianBlurIIR_Y8 = AVX2::Convolve<1>;
4342+ GaussianBlurIIR_R8G8B8A8 = AVX2::Convolve<4>;
4343+ GaussianBlurFIR_Y8 = AVX2::ConvolveFIR<1>;
4344+ GaussianBlurFIR_R8G8B8A8 = AVX2::ConvolveFIR<4>;
4345+ GaussianBlurHorizontalIIR_Y8 = AVX2::ConvolveHorizontal<false, 1>;
4346+ GaussianBlurHorizontalIIR_R8G8B8A8 = AVX2::ConvolveHorizontal<false, 4>;
4347+ GaussianBlurVerticalIIR = AVX2::ConvolveVertical;
4348+ break;
4349+ }
4350+#else
4351+ GaussianBlurIIR_Y8 = Convolve<1>;
4352+ GaussianBlurIIR_R8G8B8A8 = Convolve<4>;
4353+ GaussianBlurFIR_Y8 = ConvolveFIR<1>;
4354+ GaussianBlurFIR_R8G8B8A8 = ConvolveFIR<4>;
4355+ GaussianBlurHorizontalIIR_Y8 = ConvolveHorizontal<false, 1>;
4356+ GaussianBlurHorizontalIIR_R8G8B8A8 = ConvolveHorizontal<false, 4>;
4357+ GaussianBlurVerticalIIR = ConvolveVertical;
4358+#endif
4359+}
4360+#else
4361+void InitializeSIMDFunctions()
4362+{
4363+}
4364+#endif
4365+
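Usage note, derived from the detection code above: setting FORCE_SIMD in the environment bypasses CPUID detection, with 0 selecting the SSE2 functions, 1 AVX, and 2 AVX2 (e.g. FORCE_SIMD=0 when running the unit test or Inkscape), which makes it easy to compare code paths on one machine; the chosen set is reported by the "using ... functions" message.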
4366+
4367+#ifdef UNIT_TEST
4368+
4369+template <typename AnyType>
4370+void CompareImages(SimpleImage<AnyType> ref, SimpleImage<AnyType> actual, int w, int h)
4371+{
4372+ double avgDiff = 0,
4373+ maxDiff = 0,
4374+ maxErrorFrac = 0;
4375+ for (int y = 0; y < h; ++y)
4376+ {
4377+ for (int x = 0; x < w; ++x)
4378+ {
4379+ double diff = abs((double)actual[y][x] - (double)ref[y][x]);
4380+ maxDiff = max(diff, maxDiff);
4381+ avgDiff += diff;
4382+ if (ref[y][x] != 0)
4383+ {
4384+ double errorFrac = abs(diff / ref[y][x]);
4385+ maxErrorFrac = max(errorFrac, maxErrorFrac);
4386+ }
4387+ }
4388+ }
4389+ avgDiff /= (w * h);
4390+ cout << "avgDiff=" << setprecision(4) << setw(9) << avgDiff << " maxDiff=" << setw(4) << maxDiff << " maxErrorFrac=" << setprecision(4) << setw(8) << maxErrorFrac << endl;
4391+}
4392+
4393+void RefFilterIIR(cairo_surface_t *out, cairo_surface_t *in,
4394+ float deviation_x, float deviation_y)
4395+{
4396+ const int MAX_THREADS = 16;
4397+ int threads = std::min(omp_get_max_threads(), MAX_THREADS); // clamp so tmpdata[] below can't overflow
4398+
4399+ int h = cairo_image_surface_get_height(in),
4400+ w = cairo_image_surface_get_width(in);
4401+
4402+ IIRValue * tmpdata[MAX_THREADS];
4403+ for (int i = 0; i < threads; ++i)
4404+ tmpdata[i] = new IIRValue[std::max(w, h)*4];
4405+
4406+ gaussian_pass_IIR(Geom::X, deviation_x, in, out, tmpdata, threads);
4407+ gaussian_pass_IIR(Geom::Y, deviation_y, out, out, tmpdata, threads);
4408+
4409+ for (int i = 0; i < threads; ++i)
4410+ delete[] tmpdata[i];
4411+}
4412+
4413+void RefFilterFIR(cairo_surface_t *out, cairo_surface_t *in,
4414+ float deviation_x, float deviation_y)
4415+{
4416+ int threads = omp_get_max_threads();
4417+ gaussian_pass_FIR(Geom::X, deviation_x, in, out, threads);
4418+ gaussian_pass_FIR(Geom::Y, deviation_y, out, out, threads);
4419+}
4420+
4421+
4422+cairo_surface_t *ConvertToGrayscale(cairo_surface_t *s)
4423+{
4424+ int w = cairo_image_surface_get_width(s),
4425+ h = cairo_image_surface_get_height(s);
4426+ cairo_surface_t *temp = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, w, h),
4427+ *grayScale = cairo_image_surface_create(CAIRO_FORMAT_A8, w, h);
4428+
4429+ cairo_t *r = cairo_create(temp);
4430+ // set to 0
4431+ cairo_set_source_rgb(r, 0, 0, 0);
4432+ cairo_paint(r);
4433+
4434+ // convert to gray scale
4435+ cairo_set_operator(r, CAIRO_OPERATOR_HSL_LUMINOSITY);
4436+ cairo_set_source_surface(r, s, 0, 0);
4437+ cairo_paint(r);
4438+ cairo_destroy(r);
4439+
4440+ ssize_t inPitch = cairo_image_surface_get_stride(temp),
4441+ outPitch = cairo_image_surface_get_stride(grayScale);
4442+ uint8_t *in = cairo_image_surface_get_data(temp),
4443+ *out = cairo_image_surface_get_data(grayScale);
4444+ for (int y = 0; y < h; ++y)
4445+ {
4446+ for (int x = 0; x < w; ++x)
4447+ {
4448+ out[y * outPitch + x] = in[y * inPitch + x * 4];
4449+ }
4450+ }
4451+ cairo_surface_destroy(temp);
4452+ cairo_surface_mark_dirty(grayScale);
4453+ return grayScale;
4454+}
4455+
4456+void CopySurface(cairo_surface_t *out, cairo_surface_t *in)
4457+{
4458+ int pitch = cairo_image_surface_get_stride(in),
4459+ h = cairo_image_surface_get_height(in);
4460+
4461+ memcpy(cairo_image_surface_get_data(out), cairo_image_surface_get_data(in), pitch * h);
4462+ cairo_surface_mark_dirty(out);
4463+}
4464+
4465+extern "C" int main(int argc, char **argv)
4466+{
4467+ using namespace boost::chrono;
4468+ bool compareOrBenchmark = false;
4469+ const char *imagePath = "../drmixx/rasterized/99_showdown_carcass.png";
4470+ _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // flush denormals to zero to avoid the large slowdown of microcoded denormal handling on Intel CPUs (NVIDIA GPUs handle them in hardware)
4471+
4472+ cairo_surface_t *in;
4473+ bool result;
4474+ for (int i = 1; i < argc; ++i)
4475+ {
4476+ if (strcmp(argv[i], "-b") == 0)
4477+ compareOrBenchmark = true;
4478+ else
4479+ imagePath = argv[i];
4480+ }
4481+ in = cairo_image_surface_create_from_png(imagePath);
4482+
4483+ // note: image rows not aligned to 16 bytes can be almost 2x slower
4484+ //iluScale(RoundUp(ilGetInteger(IL_IMAGE_WIDTH), 4), RoundUp(ilGetInteger(IL_IMAGE_HEIGHT), 4), 0);
4485+
4486+ if (cairo_surface_status(in) != CAIRO_STATUS_SUCCESS)
4487+ {
4488+ cerr << "error loading" << endl;
4489+ return -1;
4490+ }
4491+
4492+ //if (!iluScale(38, 44, 1))
4493+ // cerr << "error scaling" << endl;
4494+ int originalHeight = cairo_image_surface_get_height(in),
4495+ originalWidth = cairo_image_surface_get_width(in);
4496+
4497+ cairo_surface_t *grayScaleIn = ConvertToGrayscale(in);
4498+ if (GaussianBlurIIR_Y8 == NULL)
4499+ InitializeSIMDFunctions();
4500+
4501+ auto IterateCombinations = [&](int adjustedWidth0, int adjustedWidth1, int adjustedHeight0, int adjustedHeight1, auto callback)
4502+ {
4503+ for (int adjustedHeight = adjustedHeight0; adjustedHeight <= adjustedHeight1; ++adjustedHeight)
4504+ {
4505+ for (int adjustedWidth = adjustedWidth0; adjustedWidth <= adjustedWidth1; ++adjustedWidth)
4506+ {
4507+ for (int channels = 1; channels <= 4; channels += 3) // grayscale (1) and RGBA (4)
4508+ {
4509+ cairo_surface_t *modifiedIn = cairo_surface_create_similar_image(in, channels == 1 ? CAIRO_FORMAT_A8 : CAIRO_FORMAT_ARGB32, adjustedWidth, adjustedHeight);
4510+ if (cairo_surface_status(modifiedIn) != CAIRO_STATUS_SUCCESS)
4511+ {
4512+ cerr << "error creating surface" << endl;
4513+ continue;
4514+ }
4515+ cairo_t *ct = cairo_create(modifiedIn);
4516+ cairo_scale(ct, double(adjustedWidth) / originalWidth, double(adjustedHeight) / originalHeight);
4517+ // scale/convert to given size/format
4518+ if (channels == 1)
4519+ {
4520+ cairo_set_source_rgb(ct, 1, 1, 1);
4521+ cairo_mask_surface(ct, grayScaleIn, 0, 0);
4522+ }
4523+ else
4524+ {
4525+ cairo_set_source_surface(ct, in, 0, 0);
4526+ cairo_paint(ct);
4527+ }
4528+ cairo_destroy(ct);
4529+
4530+ cairo_surface_t *out = cairo_surface_create_similar_image(modifiedIn, cairo_image_surface_get_format(modifiedIn), adjustedWidth, adjustedHeight);
4531+
4532+ for (int FIRorIIR = 0; FIRorIIR < 2; ++FIRorIIR)
4533+ {
4534+ for (int inPlace = 0; inPlace < 2; ++inPlace)
4535+ {
4536+ if (inPlace)
4537+ CopySurface(out, modifiedIn); // restore overwritten input image
4538+
4539+ callback(out, inPlace ? out : modifiedIn, modifiedIn, inPlace, FIRorIIR);
4540+ }
4541+ }
4542+ cairo_surface_destroy(modifiedIn);
4543+ cairo_surface_destroy(out);
4544+ }
4545+ }
4546+ }
4547+ };
4548+ if (!compareOrBenchmark)
4549+ {
4550+ auto CompareFunction = [&](cairo_surface_t *out, cairo_surface_t *in, cairo_surface_t *backupIn, bool inPlace, bool FIRorIIR)
4551+ {
4552+ // here, we assume input & output have the same format
4553+ int channels = cairo_image_surface_get_format(in) == CAIRO_FORMAT_ARGB32 ? 4 : 1;
4554+
4555+ SimpleImage <uint8_t> _in(cairo_image_surface_get_data(in), cairo_image_surface_get_stride(in)),
4556+ _out(cairo_image_surface_get_data(out), cairo_image_surface_get_stride(out));
4557+ int width = cairo_image_surface_get_width(in),
4558+ height = cairo_image_surface_get_height(in);
4559+
4560+ cairo_surface_t *refOut = cairo_surface_create_similar_image(in, channels == 1 ? CAIRO_FORMAT_A8 : CAIRO_FORMAT_ARGB32, width, height);
4561+
4562+ SimpleImage <uint8_t> _refOut(cairo_image_surface_get_data(refOut), cairo_image_surface_get_stride(refOut));
4563+ bool originalSize = width == originalWidth && height == originalHeight;
4564+
4565+ // test the correctness of different sigmas only for the original image size
4566+ // testing multiple sigmas for scaled image sizes is a waste
4567+ float DEFAULT_SIGMA = 5.0f;
4568+
4569+ float sigmaX0 = originalSize ? 0.5f : DEFAULT_SIGMA,
4570+ sigmaX1 = originalSize ? (FIRorIIR ? 64 : 4) : DEFAULT_SIGMA;
4571+ for (float sigmaX = sigmaX0; sigmaX <= sigmaX1; sigmaX *= 2)
4572+ {
4573+ float sigmaY = 1.2f * sigmaX;
4574+
4575+ cout << width << "x" << height
4576+ << setw(10) << (channels == 4 ? " RGBA" : " grayscale")
4577+ << (FIRorIIR ? " IIR" : " FIR")
4578+ << setw(15) << (inPlace ? " in-place" : " out-of-place")
4579+ << " sigmaX=" << setw(3) << sigmaX << " ";
4580+ double dt = 0;
4581+
4582+ if (inPlace)
4583+ {
4584+ CopySurface(out, backupIn);
4585+ }
4586+
4587+
4588+ if (FIRorIIR)
4589+ {
4590+ if (channels == 1)
4591+ GaussianBlurIIR_Y8(_out, _in, width, height, sigmaX, sigmaY);
4592+ else
4593+ GaussianBlurIIR_R8G8B8A8(_out, _in, width, height, sigmaX, sigmaY);
4594+ }
4595+ else
4596+ {
4597+ if (channels == 1)
4598+ GaussianBlurFIR_Y8(_out, _in, width, height, sigmaX, sigmaY);
4599+ else
4600+ GaussianBlurFIR_R8G8B8A8(_out, _in, width, height, sigmaX, sigmaY);
4601+ }
4602+
4603+ // ---------------------reference
4604+ cairo_surface_t *refIn;
4605+ if (inPlace)
4606+ {
4607+ refIn = refOut;
4608+ CopySurface(refIn, backupIn);
4609+ }
4610+ else
4611+ {
4612+ refIn = in;
4613+ }
4614+
4615+ if (FIRorIIR)
4616+ RefFilterIIR(refOut, refIn, sigmaX, sigmaY);
4617+ else
4618+ RefFilterFIR(refOut, refIn, sigmaX, sigmaY);
4619+
4620+ if (0)//FIRorIIR && width == 1466 && sigmaX == 0.5f)
4621+ {
4622+ cout << " dumping ";
4623+ cairo_surface_write_to_png(refOut, "filtered_ref.png");
4624+ cairo_surface_write_to_png(out, "filtered_opt.png");
4625+ exit(1);
4626+ }
4627+
4628+ CompareImages(_refOut, _out, width, height);
4629+ }
4630+ cairo_surface_destroy(refOut);
4631+ };
4632+
4633+ const int SIMD_Y_BLOCK_SIZE = 2, // for checking SIMD remainder handling issues
4634+ SIMD_X_BLOCK_SIZE = 4;
4635+ int h0 = RoundDown(originalHeight, SIMD_Y_BLOCK_SIZE),
4636+ w0 = RoundDown(originalWidth, SIMD_X_BLOCK_SIZE);
4637+ IterateCombinations(w0, w0 + SIMD_X_BLOCK_SIZE, h0, h0 + SIMD_Y_BLOCK_SIZE, CompareFunction);
4638+ //IterateCombinations(4, 8, 4, 8, CompareFunction);
4639+
4640+ }
4641+ else
4642+ {
4643+ // benchmark
4644+#if defined(_WIN32) && defined(_DEBUG)
4645+ const int REPEATS = 1;
4646+#else
4647+ const int REPEATS = 50 * max(1.0f, sqrtf(omp_get_max_threads())); // assume speedup is sqrt(#cores)
4648+#endif
4649+ auto BenchmarkFunction = [&](cairo_surface_t *out, cairo_surface_t *in, cairo_surface_t *backupIn, bool inPlace, bool FIRorIIR)
4650+ {
4651+ // here, we assume input & output have the same format
4652+ int channels = cairo_image_surface_get_format(in) == CAIRO_FORMAT_ARGB32 ? 4 : 1;
4653+
4654+ SimpleImage <uint8_t> _in(cairo_image_surface_get_data(in), cairo_image_surface_get_stride(in)),
4655+ _out(cairo_image_surface_get_data(out), cairo_image_surface_get_stride(out));
4656+ int width = cairo_image_surface_get_width(in),
4657+ height = cairo_image_surface_get_height(in);
4658+ const bool useRefCode = false;
4659+
4660+ float sigma0, sigma1;
4661+ if (FIRorIIR)
4662+ {
4663+ // test single precision & double precision throughput
4664+ sigma0 = MAX_SIZE_FOR_SINGLE_PRECISION * 0.75f;
4665+ sigma1 = sigma0 * 2.0f;
4666+ }
4667+ else
4668+ {
4669+ sigma0 = 0.5f;
4670+ sigma1 = 4;
4671+ }
4672+
4673+ for (float sigma = sigma0; sigma <= sigma1; sigma *= 2)
4674+ {
4675+ cout << width << "x" << height
4676+ << setw(10) << (channels == 4 ? " RGBA" : " grayscale")
4677+ << (FIRorIIR ? " IIR" : " FIR")
4678+ << setw(15) << (inPlace ? " in-place" : " out-of-place")
4679+ << " sigma=" << setw(3) << sigma;
4680+ high_resolution_clock::duration dt(0);
4681+ for (int i = 0; i < REPEATS; ++i)
4682+ {
4683+ if (inPlace)
4684+ CopySurface(out, backupIn); // copy backup to input/output
4685+
4686+ auto t0 = high_resolution_clock::now();
4687+ if (useRefCode)
4688+ {
4689+ if (FIRorIIR)
4690+ RefFilterIIR(out, in, sigma, sigma);
4691+ else
4692+ RefFilterFIR(out, in, sigma, sigma);
4693+ }
4694+ else
4695+ {
4696+ if (FIRorIIR)
4697+ {
4698+ if (channels == 1)
4699+ GaussianBlurIIR_Y8(_out, _in, width, height, sigma, sigma);
4700+ else
4701+ GaussianBlurIIR_R8G8B8A8(_out, _in, width, height, sigma, sigma);
4702+ }
4703+ else
4704+ {
4705+ if (channels == 1)
4706+ GaussianBlurFIR_Y8(_out, _in, width, height, sigma, sigma);
4707+ else
4708+ GaussianBlurFIR_R8G8B8A8(_out, _in, width, height, sigma, sigma);
4709+ }
4710+ }
4711+ dt += high_resolution_clock::now() - t0;
4712+ }
4713+ cout << setw(9) << setprecision(3) << double(width * height * REPEATS) / duration_cast<microseconds>(dt).count() << " Mpix/s" << endl;
4714+ }
4715+ };
4716+ int roundedWidth = RoundUp(originalWidth, 4),
4717+ roundedHeight = RoundUp(originalHeight, 4);
4718+ IterateCombinations(roundedWidth, roundedWidth, roundedHeight, roundedHeight, BenchmarkFunction);
4719+ }
4720+ cairo_surface_destroy(in);
4721+ cairo_surface_destroy(grayScaleIn);
4722+ return 0;
4723+}
4724+
4725+#endif
4726+
4727 void FilterGaussian::render_cairo(FilterSlot &slot)
4728 {
4729 cairo_surface_t *in = slot.getcairo(_input);
4730@@ -645,22 +1212,97 @@
4731 }
4732 cairo_surface_flush(downsampled);
4733
4734+ SimpleImage<uint8_t> im((uint8_t *)cairo_image_surface_get_data(downsampled), cairo_image_surface_get_stride(downsampled));
4735+
4736+ // benefits of the fused 2D filter:
4737+ // 1. the intermediate image can be kept at higher precision than uint8
4738+ // 2. less cache pollution, since the intermediate image is not flushed to memory between passes
4739+ if (scr_len_x > 0 && scr_len_y > 0 && use_IIR_x == use_IIR_y && GaussianBlurIIR_Y8 != NULL)
4740+ {
4741+ if (fmt == CAIRO_FORMAT_ARGB32) {
4742+ if (use_IIR_x) {
4743+ GaussianBlurIIR_R8G8B8A8(im, // out
4744+ im, // in
4745+ cairo_image_surface_get_width(downsampled),
4746+ cairo_image_surface_get_height(downsampled),
4747+ deviation_x, deviation_y);
4748+ }
4749+ else {
4750+ GaussianBlurFIR_R8G8B8A8(im, // out
4751+ im, // in
4752+ cairo_image_surface_get_width(downsampled),
4753+ cairo_image_surface_get_height(downsampled),
4754+ deviation_x, deviation_y);
4755+ }
4756+ }
4757+ else {
4758+ if (use_IIR_x) {
4759+ GaussianBlurIIR_Y8(im,
4760+ im,
4761+ cairo_image_surface_get_width(downsampled),
4762+ cairo_image_surface_get_height(downsampled),
4763+ deviation_x, deviation_y);
4764+ }
4765+ else {
4766+ GaussianBlurFIR_Y8(im,
4767+ im,
4768+ cairo_image_surface_get_width(downsampled),
4769+ cairo_image_surface_get_height(downsampled),
4770+ deviation_x, deviation_y);
4771+ }
4772+ }
4773+ }
4774+ else
4775+ {
4776 if (scr_len_x > 0) {
4777- if (use_IIR_x) {
4778- gaussian_pass_IIR(Geom::X, deviation_x, downsampled, downsampled, tmpdata, threads);
4779+ if (use_IIR_x && GaussianBlurIIR_Y8 != NULL) {
4780+ if (fmt == CAIRO_FORMAT_ARGB32)
4781+ {
4782+ GaussianBlurHorizontalIIR_R8G8B8A8(im, // out
4783+ im, // in
4784+ cairo_image_surface_get_width(downsampled),
4785+ cairo_image_surface_get_height(downsampled),
4786+ deviation_x, false);
4787+ }
4788+ else {
4789+ GaussianBlurHorizontalIIR_Y8(im, // out
4790+ im, // in
4791+ cairo_image_surface_get_width(downsampled),
4792+ cairo_image_surface_get_height(downsampled),
4793+ deviation_x, false);
4794+ //gaussian_pass_IIR(Geom::X, deviation_x, downsampled, downsampled, tmpdata, threads);
4795+ }
4796 } else {
4797+ // the optimized 1D FIR filter can't run in-place, so keep the original FIR pass here
4798 gaussian_pass_FIR(Geom::X, deviation_x, downsampled, downsampled, threads);
4799 }
4800 }
4801
4802 if (scr_len_y > 0) {
4803- if (use_IIR_y) {
4804- gaussian_pass_IIR(Geom::Y, deviation_y, downsampled, downsampled, tmpdata, threads);
4805+ if (use_IIR_y && GaussianBlurIIR_Y8 != NULL) {
4806+ if (fmt == CAIRO_FORMAT_ARGB32)
4807+ {
4808+ GaussianBlurVerticalIIR(im, // out
4809+ im, // in
4810+ cairo_image_surface_get_width(downsampled) * 4, // interleaved RGBA: the vertical pass filters each byte column independently, so pass the row width in bytes
4811+ cairo_image_surface_get_height(downsampled),
4812+ deviation_y);
4813+ }
4814+ else
4815+ {
4816+ GaussianBlurVerticalIIR(im, // out
4817+ im, // in
4818+ cairo_image_surface_get_width(downsampled),
4819+ cairo_image_surface_get_height(downsampled),
4820+ deviation_y);
4821+ //gaussian_pass_IIR(Geom::Y, deviation_y, downsampled, downsampled, tmpdata, threads);
4822+ }
4823 } else {
4824+ // the optimized 1D FIR filter can't run in-place, so keep the original FIR pass here
4825 gaussian_pass_FIR(Geom::Y, deviation_y, downsampled, downsampled, threads);
4826 }
4827 }
4828-
4829+ }
4830 // free the temporary data
4831 if ( use_IIR_x || use_IIR_y ) {
4832 for(int i = 0; i < threads; ++i) {