Merge lp:~simdgenius/inkscape/inkscape into lp:~inkscape.dev/inkscape/trunk

Proposed by Yale Zhang
Status: Needs review
Proposed branch: lp:~simdgenius/inkscape/inkscape
Merge into: lp:~inkscape.dev/inkscape/trunk
Diff against target: 4832 lines (+4736/-8)
3 files modified
src/display/SimpleImage.h (+80/-0)
src/display/gaussian_blur_templates.h (+4006/-0)
src/display/nr-filter-gaussian.cpp (+650/-8)
To merge this branch: bzr merge lp:~simdgenius/inkscape/inkscape
Reviewer Review Type Date Requested Status
Mc Abstain
Review via email: mp+307251@code.launchpad.net

Description of the change

This vectorized version of Gaussian blur gives a speedup of about 5x for IIR filtering (more is expected) and ~20x for FIR on a modern processor supporting AVX2.

Searching the mailing lists, users have been complaining about blurs being very slow since the beginning of Inkscape. I think Jasper's use of recursive filters (IIR) was a big help, and so was OpenMP multithreading. But more is better, especially for artists who can't afford to be interrupted by slowly refreshing screens.

The code is a monstrosity in terms of size, but it kind of has to be, given all the different cases ({FIR, IIR} x {int16, float, double} x {grayscale, RGBA}). Plus, it needs to support SSE2 (any x86_64 CPU has SSE2), AVX, and AVX2 through runtime dispatch. So I fear for its maintainability.
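To give an idea of what that runtime dispatch looks like, here is a minimal sketch only; the blur_* names are hypothetical stand-ins for the per-ISA instantiations, __builtin_cpu_supports is the GCC/clang built-in, and the branch's actual dispatch code may differ:

static void blur_sse2(float *img, int w, int h) { /* SSE2 path (stand-in stub) */ }
static void blur_avx (float *img, int w, int h) { /* AVX path (stand-in stub) */ }
static void blur_avx2(float *img, int w, int h) { /* AVX2/FMA path (stand-in stub) */ }

void gaussian_blur_dispatch(float *img, int w, int h)
{
    // __builtin_cpu_supports queries CPUID at runtime (GCC/clang only).
    if (__builtin_cpu_supports("avx2"))
        blur_avx2(img, w, h);
    else if (__builtin_cpu_supports("avx"))
        blur_avx(img, w, h);
    else
        blur_sse2(img, w, h);   // baseline: every x86_64 CPU has SSE2
}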

I think I can rewrite maybe half of the code to use GCC vector extensions instead of direct intrinsics. There was a suggestion from Boudewijn Rempt to use the Vc library that Krita uses, but I don't see much benefit, because you can't expect to efficiently abstract all the SIMD operations, especially the ones that operate horizontally across vector lanes (e.g. dot products, horizontal adds, permutes/shuffles).
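For comparison, a small sketch (not code from this branch) of the two styles for a purely element-wise operation; vertical ops like this abstract nicely, while shuffles and packs have to stay as intrinsics either way:

#include <immintrin.h>   // the intrinsic version needs -mavx

// GCC/clang vector extension: 8 floats, element-wise operators come for free.
typedef float v8sf __attribute__((vector_size(32)));

v8sf madd_ext(v8sf a, v8sf b, v8sf c)
{
    return a * b + c;                                // readable, works on GCC and clang
}

__m256 madd_intrin(__m256 a, __m256 b, __m256 c)
{
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);    // the same thing with AVX intrinsics
}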

The other concern is how SIMD remainders are handled. The code currently uses partial vector stores so that it never writes past the end of an image, so it should never crash. But it does do whole-vector loads that read past the end all the time, which renders memory-checking tools like AddressSanitizer and Valgrind useless because of all the false positives. Ideally, I'd like all image rows to be padded to a multiple of 16 or 32 bytes.
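To illustrate the two options (a sketch, not the branch's exact code): a masked partial store never writes past the row end, while padding the row pitch lets whole-vector accesses on the tail stay inside the allocation and keeps ASan/Valgrind quiet. The store_partial/padded_pitch names below are made up for illustration.

#include <emmintrin.h>
#include <cstdint>
#include <cstring>

// Store only the first n bytes (n <= 16) of v; _mm_maskmoveu_si128 writes just
// the bytes whose mask byte has its high bit set. (The branch loads its masks
// from a precomputed PARTIAL_VECTOR_MASK table instead of building them here.)
static void store_partial(uint8_t *dst, __m128i v, int n)
{
    alignas(16) uint8_t mask[16] = {0};
    memset(mask, 0xFF, n);
    _mm_maskmoveu_si128(v, _mm_load_si128((const __m128i *)mask), (char *)dst);
}

// Alternative: round each row's pitch up to a multiple of the SIMD width so
// whole-vector loads/stores on the tail stay within the allocation.
static size_t padded_pitch(size_t width_bytes, size_t simd_width = 32)
{
    return (width_bytes + simd_width - 1) / simd_width * simd_width;
}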

-----------------other tasks--------------------------
*further speed up IIR - the current cost is ~44 cycles/pixel for RGBA. This is much slower than a lower bound of 7 cycles/pixel = (4 multiplies + 3 adds) x 2 passes / (2 pixels/iteration for AVX), assuming 1 instruction/cycle.

*rewrite the stand-alone unit test/benchmark as a GoogleTest (gtest) test.

*document the functions with an SVG animation

lp:~simdgenius/inkscape/inkscape updated
15139. By Martin Owens

Remove the reset on the glyph tangent, it breaks text on path (bug lp:1627523)

15140. By Tavmjong Bah

Update attributes list for rename of 'mesh' to 'meshgradient'.

15141. By Martin Owens

Add a prune method to saving svg files that removes Adobe's i:pgf tag.

15142. By Martin Owens

Adjust dock size to minimum width during canvas table size allocation signal.

15143. By Martin Owens

Merge in the new eraser mode (2) which uses clip instead of cut.

15144. By Tavmjong Bah

Prevent a crash if a mesh is defined in bounding box coordinates.

15145. By Jabiertxof

Fix a bug on eraser mode when previous clip are shapes not paths

15146. By Martin Owens

Merge in jabiertxof's hover information for measure tool

15147. By Mc

xverbs branch merge by dmitry-zhulanov
Add some "xverbs" that are like verbs but can take arguments, plus YAML file parsing for batch command-line operations.

15148. By Martin Owens

Merge jabier's Measure line LPE

15149. By Jabiertxof

Fix bug:1630821 on LPE selected nodes

15150. By Jabiertxof

Fix bug:1630796 on flatten button

15151. By Jabiertxof

Fix bug:1605334 FeImage X and Y position

15152. By Jabiertxof

Fix bug:1630796 on flatten button. Attempt 2

15153. By Jabiertxof

Remove unnecessary header from last push

15154. By Jabiertxof

Fix bug:1622321 on powerstroke

15155. By Sandra Snan

[Bug #770681] KEY MAPPING: Comma and period hijacked by scaling.

15156. By Mc

merge trunk-refactoring

15157. By Tavmjong Bah

Provide simple "preview" for mesh gradients.

15158. By Liam P. White

Fix palette flickering, probably.

15159. By Mc

merge a fix and refactor

15160. By victor.westmann

[Packaging] NSIS translation update for pt_br

15161. By victor.westmann

[Bug #1627166] Brazilian Portuguese translation for 0.92.

15162. By Sveinn í Felli

[Bug #1426423] Updated Icelandic translation for 0.92.

15163. By Tavmjong Bah

Render mesh gradients that reference other mesh gradients.

15164. By marenhachmann

[Bug #1630635] Wrong tool tip for new text line height setting.

15165. By Tavmjong Bah

Better handling of mesh gradients in Paint Selector dialog.

15166. By Tavmjong Bah

Remove unused/undefined function.

15167. By Tavmjong Bah

Ensure newly created meshes have correct 'gradientUnits'.

15168. By Tavmjong Bah

Do not create unused "vector" gradient when creating mesh gradient.

15169. By Tavmjong Bah

Code cleanup: simplify initial mesh color calculation.

15170. By Jabiertxof

Fix bug:1633521 on powerstroke

15171. By Tavmjong Bah

Implement copying of objects with mesh gradients.

15172. By victor.westmann

[Bug #1627166] Brazilian Portuguese translation for 0.92.

15173. By Tavmjong Bah

Add option to scale mesh to fit in bounding box.

15174. By jazzynico

[Bug #1633999] xcf export fails if layer names contain non-ASCII characters.

15175. By Tavmjong Bah

Use geometric bounding box for fill, visual for stroke in creating mesh.

15176. By Mc

update author list in about dialog from AUTHORS file

15177. By Tavmjong Bah

Implement 'vector-effect' value 'non-scaling-stroke'. No GUI yet.

15178. By FirasH

[Bug #1574561] Italian translation update.

15179. By Jabiertxof

Fix bug:1634641 crash on delete

15180. By Mc

Fix CMake dependency order

15181. By Tavmjong Bah

Add 'vector-effect' to attributes test.

15182. By Kris

cosmetic change

15183. By Mc

Fix gradient comparison.

15184. By Mc

update translators

15185. By marenhachmann

[Bug #1635332] Update for German translation.

15186. By Jabiertxof <email address hidden>

Fix bug#1635442

15187. By Patrick Storz

CMake: inkscape.com needs the "-mconsole" linker flag to be useful

15188. By Jordi Mas

[Bug #1636086] Update Catalan translation for Inkscape 0.92.

15189. By Mc

CPPification: almost all sp_object_set_whatever and sp_selection_whatever global functions are now methods of ObjectSet*, with these additional benefits:
- They can now act on any SelectionSet, not just the current selection;
- Whenever possible, they don't need a desktop anymore and can run if called from GUI.

I hope I did not break too many things in the process.

*: So instead of callink sp_selection_move(desktop,x,y), you call myobjectset->move(x,y)

15190. By Mc

Fix test

15191. By Mc

Fix signals

15192. By Mc

Prevent image drag/drop from grouping

15193. By Mc

Fix regression in loop prevention

15194. By Jordi Mas

[Bug #1636086] Update Catalan translation for Inkscape 0.92.

15195. By Mc

allows for denser screens in zoom correction factors

15196. By FirasH

[Bug #1574561] Italian translation update.

15197. By Jabiertxof

Close the bounding box path LPE

15198. By Jabiertxof

Fix fill between many LPE to start up with current path

15199. By Jabiertxof

Update branding folder

15200. By Jabiertxof

Fix bug:1013141 crash deleting LPE

15201. By houz

fix none color in palettes with scrollbars

15202. By houz

fix prefs icon

15203. By Mc

Add some unit tests for object-set cppification

15204. By Mc

Revert two changes from r15177

15205. By Mc

Fix crash in some commandline usage

15206. By Jabiertxof

Execution of update_po_files.sh

15207. By Tavmjong Bah

Render meshes with old syntax using camelCase.

15208. By Jabiertxof

Reformat branding folder

15209. By Jabiertxof

Update branding folder, remove fonts

15210. By Jabiertxof

Fix bug:1639083 crash closing segment with shortcut LPE

15211. By Jabiertxof

Fix bug:1639098

15212. By Jabiertxof

Fix change between multiples LPE in the same item

15213. By Jabiertxof

Fix a bug when duplicating an item with multiple LPEs on it; previously the LPEs became "clones" if there was more than one LPE on the item.
We also need to discuss what happens to LPEs copied inside a group - fork them or clone them? Currently they are cloned.
This can be a feature or a bug for the same user in different workflows. My proposal is to fork them and add an item to "Paste LPEs" to allow cloned LPEs on paste.

15214. By Jabiertxof

Fix last commit not working, LPE are cloned on copies

15215. By Jabiertxof

Move a header place

15216. By Jabiertxof

Fix a bug when applying the Bend LPE from pen/pencil without clipboard content; previously nothing happened

15217. By Jabiertxof

Minor tweak

15218. By Mc

further cppification

15219. By Jabiertxof

Fix some bugs on pen/cil dropdown shapes

15220. By Mc

merge recursive unlink clones branch

15221. By mpasteven

fix cursor on big endian systems

15222. By Jabiertxof

1639832 Blend and blur unexpected results

15223. By Mc

annotate custom builds, and add correct revno into make dist tarballs

15224. By Mattia Rizzolo

reproducible builds

15225. By jazzynico

[Bug #262341] Tooltips for LPE tool modes do not show up as translated.

15226. By suv-lp

[Bug #1638472] Quadrant points of ellipse/circle fail to snap (as source or target).

15227. By gordcaswell

[Bug #1639081] recently-used.xbel remaining when run portably.

15228. By FirasH

[Bug #1574561] Italian translation update.

15229. By Tavmjong Bah

Improve mesh handling in Fill and Stroke dialog.
Create new meshes with alternating color/white pattern
(makes it more obvious a mesh has been created).

15230. By Tavmjong Bah

Enable swapping of fill and stroke when one is a mesh.

15231. By Tavmjong Bah

Click-drag selects nodes rather than creates new mesh if mesh already exists.

15232. By Mc

merge boolop branch: Move boolop functions from sp_selected_path_<op> to ObjectSet::path<op>

15233. By Mc

resizable undocked dialogs

15234. By Jordi Mas

[Bug #1636086] Update Catalan translation for Inkscape 0.92.

15235. By mathog

patch for bug 1405292, start clipping with COPY instead of OR so GDI clipping works

15236. By Mc

fix test

15237. By Mc

fix automatic dockbar resizing

15238. By Mc

Fix filter editor update

15239. By Mc

Add a make inkscape_pot to regen potfile

15240. By Mc

Fix selection toolbar icons missing on start

15241. By Mc

Fix rare crash on undo break apart

15242. By Mc

update potfile

15243. By Tavmjong Bah

Fit to bounding box: correct transform when mesh has a non-identity gradient transform.

15244. By Mc

fix build

15245. By Patrick Storz

Packaging: Merge all fixes from 0.92.x branch for NSIS and WiX installers (Windows .exe and .msi)

15246. By Patrick Storz

Tutorials: Rename image files to follow "name.lang_id.ext" scheme
(Allows NSIS installer to autodetect localized files per locale and adjust installation components accordingly)

15247. By Yuri Chornoivan

[Bug #1407331] Ukrainian translation update for 0.92.

15248. By Sylvain Chiron

Translations. French translation update.

15249. By jazzynico

[Bug #1590529] Italian Updates for inkscape docs (0.92.x)

15250. By Tavmjong Bah

Implement tweaking of mesh handle colors.

15251. By Tavmjong Bah

Split selected rows/columns in half using Insert key.

15252. By Jordi Mas

[Bug #1636086] Update Catalan translation for Inkscape 0.92.

15253. By Tavmjong Bah

Ensure getVector() and getArray() return a valid gradient pointer.

15254. By Tavmjong Bah

Do not return invalid vector gradient when switching from mesh to linear/radial gradient.

15255. By Tavmjong Bah

Fix path outline function for meshes with nrow != ncolumn.

15256. By Tavmjong Bah

Fix status bar messages for meshes and gradients.

15257. By Tavmjong Bah

Remove debug line from last commit.

15258. By Tavmjong Bah

Another fix for the status bar with mesh gradients.

15259. By Jabiertxof

Fix #1627817. Bug in knot LPE

15260. By Tavmjong Bah

Improve mouse handling for mesh:
  * Double clicking an object will create a new mesh if one does not exist,
    otherwise clicking a line should now reliably divide the row/column.
  * Click-dragging will create a new mesh if one does not exist,
    otherwise it will do a rubberband selection of corner nodes.
    With Shift will add nodes, without will replace selected nodes.

15261. By su_v

Add Shift-I shortcut for insert node.

15262. By Tavmjong Bah

Preserve selection of corner nodes for some corner operations.

15263. By Jabiertxof

Fix #1643408. Bug in pap LPE

15264. By Tavmjong Bah

Keep corner nodes selected when possible for corner operations.

15265. By gordcaswell

[Bug #1643730] Inkscape Portable language selection not maintained.

15266. By helix84

Fix a typo in inkscape-preferences.cpp.

15267. By Tavmjong Bah

Select mesh nodes by clicking on control lines.

15268. By helix84

 * [INTL:zh_TW] Traditional Chinese translation update

15269. By Lucas Vieites

[Bug #1643818] Updated es.po for 0.92.

15270. By FirasH

[Bug #1574561] Italian translation update.

15271. By Tavmjong Bah

Remove deprecated GtkWidget:wide-separators which is ignored as of 3.20.

15272. By Tavmjong Bah

Remove deprecated GtkWidget-separator-height, ignored as of 3.20.

15273. By Tavmjong Bah

Provide a way to update a legacy document to account for the 90 to 96 dpi change.
This method relies on setting the 'viewBox'.

15274. By Jabiertxof <jtx@jtx>

Add empty stroke dash to allow rendering only fills and markers. Tested in FF and Chromium

15275. By scootergrisen

[Bug #1644934] Translation to danish.

15276. By Patrick Storz

Translations/Packaging: Convert Danish translation to UTF8

15277. By jazzynico

[Bug #1644886] Color profiles not loaded on Windows (partial fix).

15278. By Patrick Storz

CMake: Add ${INKSCAPE_SHARE_INSTALL}
This is set to "share/inkscape" by default; on Windows, however, we need to be able to install directly into "share"

15279. By Patrick Storz

CMake: Explicitly call python
At least on Windows this breaks if Python is not associated with .py files (and even if it is, an arbitrary Python version that happens to be installed on the system is used)

15280. By Patrick Storz

Remove unneeded "#include <arpa/inet.h>" in "cairo-utils.cpp"

15281. By jazzynico

[Bug #1641111] extension Visualize Path/Measure path... fails

15282. By scootergrisen

[Bug #1471443] Updated danish translation for 0.92.

15283. By Mc

update filter list when pasting and on import

15284. By Jabiertxof

Fix a transforms bug in the measure line LPE, pointed out in IRC by CR and suv

15285. By Jabiertxof

Reorganize SVG structure to have a clean measure line structure

15286. By Tavmjong Bah

Give mesh corner nodes a different color from handle nodes (following node tool coloring).

15287. By FirasH

[Bug #1574561] Italian translation update.

15288. By Tavmjong Bah

Fix bug with mesh handle update when corner moved via keys.

15289. By Tavmjong Bah

Add toggles for handle visibility, editing fill, and editing stroke.

15290. By Tavmjong Bah

Ensure new mesh is immediately editable.

15291. By Alexander Brock <email address hidden>

Improve precision of offset_cubic

15292. By Mc

prevent use of string concat for compatibility with old cmake

15293. By Jabiertxof

Add triangle knot.

15294. By Jabiertxof <jtx@jtx>

Improvements and fixes for bugs pointed out by suv on the measure line LPE

15295. By Jabiertxof <jtx@jtx>

Fix a typo

15296. By Jabiertxof

Enable node resizing in mesh tool.

15297. By Jabiertxof <jtx@jtx>

Fix names in measure line LPE

15298. By Jabiertxof

Highlight mesh handles when corner or handle selected.
Highlight mesh control lines when corner/handle hovered over.

15299. By FirasH

[Bug #1574561] Italian translation update.

15300. By Tavmjong Bah

Fix memory leak (incomplete clear).

15301. By Jabiertxof

Add dpiswitcher extension and option to scale legacy documents with it.

15302. By Jabiertxof <jtx@jtx>

Fixes for measure LPE and speed path based LPE operations

15303. By Jabiertxof <jtx@jtx>

Remove obsolete comment

15304. By Jabiertxof <jtx@jtx>

Fix measure LPE to fit future extra objects based LPE

15305. By Jabiertxof

fix bug #1644621 on show handles

15306. By Jabiertxof

fix bug #1644621 on show handles. Fix start knot on closed paths

15307. By Tavmjong Bah

Add option to save a backup when updating file for dpi change.

15308. By Jabiertxof

Fix for reopened bug #1643408

15309. By Jabiertxof

Fix for reopened bug #1643408 - minor typo fix

15310. By Jabiertxof <jtx@jtx>

Improve measure line to allow similar LPE

15311. By Tavmjong Bah

Correct error messages.

15312. By Tavmjong Bah

Improve working of Type (Smoothing) menu.

15313. By Jabiertxof <jtx@jtx>

'upport' changes to the rotate copies and mirror symmetry LPEs

15314. By Tavmjong Bah

Don't add fortify source flag for debug builds. Avoids tons of warnings.

15315. By Tavmjong Bah

Add button to access outer text style ('font-size', 'line-height'). These determine the minimum line spacing.

15316. By Tavmjong Bah

Correct outer text style input for non-px based files.

15317. By Tavmjong Bah

Fix a bug where text initially has no fill but has a stroke.

15318. By Jabiertxof <jtx@jtx>

Fix headers on LPE's

15319. By Tavmjong Bah

Fix line-height when converting between different units for flowed text.

15320. By Jabiertxof <jtx@jtx>

Apply suv patch to handle containers https://bugs.launchpad.net/inkscape/+bug/1389723/comments/95

15321. By Tavmjong Bah

Add option to unset 'line-height' (as well as determine where it is set).

15322. By Tavmjong Bah

Add missing 'pt' unit to test of legacy files.

15323. By Jabiertxof

Apply su_v patch to DPISwitcher: https://launchpadlibrarian.net/297886893/0000-fix-dpiswitcher-scaling-v1.diff

15324. By Tavmjong Bah

Add test internal scaling to account for DPI change.

15325. By Tavmjong Bah

Fixes for internal document scaling and add a second test option.

15326. By Tavmjong Bah

Fix crash from last commit due to bad preference path.

15327. By Tavmjong Bah

Save state of backup button.

15328. By Tavmjong Bah

Prevent crash when iterator invalidated after converting shape to path.

15329. By Tavmjong Bah

Fix bug where conical gradient drawn in wrong arc.

15330. By Jabiertxof <jtx@jtx>

Fix a bug on transforms in mirror symmetry

15331. By Jabiertxof <jtx@jtx>

Use Geom::Reflection instead of a custom method in the copy rotate and mirror LPEs

15332. By Jabiertxof <jtx@jtx>

Some coding style fixes

15333. By Jabiertxof

Remove 'desktop' usage on measure line LPE

15334. By Jabiertxof

Add translatable strings to trunk

15335. By Jabiertxof

Add update_helperpaths as a non-member of the node tool class, for easy calling from outside

15336. By Jabiertxof

Remove unneeded static var from previous commit

15337. By Jabiertxof <jtx@jtx>

Remove some occurrences of desktop in knot functions

15338. By Jabiertxof <jtx@jtx>

Fix strings in mirror symmetry and copy rotate LPE

15339. By Jabiertxof <jtx@jtx>

Remove string from tip

15340. By Jabiertxof

Fix undo inconsistencies in mirror LPE and in copy rotate LPE

15341. By Jabiertxof

Regenerate PO files for new translations

15342. By Yuri Chornoivan

Translation update for Ukrainian

15343. By Yuri Chornoivan

[Bug #1407331] Ukrainian translation update for 0.92.

15344. By Jabiertxof <jtx@jtx>

Remove more SPDesktop from LPE's

15345. By Jabiertxof <jtx@jtx>

Add string translatable pointed by Maren

15346. By Jabiertxof <jtx@jtx>

Update po and pot

15347. By Jabiertxof

Update pot files. For some strange reason intltool generates the output as untitled.pot instead of inkscape.pot; not sure how to fix that.

15348. By Jabiertxof

Update .pot file generated with CMake and the resulting po files. Add info to update_po_files.sh added by Mc

15349. By Yale Zhang

q

Revision history for this message
Jabiertxof (jabiertxof) wrote :

Hi Yale, could you fix up the coding style a bit? I tried to do a review, but it is complex code and my skills in these things are very limited. What about having it coexist with the current code, with a preferences check to enable it?
I left some inline comments on the diff.
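Such a preference gate could look roughly like this (a sketch only; the pref key "/options/simdblur/value" is made up for illustration, and renderSIMD/renderScalar stand in for the new and old blur paths; Inkscape::Preferences::get()/getBool() is the existing preferences API):

#include "preferences.h"   // Inkscape::Preferences

void gaussian_blur_render(/* ... filter slot arguments ... */)
{
    Inkscape::Preferences *prefs = Inkscape::Preferences::get();
    // Hypothetical key, default on; real key name would need to be agreed.
    bool use_simd = prefs->getBool("/options/simdblur/value", true);
    if (use_simd) {
        // renderSIMD(...);   // new vectorized path from this branch
    } else {
        // renderScalar(...); // existing scalar/OpenMP path
    }
}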

Revision history for this message
Jabiertxof (jabiertxof) wrote :

The diff applies cleanly, but it gives compile errors: https://inkscape.org/en/paste/10710/

Revision history for this message
Yale Zhang (simdgenius) wrote :

Jabier, thanks for reviewing. The compile error is because _mm_cvtsi64_m64() isn't supported on i686. I can fix it and replace as much of the SSE intrinsics code as possible with GCC vector extensions over the next few days.

For now, can you test on x86-64?

-yale


Revision history for this message
Jabiertxof (jabiertxof) wrote :

Hi Yale, sorry for the delay. I was able to compile this weekend at home on 64-bit Debian.
In terms of usage I only did a quick try, because I needed to revert the changes to fix a bug. In that short experience blurs go very fast; some portions sometimes don't render, but that's not your problem, it's usual in trunk. I confirm that at high zoom levels it's still very slow.

Thanks for your great work; ping me if you want any help or a specific test.

Revision history for this message
Yale Zhang (simdgenius) wrote :

Appreciate your testing. Anything for me to do before it can be accepted?

On my side, I still need to do 2 things:

1. rewrite as much of the SSE intrinsics code as possible with GCC vector extensions
2. change the way vector remainders are handled - currently it loads
past the end of arrays, but never writes past the end. But even this
can crash, which I've seen a few times.

"some portions not render sometimes but not your problem Is usual in trunk"
Yes, I've noticed it too. It happens when the filter quality isn't set
to maximum.


Revision history for this message
Jabiertxof (jabiertxof) wrote :

Hi Yale.
You also need to handle 32-bit platforms, and re-read Tav's message to the list about adding more comments to your code; he is a much more advanced dev than me.

Also, "some portions sometimes don't render, but that's not your problem, it's usual in trunk" was only my opinion; maybe I'm not right and it could be part of your code, so please make sure. My intention with that comment was to note that this is not the only time white gaps appear in the render, but your comment that it goes away at the highest quality setting makes me want to recheck it properly.

Certainly I'm not the best person to review your code, and if I approved it I would be approving in the dark, which is not good. In my opinion you should reply to Tav's message and try to get approval from him.

Cheers and thanks for your hard work, Jabier.

Revision history for this message
Jabiertxof (jabiertxof) wrote :

Also, you could drop a line to the list to check whether 0.93 will still support 32-bit OSes. I think XP is dropped, but there are others.

Revision history for this message
Mc (mc...) wrote :

I had a look at the code and asked a friend who is more familiar than me with those low-level calls, and it looks like it's doing sane things.

-> If it can compile on {linux,windows,mac}, {32bit,64bit} with {gcc,clang} (and maybe icc, visualstudio) and doesn't use features that will be removed in future C++ or compiler versions, I'd gladly take a crash course in whatever those arcane intrinsics like "_mm_packus_epi32" (I took one at random) and such actually do^^

Actually, my main fear about this merge is that if you ever leave and something ever goes wrong in those parts of the code, there may be no one knowing enough about how this works to actually fix stuff...

Marking as "Abstain", but consider it an "Approve" iff the "if it can compile on all platforms/archs/compilers" condition above is OK.

review: Abstain
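For the record, a tiny self-contained illustration of what that randomly picked intrinsic does: _mm_packus_epi32 (SSE4.1) packs two vectors of four signed 32-bit ints into one vector of eight 16-bit values with unsigned saturation (compile with -msse4.1).

#include <smmintrin.h>   // SSE4.1
#include <cstdint>
#include <cstdio>

int main()
{
    __m128i a = _mm_setr_epi32(-5, 0, 300, 70000);   // values outside [0, 65535] saturate
    __m128i b = _mm_setr_epi32(1, 2, 3, 4);
    __m128i packed = _mm_packus_epi32(a, b);
    uint16_t out[8];
    _mm_storeu_si128((__m128i *)out, packed);
    for (int i = 0; i < 8; ++i)
        printf("%u ", out[i]);                       // prints: 0 0 300 65535 1 2 3 4
    return 0;
}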
Revision history for this message
Yale Zhang (simdgenius) wrote :

"If it can compile on {linux,windows,mac}, {32bit,64bit} with {gcc,clang} (and maybe icc, visualstudio)."
How serious are we about supporting Visual Studio? If it's to be supported (possible now since we're using CMake), then I can't rewrite the intrinsics code with GCC vector extensions, since those are only supported by GCC and clang. In theory you could make a C++ class to abstract the vectors like the Vc library that Krita uses, but I think that's not worth the effort, and you can't expect to abstract all vector operations, especially the ones that operate horizontally across vector lanes (e.g. shuffles).

"there may be no one knowing enough about how this works to actually
fix stuff..."
Exactly. That's why I want to rewrite as much of the _mm_ intrinsics as possible with GCC vector extensions.
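As a concrete example of a "horizontal" operation that an element-wise abstraction doesn't capture well, summing the four lanes of a __m128 takes shuffles; a sketch (not code from the branch, and hsum_ps is just an illustrative name):

#include <xmmintrin.h>

// Horizontal add of the four float lanes of v, using only SSE1 shuffles/adds.
static float hsum_ps(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // swap neighbouring lanes
    __m128 sums = _mm_add_ps(v, shuf);                           // (a0+a1, a0+a1, a2+a3, a2+a3)
    shuf = _mm_movehl_ps(shuf, sums);                            // bring the high pair down
    sums = _mm_add_ss(sums, shuf);                               // lane 0 = (a0+a1)+(a2+a3)
    return _mm_cvtss_f32(sums);
}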


Revision history for this message
Mc (mc...) wrote :

gcc and clang are sufficient, I think.

Unmerged revisions

15349. By Yale Zhang

q

Preview Diff

=== added file 'src/display/SimpleImage.h'
--- src/display/SimpleImage.h 1970-01-01 00:00:00 +0000
+++ src/display/SimpleImage.h 2016-12-22 06:18:34 +0000
@@ -0,0 +1,80 @@
1#ifndef SIMPLE_IMAGE_H
2#define SIMPLE_IMAGE_H
3
4#if _MSC_VER
5 #ifdef _M_IX86
6 typedef int ssize_t;
7 #else
8 typedef __int64 ssize_t;
9 #endif
10#else
11 #include <stddef.h>
12#endif
13
14// a minimal image representation that allows 2D indexing with [][] and that's completely reusable
15template <typename AnyType>
16class SimpleImage
17{
18public:
19 SimpleImage()
20 {
21 }
22 SimpleImage(AnyType *b, ssize_t p)
23 {
24 buffer = b;
25 pitch = p;
26 }
27 AnyType *operator[](ssize_t y)
28 {
29 return (AnyType *)((uint8_t *)buffer + y * pitch);
30 }
31 SimpleImage<AnyType> SubImage(ssize_t x, ssize_t y)
32 {
33 return SimpleImage<AnyType>(&(*this)[y][x], pitch);
34 }
35 AnyType *buffer;
36 ssize_t pitch;
37};
38
39template <typename IntType>
40IntType RoundDown(IntType a, IntType b)
41{
42 return (a / b) * b;
43}
44
45template <typename IntType>
46IntType RoundUp(IntType a, IntType b)
47{
48 return RoundDown(a + b - 1, b);
49}
50
51#ifdef _WIN32
52 #define aligned_alloc(a, s) _aligned_malloc(s, a)
53 #define aligned_free(x) _aligned_free(x)
54#else
55 #define aligned_free(x) free(x)
56#endif
57
58template <typename AnyType, ssize_t alignment>
59class AlignedImage : public SimpleImage<AnyType>
60{
61public:
62 AlignedImage()
63 {
64 this->buffer = NULL;
65 }
66 void Resize(int width, int height)
67 {
68 if (this->buffer != NULL)
69 aligned_free(this->buffer);
70 this->pitch = RoundUp(ssize_t(width * sizeof(AnyType)), alignment);
71 this->buffer = (AnyType *)aligned_alloc(alignment, this->pitch * height);
72 }
73 ~AlignedImage()
74 {
75 if (this->buffer != NULL)
76 aligned_free(this->buffer);
77 }
78};
79
80#endif
081
=== added file 'src/display/gaussian_blur_templates.h'
--- src/display/gaussian_blur_templates.h 1970-01-01 00:00:00 +0000
+++ src/display/gaussian_blur_templates.h 2016-12-22 06:18:34 +0000
@@ -0,0 +1,4006 @@
1#ifdef __SSE3__
2 // not any faster, at least on Haswell?
3 #define _mm_loadu_pd(p) _mm_castsi128_pd(_mm_lddqu_si128((__m128i *)(p)))
4 #define _mm_loadu_ps(p) _mm_castsi128_ps(_mm_lddqu_si128((__m128i *)(p)))
5 #define _mm_loadu_si128(p) _mm_lddqu_si128(p)
6
7 #define _mm256_loadu_pd(p) _mm256_castsi256_pd(_mm256_lddqu_si256((__m256i *)(p)))
8 #define _mm256_loadu_ps(p) _mm256_castsi256_ps(_mm256_lddqu_si256((__m256i *)(p)))
9 #define _mm256_loadu_si128(p) _mm256_lddqu_si256(p)
10#else
11 #undef _mm_loadu_pd
12 #undef _mm_loadu_ps
13 #undef _mm_loadu_si128
14 #undef _mm256_loadu_pd
15 #undef _mm256_loadu_ps
16 #undef _mm256_loadu_si128
17#endif
18
19template <typename AnyType>
20struct MyTraits
21{
22};
23
24template <>
25struct MyTraits<float>
26{
27#ifdef __AVX__
28 typedef __m256 SIMDtype;
29#else
30 typedef __m128 SIMDtype;
31#endif
32};
33
34template <>
35struct MyTraits<int16_t>
36{
37#ifdef __AVX2__
38 typedef __m256i SIMDtype;
39#else
40 typedef __m128i SIMDtype;
41#endif
42};
43
44template <>
45struct MyTraits<double>
46{
47#ifdef __AVX__
48 typedef __m256d SIMDtype;
49#else
50 typedef __m128d SIMDtype;
51#endif
52};
53
54#if defined(__AVX__) && defined(__GNUC__)
55FORCE_INLINE __m256 _mm256_setr_m128(__m128 lo, __m128 hi)
56{
57 return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
58}
59
60FORCE_INLINE __m256i _mm256_setr_m128i(__m128i lo, __m128i hi)
61{
62 return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
63}
64
65FORCE_INLINE __m256d _mm256_setr_m128d(__m128d lo, __m128d hi)
66{
67 return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
68}
69#endif
70
71#ifdef __FMA__
72//#pragma GCC push_options
73//#pragma GCC target("fma")
74FORCE_INLINE __m128 MultiplyAdd(__m128 a, __m128 b, __m128 c)
75{
76 return _mm_fmadd_ps(a, b, c);
77}
78
79FORCE_INLINE __m256 MultiplyAdd(__m256 a, __m256 b, __m256 c)
80{
81 return _mm256_fmadd_ps(a, b, c);
82}
83
84FORCE_INLINE __m256d MultiplyAdd(__m256d a, __m256d b, __m256d c)
85{
86 return _mm256_fmadd_pd(a, b, c);
87}
88//#pragma GCC pop_options
89#endif
90
91#ifndef __GNUC__
92FORCE_INLINE __m128d operator + (__m128d a, __m128d b)
93{
94 return _mm_add_pd(a, b);
95}
96
97FORCE_INLINE __m128d operator - (__m128d a, __m128d b)
98{
99 return _mm_sub_pd(a, b);
100}
101
102FORCE_INLINE __m128d operator * (__m128d a, __m128d b)
103{
104 return _mm_mul_pd(a, b);
105}
106
107FORCE_INLINE __m256d operator + (__m256d a, __m256d b)
108{
109 return _mm256_add_pd(a, b);
110}
111
112FORCE_INLINE __m256d operator - (__m256d a, __m256d b)
113{
114 return _mm256_sub_pd(a, b);
115}
116
117FORCE_INLINE __m256d operator * (__m256d a, __m256d b)
118{
119 return _mm256_mul_pd(a, b);
120}
121
122FORCE_INLINE __m128 operator + (__m128 a, __m128 b)
123{
124 return _mm_add_ps(a, b);
125}
126
127FORCE_INLINE __m128 operator - (__m128 a, __m128 b)
128{
129 return _mm_sub_ps(a, b);
130}
131
132FORCE_INLINE __m128 operator * (__m128 a, __m128 b)
133{
134 return _mm_mul_ps(a, b);
135}
136
137FORCE_INLINE __m256 operator + (__m256 a, __m256 b)
138{
139 return _mm256_add_ps(a, b);
140}
141
142FORCE_INLINE __m256 operator - (__m256 a, __m256 b)
143{
144 return _mm256_sub_ps(a, b);
145}
146
147FORCE_INLINE __m256 operator * (__m256 a, __m256 b)
148{
149 return _mm256_mul_ps(a, b);
150}
151#endif
152
153#ifdef __AVX__
154FORCE_INLINE float ExtractElement0(__m256 x)
155{
156 return _mm_cvtss_f32(_mm256_castps256_ps128(x));
157}
158
159FORCE_INLINE double ExtractElement0(__m256d x)
160{
161 return _mm_cvtsd_f64(_mm256_castpd256_pd128(x));
162}
163#endif
164
165FORCE_INLINE float ExtractElement0(__m128 x)
166{
167 return _mm_cvtss_f32(x);
168}
169
170FORCE_INLINE double ExtractElement0(__m128d x)
171{
172 return _mm_cvtsd_f64(x);
173}
174
175template<int SIZE>
176static void calcTriggsSdikaInitialization(double const M[N*N], float uold[N][SIZE], float const uplus[SIZE], float const vplus[SIZE], float const alpha, float vold[N][SIZE])
177{
178 __m128 v4f_alpha = _mm_set1_ps(alpha);
179 ssize_t c;
180 for (c = 0; c + 4 <= SIZE; c += 4)
181 {
182 __m128 uminp[N];
183 for(ssize_t i=0; i<N; i++)
184 uminp[i] = _mm_loadu_ps(&uold[i][c]) - _mm_loadu_ps(&uplus[c]);
185
186 __m128 v4f_vplus = _mm_loadu_ps(&vplus[c]);
187
188 for(ssize_t i=0; i<N; i++)
189 {
190 __m128 voldf = _mm_setzero_ps();
191 for(ssize_t j=0; j<N; j++)
192 {
193 voldf = voldf + uminp[j] * _mm_set1_ps(M[i*N+j]);
194 }
195 // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled)
196 // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient
197 // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient.
198 _mm_storeu_ps(&vold[i][c], voldf * v4f_alpha + v4f_vplus);
199 }
200 }
201 while (c < SIZE)
202 {
203 double uminp[N];
204 for(ssize_t i=0; i<N; i++) uminp[i] = uold[i][c] - uplus[c];
205 for(ssize_t i=0; i<N; i++) {
206 double voldf = 0;
207 for(ssize_t j=0; j<N; j++) {
208 voldf += uminp[j]*M[i*N+j];
209 }
210 // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled)
211 // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient
212 // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient.
213 vold[i][c] = voldf*alpha;
214 vold[i][c] += vplus[c];
215 }
216 ++c;
217 }
218}
219
220template<int SIZE>
221static void calcTriggsSdikaInitialization(double const M[N*N], double uold[N][SIZE], double const uplus[SIZE], double const vplus[SIZE], double const alpha, double vold[N][SIZE])
222{
223 __m128d v2f_alpha = _mm_set1_pd(alpha);
224 ssize_t c;
225 for (c = 0; c <= SIZE - 2; c += 2)
226 {
227 __m128d uminp[N];
228 for(ssize_t i=0; i<N; i++)
229 uminp[i] = _mm_loadu_pd(&uold[i][c]) - _mm_loadu_pd(&uplus[c]);
230
231 __m128d v2f_vplus = _mm_loadu_pd(&vplus[c]);
232
233 for(ssize_t i=0; i<N; i++)
234 {
235 __m128d voldf = _mm_setzero_pd();
236 for(ssize_t j=0; j<N; j++)
237 {
238 voldf = voldf + uminp[j] * _mm_load1_pd(&M[i*N+j]);
239 }
240 // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled)
241 // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient
242 // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient.
243 _mm_storeu_pd(&vold[i][c], voldf * v2f_alpha + v2f_vplus);
244 }
245 }
246 while (c < SIZE)
247 {
248 double uminp[N];
249 for(ssize_t i=0; i<N; i++) uminp[i] = uold[i][c] - uplus[c];
250 for(ssize_t i=0; i<N; i++) {
251 double voldf = 0;
252 for(ssize_t j=0; j<N; j++) {
253 voldf += uminp[j]*M[i*N+j];
254 }
255 // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled)
256 // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient
257 // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient.
258 vold[i][c] = voldf*alpha;
259 vold[i][c] += vplus[c];
260 }
261 ++c;
262 }
263}
264
265FORCE_INLINE __m128i PartialVectorMask(ssize_t n)
266{
267 return _mm_loadu_si128((__m128i *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]);
268}
269
270#ifdef __AVX__
271FORCE_INLINE __m256i PartialVectorMask32(ssize_t n)
272{
273 return _mm256_loadu_si256((__m256i *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]);
274}
275#endif
276
277#if !defined(_WIN32) && !defined(_MSC_VER)
278 // using _mm_maskmove_si64() is preferable to _mm_maskmoveu_si128(), but for some reason on Windows, it causes memory corruption
279 // could it be due to mixing x87 and MMX?
280 #define CAN_USE_MMX
281#endif
282
283#ifdef CAN_USE_MMX
284// return __m64 so that it can be used by _mm_movemask_si64()
285FORCE_INLINE __m64 PartialVectorMask8(ssize_t n)
286{
287 return _mm_cvtsi64_m64(*(int64_t *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]);
288}
289#else
290FORCE_INLINE __m128i PartialVectorMask8(ssize_t n)
291{
292 return _mm_loadl_epi64((__m128i *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]);
293}
294#endif
295
296#ifdef __AVX__
297FORCE_INLINE __m256d LoadDoubles(__m256d &out, double *x)
298{
299 return out = _mm256_loadu_pd(x);
300}
301
302FORCE_INLINE __m256d LoadDoubles(__m256d &out, float *x)
303{
304 return out = _mm256_cvtps_pd(_mm_loadu_ps(x));
305}
306
307FORCE_INLINE __m256d LoadDoubles(__m256d &out, uint8_t *x)
308{
309 return out = _mm256_cvtepi32_pd(_mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(int32_t *)x)));
310}
311
312FORCE_INLINE __m256d LoadDoubles(__m256d &out, uint16_t *x)
313{
314 return out = _mm256_cvtepi32_pd(_mm_cvtepu16_epi32(_mm_loadl_epi64((__m128i *)x)));
315}
316
317FORCE_INLINE __m256 LoadFloats(__m256 &out, float *x) // seriously? compiler needs to be told to inline this when PIC on?
318{
319 return out = _mm256_loadu_ps(x);
320}
321
322FORCE_INLINE __m256 LoadFloats(__m256 &out, uint8_t *x)
323{
324 __m128i temp = _mm_loadl_epi64((__m128i *)x);
325#ifdef __AVX2__
326 out = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(temp));
327#else
328 out = _mm256_cvtepi32_ps(_mm256_setr_m128i(_mm_cvtepu8_epi32(temp), _mm_cvtepu8_epi32(_mm_shuffle_epi32(temp, _MM_SHUFFLE(0, 0, 0, 1)))));
329#endif
330 return out;
331}
332
333FORCE_INLINE __m256 LoadFloats(__m256 &out, uint16_t *x)
334{
335 __m128i temp = _mm_loadu_si128((__m128i *)x);
336 __m256i i32;
337#ifdef __AVX2__
338 i32 = _mm256_cvtepu16_epi32(temp);
339#else
340 __m128i zero = _mm_setzero_si128();
341 i32 = _mm256_setr_m128i(_mm_unpacklo_epi16(temp, zero), _mm_unpackhi_epi16(temp, zero));
342#endif
343 return out = _mm256_cvtepi32_ps(i32);
344}
345
346template <bool partial = false> // no, this parameter isn't redundant - without it, there will be a redundant n == 4 check when partial = 0
347FORCE_INLINE void StoreDoubles(double *out, __m256d x, ssize_t n = 4)
348{
349 if (partial)
350 _mm256_maskstore_pd(out, PartialVectorMask32(n * sizeof(double)), x);
351 else
352 _mm256_storeu_pd(out, x);
353}
354
355template <bool partial = false>
356FORCE_INLINE void StoreDoubles(float *out, __m256d x, ssize_t n = 4)
357{
358 __m128 f32 = _mm256_cvtpd_ps(x);
359 if (partial)
360 _mm_maskstore_ps(out, PartialVectorMask(n * sizeof(float)), f32);
361 else
362 _mm_storeu_ps(out, f32);
363}
364
365template <bool partial = false>
366FORCE_INLINE void StoreDoubles(uint16_t *out, __m256d x, ssize_t n = 4)
367{
368 __m128i i32 = _mm256_cvtpd_epi32(x),
369 u16 = _mm_packus_epi32(i32, i32);
370 if (partial)
371 {
372#ifdef CAN_USE_MMX
373 _mm_maskmove_si64(_mm_movepi64_pi64(u16), PartialVectorMask8(n * sizeof(int16_t)), (char *)out);
374#else
375 _mm_maskmoveu_si128(u16, PartialVectorMask8(n * sizeof(int16_t)), (char *)out);
376#endif
377 }
378 else
379 _mm_storel_epi64((__m128i *)out, u16);
380}
381
382template <bool partial = false>
383FORCE_INLINE void StoreDoubles(uint8_t *out, __m256d x, ssize_t n = 4)
384{
385 __m128i i32 = _mm256_cvtpd_epi32(x),
386 u16 = _mm_packus_epi32(i32, i32),
387 u8 = _mm_packus_epi16(u16, u16);
388 if (partial)
389 {
390#ifdef CAN_USE_MMX
391 _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out);
392#else
393 _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out);
394#endif
395 }
396 else
397 *(int32_t *)out = _mm_cvtsi128_si32(u8);
398}
399
400FORCE_INLINE void StoreDoubles(uint8_t *out, __m256d x)
401{
402 __m128i vInt = _mm_cvtps_epi32(_mm256_cvtpd_ps(x));
403 *(int32_t *)out = _mm_cvtsi128_si32(_mm_packus_epi16(_mm_packus_epi32(vInt, vInt), vInt));
404}
405
406template <bool partial = false> // no, this parameter isn't redundant - without it, there will be a redundant n == 8 check when partial = 0
407FORCE_INLINE void StoreFloats(float *out, __m256 x, ssize_t n = 8)
408{
409 if (partial)
410 _mm256_maskstore_ps(out, PartialVectorMask32(n * sizeof(float)), x);
411 else
412 _mm256_storeu_ps(out, x);
413}
414
415template <bool partial = false>
416FORCE_INLINE void StoreFloats(uint16_t *out, __m256 x, ssize_t n = 8)
417{
418 __m256i i32 = _mm256_cvtps_epi32(x);
419 __m128i u16 = _mm_packus_epi32(_mm256_castsi256_si128(i32), _mm256_extractf128_si256(i32, 1));
420 if (partial)
421 _mm_maskmoveu_si128(u16, PartialVectorMask(n * sizeof(int16_t)), (char *)out);
422 else
423 _mm_storeu_si128((__m128i *)out, u16);
424}
425
426template <bool partial = false>
427FORCE_INLINE void StoreFloats(uint8_t *out, __m256 x, ssize_t n = 8)
428{
429 __m256i i32 = _mm256_cvtps_epi32(x);
430 __m128i i32Hi = _mm256_extractf128_si256(i32, 1),
431 u16 = _mm_packus_epi32(_mm256_castsi256_si128(i32), i32Hi),
432 u8 = _mm_packus_epi16(u16, u16);
433 if (partial)
434 {
435#ifdef CAN_USE_MMX
436 _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out);
437#else
438 _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out);
439#endif
440 }
441 else
442 _mm_storel_epi64((__m128i *)out, u8);
443}
444#endif
445
446#ifdef __AVX__
447FORCE_INLINE __m256 BroadcastSIMD(__m256 &out, float x)
448{
449 return out = _mm256_set1_ps(x);
450}
451
452FORCE_INLINE __m256d BroadcastSIMD(__m256d &out, double x)
453{
454 return out = _mm256_set1_pd(x);
455}
456
457FORCE_INLINE __m256i BroadcastSIMD(__m256i &out, int16_t x)
458{
459 return out = _mm256_set1_epi16(x);
460}
461#endif
462
463FORCE_INLINE __m128 BroadcastSIMD(__m128 &out, float x)
464{
465 return out = _mm_set1_ps(x);
466}
467
468FORCE_INLINE __m128d BroadcastSIMD(__m128d &out, double x)
469{
470 return out = _mm_set1_pd(x);
471}
472
473FORCE_INLINE __m128i BroadcastSIMD(__m128i &out, int16_t x)
474{
475 return out = _mm_set1_epi16(x);
476}
477
478
479FORCE_INLINE __m128 LoadFloats(__m128 &out, float *x)
480{
481 return out = _mm_loadu_ps(x);
482}
483
484FORCE_INLINE __m128 LoadFloats(__m128 &out, uint8_t *x)
485{
486 __m128i u8 = _mm_cvtsi32_si128(*(int32_t *)x),
487 i32;
488#ifdef __SSE4_1__
489 i32 = _mm_cvtepu8_epi32(u8);
490#else
491 __m128i zero = _mm_setzero_si128();
492 i32 = _mm_unpacklo_epi16(_mm_unpacklo_epi8(u8, zero), zero);
493#endif
494 return out = _mm_cvtepi32_ps(i32);
495}
496
497FORCE_INLINE __m128 LoadFloats(__m128 &out, uint16_t *x)
498{
499 __m128i u16 = _mm_loadl_epi64((__m128i *)x),
500 i32;
501#ifdef __SSE4_1__
502 i32 = _mm_cvtepu16_epi32(u16);
503#else
504 __m128i zero = _mm_setzero_si128();
505 i32 = _mm_unpacklo_epi16(u16, zero);
506#endif
507 return out = _mm_cvtepi32_ps(i32);
508}
509
510
511template <bool partial = false> // no, this parameter isn't redundant - without it, there will be a redundant n == 4 check when partial = 0
512FORCE_INLINE void StoreFloats(float *out, __m128 x, ssize_t n = 4)
513{
514 if (partial)
515 {
516#ifdef __AVX__
517 _mm_maskstore_ps(out, PartialVectorMask(n * sizeof(float)), x);
518#else
519 _mm_maskmoveu_si128(_mm_castps_si128(x), PartialVectorMask(n * sizeof(float)), (char *)out);
520#endif
521 }
522 else
523 {
524 _mm_storeu_ps(out, x);
525 }
526}
527
528template <bool partial = false>
529FORCE_INLINE void StoreFloats(uint16_t *out, __m128 x, ssize_t n = 4)
530{
531 __m128i i32 = _mm_cvtps_epi32(x),
532#ifdef __SSE4_1__
533 u16 = _mm_packus_epi32(i32, i32);
534#else
535 u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()); // can get away with treating as int16 for now
536#endif
537 if (partial)
538 {
539#ifdef CAN_USE_MMX
540 _mm_maskmove_si64(_mm_movepi64_pi64(u16), PartialVectorMask8(n * sizeof(int16_t)), (char *)out);
541#else
542 _mm_maskmoveu_si128(u16, PartialVectorMask(n * sizeof(int16_t)), (char *)out);
543#endif
544 }
545 else
546 _mm_storel_epi64((__m128i *)out, u16);
547}
548
549template <bool partial = false>
550FORCE_INLINE void StoreFloats(uint8_t *out, __m128 x, ssize_t n = 4)
551{
552 __m128i i32 = _mm_cvtps_epi32(x),
553 u8 = _mm_packus_epi16(_mm_packs_epi32(i32, i32), i32); // should use packus_epi32, but that's only in SSE4
554 if (partial)
555 {
556#ifdef CAN_USE_MMX
557 _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out);
558#else
559 _mm_maskmoveu_si128(u8, PartialVectorMask(n), (char *)out);
560#endif
561 }
562 else
563 *(int32_t *)out = _mm_cvtsi128_si32(u8);
564}
565
566
567FORCE_INLINE __m128d LoadDoubles(__m128d &out, double *x)
568{
569 return out = _mm_loadu_pd(x);
570}
571
572FORCE_INLINE __m128d LoadDoubles(__m128d &out, uint8_t *x)
573{
574 __m128i u8 = _mm_cvtsi32_si128(*(uint16_t *)x),
575 i32;
576#ifdef __SSE4_1__
577 i32 = _mm_cvtepu8_epi32(u8);
578#else
579 __m128i zero = _mm_setzero_si128();
580 i32 = _mm_unpacklo_epi16(_mm_unpacklo_epi8(u8, zero), zero);
581#endif
582 return out = _mm_cvtepi32_pd(i32);
583}
584
585FORCE_INLINE __m128d LoadDoubles(__m128d &out, uint16_t *x)
586{
587 __m128i u16 = _mm_cvtsi32_si128(*(uint32_t *)x),
588 i32;
589#ifdef __SSE4_1__
590 i32 = _mm_cvtepu16_epi32(u16);
591#else
592 __m128i zero = _mm_setzero_si128();
593 i32 = _mm_unpacklo_epi16(u16, zero);
594#endif
595 return out = _mm_cvtepi32_pd(i32);
596}
597
598template <bool partial = false>
599FORCE_INLINE void StoreDoubles(double *out, __m128d x, ssize_t n = 2)
600{
601 if (partial)
602 {
603 #ifdef __AVX__
604 _mm_maskstore_pd(out, PartialVectorMask(n * sizeof(double)), x);
605 #else
606 _mm_maskmoveu_si128(_mm_castpd_si128(x), PartialVectorMask(n * sizeof(double)), (char *)out);
607 #endif
608 }
609 else
610 {
611 _mm_storeu_pd(out, x);
612 }
613}
614
615template <bool partial = false>
616FORCE_INLINE void StoreDoubles(float *out, __m128d x, ssize_t n = 2)
617{
618 __m128 f32 = _mm_cvtpd_ps(x);
619 if (partial)
620 {
621 #ifdef CAN_USE_MMX
622 _mm_maskmove_si64(_mm_movepi64_pi64(_mm_castps_si128(f32)), PartialVectorMask8(n * sizeof(float)), (char *)out);
623 #else
624 _mm_maskmoveu_si128(_mm_castps_si128(f32), PartialVectorMask8(n * sizeof(float)), (char *)out);
625 #endif
626 }
627 else
628 {
629 _mm_storel_pi((__m64 *)out, f32);
630 }
631}
632
633template <bool partial = false>
634FORCE_INLINE void StoreDoubles(uint16_t *out, __m128d x, ssize_t n = 2)
635{
636 __m128i i32 = _mm_cvtpd_epi32(x),
637#ifdef __SSE4_1__
638 u16 = _mm_packus_epi32(i32, i32);
639#else
640 u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()); // can get away with using i16 for now
641#endif
642 if (partial)
643 {
644 #ifdef CAN_USE_MMX
645 _mm_maskmove_si64(_mm_movepi64_pi64(u16), PartialVectorMask8(n * sizeof(int16_t)), (char *)out);
646 #else
647 _mm_maskmoveu_si128(u16, PartialVectorMask8(n * sizeof(int16_t)), (char *)out);
648 #endif
649 }
650 else
651 {
652 *(uint32_t *)out = _mm_cvtsi128_si32(u16);
653 }
654}
655
656template <bool partial = false>
657FORCE_INLINE void StoreDoubles(uint8_t *out, __m128d x, ssize_t n = 2)
658{
659 __m128i i32 = _mm_cvtpd_epi32(x),
660#ifdef __SSE4_1__
661 u16 = _mm_packus_epi32(i32, i32),
662#else
663 u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()), // can get away with using i16 for now
664#endif
665 u8 = _mm_packus_epi16(u16, u16);
666
667 if (partial)
668 {
669 #ifdef CAN_USE_MMX
670 _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out);
671 #else
672 _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out);
673 #endif
674 }
675 else
676 {
677 *(uint16_t *)out = _mm_cvtsi128_si32(u8);
678 }
679}
680
681#ifdef __AVX__
682FORCE_INLINE __m256 Load4x2Floats(uint8_t *row0, uint8_t *row1)
683{
684 return _mm256_cvtepi32_ps(_mm256_setr_m128i(_mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(int32_t *)row0)),
685 _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(int32_t *)row1))));
686}
687
688FORCE_INLINE __m256 Load4x2Floats(uint16_t *row0, uint16_t *row1)
689{
690 return _mm256_cvtepi32_ps(_mm256_setr_m128i(_mm_cvtepu16_epi32(_mm_loadl_epi64((__m128i *)row0)),
691 _mm_cvtepu16_epi32(_mm_loadl_epi64((__m128i *)row1))));
692}
693
694FORCE_INLINE __m256 Load4x2Floats(float *row0, float *row1)
695{
696 return _mm256_setr_m128(_mm_loadu_ps(row0), _mm_loadu_ps(row1));
697}
698#endif
699
700FORCE_INLINE __m128i LoadAndScaleToInt16(__m128i &out, uint8_t *x)
701{
702 // convert from [0-255] to [0-16383]
703 // leave 1 spare bit so that 2 values can be added without overflow for symmetric filters
704 __m128i u8 = _mm_loadl_epi64((__m128i *)x),
705 i16;
706#ifdef __SSE4_1__
707 i16 = _mm_cvtepu8_epi16(u8);
708#else
709 i16 = _mm_unpacklo_epi8(u8, _mm_setzero_si128());
710#endif
711 return out = _mm_slli_epi16(i16, 6);
712}
713
714__m128i LoadAndScaleToInt16(__m128i &out, int16_t *x)
715{
716 return out = _mm_loadu_si128((__m128i *)x);
717}
718
719#ifdef __AVX2__
720
721FORCE_INLINE __m256i LoadAndScaleToInt16(__m256i &out, uint8_t *x)
722{
723 // convert from [0-255] to [0-16383]
724 // leave 1 spare bit so that 2 values can be added without overflow for symmetric filters
725 return out = _mm256_slli_epi16(_mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i *)x)), 6);
726}
727
728FORCE_INLINE __m256i LoadAndScaleToInt16(__m256i &out, int16_t *x)
729{
730 return out = _mm256_loadu_si256((__m256i *)x);
731}
732
733#endif
734
735template <bool partial = false>
736FORCE_INLINE void ScaleAndStoreInt16(uint8_t *out, __m128i x, ssize_t n = 8)
737{
738 __m128i i16 = _mm_srai_epi16(_mm_adds_epi16(x, _mm_set1_epi16(32)), 6),
739 u8 = _mm_packus_epi16(i16, i16);
740 if (partial)
741 {
742#ifdef CAN_USE_MMX
743 _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out);
744#else
745 _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out);
746#endif
747 }
748 else
749 _mm_storel_epi64((__m128i *)out, u8);
750}
751
752template <bool partial = false>
753FORCE_INLINE void ScaleAndStoreInt16(int16_t *out, __m128i i16, ssize_t n = 8)
754{
755 if (partial)
756 _mm_maskmoveu_si128(i16, PartialVectorMask(n * sizeof(int16_t)), (char *)out);
757 else
758 _mm_storeu_si128((__m128i *)out, i16);
759}
760
761#ifdef __AVX2__
762
763template <bool partial = false>
764FORCE_INLINE void ScaleAndStoreInt16(uint8_t *out, __m256i x, ssize_t n = 16)
765{
766 __m256i i16 = _mm256_srai_epi16(_mm256_adds_epi16(x, _mm256_set1_epi16(32)), 6);
767 __m128i u8 = _mm256_castsi256_si128(_mm256_packus_epi16(i16, _mm256_permute2f128_si256(i16, i16, 1)));
768 if (partial)
769 _mm_maskmoveu_si128(u8, PartialVectorMask(n), (char *)out);
770 else
771 _mm_storeu_si128((__m128i *)out, u8);
772}
773
774template <bool partial = false>
775FORCE_INLINE void ScaleAndStoreInt16(int16_t *out, __m256i i16, ssize_t n = 16)
776{
777 if (partial)
778 {
779 _mm_maskmoveu_si128(_mm256_castsi256_si128(i16), PartialVectorMask(min(ssize_t(8), n) * sizeof(int16_t)), (char *)out);
780 _mm_maskmoveu_si128(_mm256_extractf128_si256(i16, 1), PartialVectorMask(max(ssize_t(0), n - 8) * sizeof(int16_t)), (char *)&out[8]);
781 }
782 else
783 _mm256_storeu_si256((__m256i *)out, i16);
784}
785#endif
786
787// selectors are doubles to avoid int-float domain transition
788FORCE_INLINE __m128d Select(__m128d a, __m128d b, __m128d selectors)
789{
790#ifdef __SSE4_1__
791 return _mm_blendv_pd(a, b, selectors);
792#else
793 return _mm_or_pd(_mm_andnot_pd(selectors, a), _mm_and_pd(selectors, b));
794#endif
795}
796
797// selectors are floats to avoid int-float domain transition
798FORCE_INLINE __m128 Select(__m128 a, __m128 b, __m128 selectors)
799{
800#ifdef __SSE4_1__
801 return _mm_blendv_ps(a, b, selectors);
802#else
803 return _mm_or_ps(_mm_andnot_ps(selectors, a), _mm_and_ps(selectors, b));
804#endif
805}
806
807// even these simple ops need to be redeclared for each SIMD architecture due to VEX and non-VEX encodings of SSE instructions
808#ifdef __AVX__
809FORCE_INLINE __m128 Cast256To128(__m256 v)
810{
811 return _mm256_castps256_ps128(v);
812}
813
814FORCE_INLINE __m128d Cast256To128(__m256d v)
815{
816 return _mm256_castpd256_pd128(v);
817}
818FORCE_INLINE __m128i Cast256To128(__m256i v)
819{
820 return _mm256_castsi256_si128(v);
821}
822#endif
823
824FORCE_INLINE __m128 Cast256To128(__m128 v)
825{
826 return v;
827}
828
829FORCE_INLINE __m128d Cast256To128(__m128d v)
830{
831 return v;
832}
833
834FORCE_INLINE __m128i Cast256To128(__m128i v)
835{
836 return v;
837}
838
839
840// does 1D IIR convolution on multiple rows (height) of data
841// IntermediateType must be float or double
842template <bool transposeOut, bool isForwardPass, bool isBorder, int channels, typename OutType, typename InType, typename IntermediateType>
843FORCE_INLINE void Convolve1DHorizontalRef(SimpleImage <OutType> out,
844 SimpleImage <InType> in,
845 IntermediateType *borderValues, // [y][color]
846 ssize_t xStart, ssize_t xEnd, ssize_t width, ssize_t height,
847 typename MyTraits<IntermediateType>::SIMDtype *vCoefficients, double M[N * N])
848{
849 ssize_t xStep = isForwardPass ? 1 : -1;
850
851 ssize_t y = 0;
852 do
853 {
854 ssize_t c = 0;
855 do
856 {
857 IntermediateType prevOut[N];
858 ssize_t x = xStart;
859 if (isBorder && !isForwardPass)
860 {
861 // xStart must be width - 1
862 IntermediateType u[N + 1][1]; // u[0] = last forward filtered value, u[1] = 2nd last forward filtered value, ...
863 for (ssize_t i = 0; i < N + 1; ++i)
864 {
865 u[i][0] = in[y][(xStart + i * xStep) * channels + c];
866 }
867 IntermediateType backwardsInitialState[N][1];
868 calcTriggsSdikaInitialization<1>(M, u, &borderValues[y * channels + c], &borderValues[y * channels + c], ExtractElement0(vCoefficients[0]), backwardsInitialState);
869 for (ssize_t i = 0; i < N; ++i)
870 prevOut[i] = backwardsInitialState[i][0];
871
872 if (transposeOut)
873 out[x][y * channels + c] = clip_round_cast<OutType, IntermediateType>(prevOut[0]);
874 else
875 out[y][x * channels + c] = clip_round_cast<OutType, IntermediateType>(prevOut[0]);
876 x += xStep;
877 if (x == xEnd)
878 goto nextIteration; // do early check here so that we can still use do-while for forward pass
879 }
880 else if (isBorder && isForwardPass)
881 {
882 for (ssize_t i = 0; i < N; ++i)
883 prevOut[i] = in[y][0 * channels + c];
884 }
885 else
886 {
887 for (ssize_t i = 0; i < N; ++i)
888 {
889 prevOut[i] = transposeOut ? out[xStart - (i + 1) * xStep][y * channels + c]
890 : out[y][(xStart - (i + 1) * xStep) * channels + c];
891 }
892 }
893
894 do
895 {
896 IntermediateType sum = prevOut[0] * ExtractElement0(vCoefficients[1])
897 + prevOut[1] * ExtractElement0(vCoefficients[2])
898 + prevOut[2] * ExtractElement0(vCoefficients[3])
899 + in[y][x * channels + c] * ExtractElement0(vCoefficients[0]); // add last for best accuracy since this terms tends to be the smallest
900 if (transposeOut)
901 out[x][y * channels + c] = clip_round_cast<OutType, IntermediateType>(sum);
902 else
903 out[y][x * channels + c] = clip_round_cast<OutType, IntermediateType>(sum);
904 prevOut[2] = prevOut[1];
905 prevOut[1] = prevOut[0];
906 prevOut[0] = sum;
907 x += xStep;
908 } while (x != xEnd);
909 ++c;
910 } while (c < channels);
911 nextIteration:
912 ++y;
913 } while (y < height);
914}
915
916template <int channels, bool transposeOut, ssize_t xStep, int i0, int i1, int i2, typename OutType, typename InType>
917FORCE_INLINE void DoOneIIR(SimpleImage<OutType> out, SimpleImage<InType> in, __m256d &vSum, __m256d &vIn, ssize_t x, ssize_t y, __m256d vCoefficients[N + 1], __m256d prevOut[N])
918{
919 vSum = vIn * vCoefficients[0];
920 LoadDoubles(vIn, &in[y][(x + xStep) * channels]); // load data for next iteration early to hide latency (software pipelining)
921
922 // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
923#ifdef __FMA__
924 // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster
925 vSum = MultiplyAdd(prevOut[i2], vCoefficients[3], vSum) + MultiplyAdd(prevOut[i1], vCoefficients[2], prevOut[i0] * vCoefficients[1]);
926#else
927 vSum = prevOut[i0] * vCoefficients[1]
928 + prevOut[i1] * vCoefficients[2]
929 + prevOut[i2] * vCoefficients[3]
930 + vIn * vCoefficients[0];
931#endif
932 if (transposeOut)
933 StoreDoubles(&out[x][y * channels], vSum);
934 else
935 StoreDoubles(&out[y][x * channels], vSum);
936}
937
938// input is always untransposed
939// for reverse pass, input is output from forward pass
940// for transposed output, in-place operation isn't possible
941// hack: GCC fails to compile when FORCE_INLINE on, most likely because OpenMP doesn't generate code using the target defined in #pragma, but the default (SSE2 only), creating 2 incompatible functions that can't be inlined
942template <bool transposeOut, bool isForwardPass, bool isBorder, int channels, typename OutType, typename InType, typename SIMD_Type>
943static /*FORCE_INLINE*/ void Convolve1DHorizontal(SimpleImage<OutType> out,
944 SimpleImage<InType> in,
945 double *borderValues,
946 ssize_t xStart, ssize_t xEnd, ssize_t width, ssize_t height,
947 SIMD_Type *vCoefficients, double M[N * N])
948{
949#if 0
950
951 Convolve1DHorizontalRef<transposeOut, isForwardPass, isBorder, channels>(out,
952 in,
953 borderValues,
954 xStart, xEnd, width, height,
955 vCoefficients, M);
956 return;
957
958#endif
959 const ssize_t xStep = isForwardPass ? 1 : -1;
960 if (channels == 4)
961 {
962#ifdef __AVX__
963 ssize_t y = 0;
964 do
965 {
966 __m256d prevOut[N];
967
968 ssize_t x = xStart;
969 if (isBorder && !isForwardPass)
970 {
971 // condition: xStart must be width - 1
972 double u[N + 1][channels]; //[x][channels]
973 for (ssize_t i = 0; i < N + 1; ++i)
974 {
975 __m256d temp;
976 _mm256_storeu_pd(u[i], LoadDoubles(temp, &in[y][(xStart + i * xStep) * channels]));
977 }
978 double backwardsInitialState[N][channels];
979 calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
980 for (ssize_t i = 0; i < N; ++i)
981 LoadDoubles(prevOut[i], backwardsInitialState[i]);
982
983 if (transposeOut)
984 StoreDoubles(&out[x][y * channels], prevOut[0]);
985 else
986 StoreDoubles(&out[y][x * channels], prevOut[0]);
987
988 x += xStep;
989 if (x == xEnd)
990 goto nextIteration;
991 }
992 else if (isBorder && isForwardPass)
993 {
994 __m256d firstPixel;
995 LoadDoubles(firstPixel, &in[y][0 * channels]);
996 for (ssize_t i = 0; i < N; ++i)
997 prevOut[i] = firstPixel;
998 }
999 else
1000 {
1001 for (ssize_t i = 0; i < N; ++i)
1002 {
1003 if (transposeOut)
1004 LoadDoubles(prevOut[i], &out[xStart - (i + 1) * xStep][y * channels]);
1005 else
1006 LoadDoubles(prevOut[i], &out[y][(xStart - (i + 1) * xStep) * channels]);
1007 }
1008 }
1009
1010#if 0 // no measurable speedup
1011 // same as loop below, but unrolled 3 times to increase ||ism, hide latency, and reduce overhead of shifting the sliding window (prevOut)
1012 __m256d vIn;
1013 LoadDoubles(vIn, &in[y][xStart * channels]);
1014 for ( ; isForwardPass ? (x < xEnd - 3) : (x > xEnd + 3); )
1015 {
1016 __m256d vSum;
1017 DoOneIIR<channels, transposeOut, xStep, 0, 1, 2>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1018 prevOut[2] = vSum;
1019 x += xStep;
1020
1021 DoOneIIR<channels, transposeOut, xStep, 2, 0, 1>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1022 prevOut[1] = vSum;
1023 x += xStep;
1024
1025 DoOneIIR<channels, transposeOut, xStep, 1, 2, 0>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1026 prevOut[0] = vSum;
1027 x += xStep;
1028 }
1029#endif
1030 while (isForwardPass ? (x < xEnd) : (x > xEnd))
1031 {
1032 __m256d vIn, vSum;
1033 LoadDoubles(vIn, &in[y][x * channels]);
1034
1035 // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
1036 #ifdef __FMA__
1037 // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster
1038 vSum = MultiplyAdd(vIn, vCoefficients[0], prevOut[2] * vCoefficients[3]) + MultiplyAdd(prevOut[1], vCoefficients[2], prevOut[0] * vCoefficients[1]);
1039 #else
1040 vSum = prevOut[0] * vCoefficients[1]
1041 + prevOut[1] * vCoefficients[2]
1042 + prevOut[2] * vCoefficients[3]
1043 + vIn * vCoefficients[0];
1044 #endif
1045 if (transposeOut)
1046 StoreDoubles(&out[x][y * channels], vSum);
1047 else
1048 StoreDoubles(&out[y][x * channels], vSum);
1049
1050 prevOut[2] = prevOut[1];
1051 prevOut[1] = prevOut[0];
1052 prevOut[0] = vSum;
1053 x += xStep;
1054 }
1055 nextIteration:
1056 ++y;
1057 } while (y < height);
1058#else
1059 // todo: yuck, find some way to refactor (emulate m256d perhaps?)
1060 ssize_t y = 0;
1061 do
1062 {
1063 __m128d prevOut[N][2];
1064
1065 ssize_t x = xStart;
1066 if (isBorder && !isForwardPass)
1067 {
1068 // condition: xStart must be width - 1
1069 double u[N + 1][channels]; //[x][channels]
1070 for (ssize_t i = 0; i < N + 1; ++i)
1071 {
1072 __m128d temp;
1073 _mm_storeu_pd(u[i], LoadDoubles(temp, &in[y][(xStart + i * xStep) * channels]));
1074 _mm_storeu_pd(&u[i][2], LoadDoubles(temp, &in[y][(xStart + i * xStep) * channels + 2]));
1075 }
1076 double backwardsInitialState[N][channels];
1077 calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1078 for (ssize_t i = 0; i < N; ++i)
1079 {
1080 LoadDoubles(prevOut[i][0], backwardsInitialState[i]);
1081 LoadDoubles(prevOut[i][1], &backwardsInitialState[i][2]);
1082 }
1083
1084 if (transposeOut)
1085 {
1086 StoreDoubles(&out[x][y * channels], prevOut[0][0]);
1087 StoreDoubles(&out[x][y * channels + 2], prevOut[0][1]);
1088 }
1089 else
1090 {
1091 StoreDoubles(&out[y][x * channels], prevOut[0][0]);
1092 StoreDoubles(&out[y][x * channels + 2], prevOut[0][1]);
1093 }
1094
1095 x += xStep;
1096 if (x == xEnd)
1097 goto nextIteration;
1098 }
1099 else if (isBorder && isForwardPass)
1100 {
1101 __m128d firstPixel[2];
1102 LoadDoubles(firstPixel[0], &in[y][0 * channels]);
1103 LoadDoubles(firstPixel[1], &in[y][0 * channels + 2]);
1104 for (ssize_t i = 0; i < N; ++i)
1105 {
1106 prevOut[i][0] = firstPixel[0];
1107 prevOut[i][1] = firstPixel[1];
1108 }
1109 }
1110 else
1111 {
1112 for (ssize_t i = 0; i < N; ++i)
1113 {
1114 if (transposeOut)
1115 {
1116 LoadDoubles(prevOut[i][0], &out[xStart - (i + 1) * xStep][y * channels]);
1117 LoadDoubles(prevOut[i][1], &out[xStart - (i + 1) * xStep][y * channels + 2]);
1118 }
1119 else
1120 {
1121 LoadDoubles(prevOut[i][0], &out[y][(xStart - (i + 1) * xStep) * channels]);
1122 LoadDoubles(prevOut[i][1], &out[y][(xStart - (i + 1) * xStep) * channels + 2]);
1123 }
1124 }
1125 }
1126
1127 while (isForwardPass ? (x < xEnd) : (x > xEnd))
1128 {
1129 __m128d vIn[2], vSum[2];
1130 LoadDoubles(vIn[0], &in[y][x * channels]);
1131 LoadDoubles(vIn[1], &in[y][x * channels + 2]);
1132
1133 // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
1134 vSum[0] = prevOut[0][0] * vCoefficients[1][0]
1135 + prevOut[1][0] * vCoefficients[2][0]
1136 + prevOut[2][0] * vCoefficients[3][0]
1137 + vIn[0] * vCoefficients[0][0];
1138
1139 vSum[1] = prevOut[0][1] * vCoefficients[1][1]
1140 + prevOut[1][1] * vCoefficients[2][1]
1141 + prevOut[2][1] * vCoefficients[3][1]
1142 + vIn[1] * vCoefficients[0][1];
1143 if (transposeOut)
1144 {
1145 StoreDoubles(&out[x][y * channels], vSum[0]);
1146 StoreDoubles(&out[x][y * channels + 2], vSum[1]);
1147 }
1148 else
1149 {
1150 StoreDoubles(&out[y][x * channels], vSum[0]);
1151 StoreDoubles(&out[y][x * channels + 2], vSum[1]);
1152 }
1153 prevOut[2][0] = prevOut[1][0];
1154 prevOut[2][1] = prevOut[1][1];
1155 prevOut[1][0] = prevOut[0][0];
1156 prevOut[1][1] = prevOut[0][1];
1157 prevOut[0][0] = vSum[0];
1158 prevOut[0][1] = vSum[1];
1159 x += xStep;
1160 }
1161 nextIteration:
1162 ++y;
1163 } while (y < height);
1164#endif
1165 }
1166 else
1167 {
1168 ssize_t y = 0;
1169 do
1170 {
1171 if (isForwardPass)
1172 {
1173 ssize_t x = xStart;
1174 __m128d feedback[2],
1175 k0[2];
1176 #ifdef __AVX__
1177 k0[0] = Cast256To128(vCoefficients[0]);
1178 k0[1] = _mm256_extractf128_pd(vCoefficients[0], 1);
1179 #else
1180 k0[0] = vCoefficients[0];
1181 k0[1] = vCoefficients[1];
1182 #endif
1183
1184 if (isBorder && isForwardPass)
1185 {
1186 // xStart must be 0
1187 feedback[0] = feedback[1] = _mm_set1_pd(in[y][0]);
1188 }
1189 else
1190 {
1191 LoadDoubles(feedback[0], &out[y][xStart - 3 * xStep]);
1192 LoadDoubles(feedback[1], &out[y][xStart - 1 * xStep]);
1193
1194 feedback[1] = _mm_shuffle_pd(feedback[0], feedback[1], _MM_SHUFFLE2(0, 1));
1195 feedback[0] = _mm_shuffle_pd(feedback[0], feedback[0], _MM_SHUFFLE2(0, 0));
1196 }
1197 // feedback = [garbage y-3 y-2 y-1]
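// Scalar (1-channel) double IIR step: feedback[0] = [input slot, out[x-3]] and
// feedback[1] = [out[x-2], out[x-1]]; the recurrence is evaluated as two 2-wide dot products
// against k0 = {b0, b3} and {b2, b1}, and the new output is blended back into the input slot
// so the window can be advanced with two shuffles.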
1198 for (; x != xEnd; x += xStep)
1199 {
1200 __m128d _in = _mm_set1_pd(in[y][x]),
1201 newOutput;
1202 #ifdef __SSE4_1__
1203 feedback[0] = _mm_blend_pd(feedback[0], _in, 0x1);
1204 newOutput = _mm_add_pd(_mm_dp_pd(feedback[0], k0[0], 0x31),
1205 _mm_dp_pd(feedback[1], k0[1], 0x31));
1206 feedback[0] = _mm_blend_pd(feedback[0], newOutput, 0x1); // insert back input
1207 #else
1208 __m128d FIRST_ELEMENT_MASK = _mm_castsi128_pd(_mm_set_epi64x(0, ~uint64_t(0)));
1209 feedback[0] = Select(feedback[0], _in, FIRST_ELEMENT_MASK);
1210
1211 __m128d partialDP = _mm_add_pd
1212 (
1213 _mm_mul_pd(feedback[0], k0[0]),
1214 _mm_mul_pd(feedback[1], k0[1])
1215 );
1216 newOutput = _mm_add_pd
1217 (
1218 partialDP,
1219 _mm_shuffle_pd(partialDP, partialDP, _MM_SHUFFLE2(0, 1))
1220 );
1221 feedback[0] = Select(feedback[0], newOutput, FIRST_ELEMENT_MASK); // insert back input
1222 #endif
1223 out[y][x] = _mm_cvtsd_f64(newOutput);
1224 feedback[0] = _mm_shuffle_pd(feedback[0], feedback[1], _MM_SHUFFLE2(0, 0));
1225 feedback[1] = _mm_shuffle_pd(feedback[1], feedback[0], _MM_SHUFFLE2(0, 1));
1226 }
1227 }
1228 else
1229 {
1230 __m128d feedback[2], k4[2];
1231 #ifdef __AVX__
1232 k4[0] = Cast256To128(vCoefficients[4]);
1233 k4[1] = _mm256_extractf128_pd(vCoefficients[4], 1);
1234 #else
1235 k4[0] = vCoefficients[8];
1236 k4[1] = vCoefficients[9];
1237 #endif
1238 ssize_t x = xStart;
1239 if (isBorder && !isForwardPass)
1240 {
1241 // xstart must be width - 1
1242 double u[N + 1][1]; //[x][y][channels]
1243 for (ssize_t i = 0; i < N + 1; ++i)
1244 {
1245 u[i][0] = in[y][xStart + i * xStep];
1246 }
1247 #define ROUND_UP(a, b) (((a) + (b) - 1) / (b) * (b))
1248 double backwardsInitialState[ROUND_UP(N, 2)][1]; // pad so vector loads don't go past end
1249 calcTriggsSdikaInitialization<1>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1250
1251 feedback[0] = _mm_loadu_pd(&backwardsInitialState[0][0]); // unaligned load: a local double array is only guaranteed 8-byte alignment
1252 feedback[1] = _mm_loadu_pd(&backwardsInitialState[2][0]);
1253
1254 out[y][x] = backwardsInitialState[0][0];
1255 x += xStep;
1256 if (x == xEnd)
1257 { ++y; continue; } // a bare 'continue' would skip the ++y at the bottom of the enclosing do-while
1258 }
1259 else
1260 {
1261 LoadDoubles(feedback[0], &out[y][xStart - xStep]);
1262 LoadDoubles(feedback[1], &out[y][xStart - 3 * xStep]);
1263 }
1264
1265 for (; x != xEnd; x += xStep)
1266 {
1267 __m128d _in = _mm_set1_pd(in[y][x]),
1268 newOutput;
1269 #ifdef __SSE4_1__
1270 feedback[1] = _mm_blend_pd(feedback[1], _in, 0x2);
1271 newOutput = _mm_add_pd(_mm_dp_pd(feedback[0], k4[0], 0x32),
1272 _mm_dp_pd(feedback[1], k4[1], 0x32));
1273 feedback[1] = _mm_blend_pd(feedback[1], newOutput, 0x2); // insert back input
1274 #else
1275 __m128d LAST_ELEMENT_MASK = _mm_castsi128_pd(_mm_set_epi64x(~uint64_t(0), 0));
1276 feedback[1] = Select(feedback[1], _in, LAST_ELEMENT_MASK);
1277
1278 __m128d partialDP = _mm_add_pd
1279 (
1280 _mm_mul_pd(feedback[0], k4[0]),
1281 _mm_mul_pd(feedback[1], k4[1])
1282 );
1283 newOutput = _mm_add_pd
1284 (
1285 partialDP,
1286 _mm_shuffle_pd(partialDP, partialDP, _MM_SHUFFLE2(0, 0))
1287 );
1288 feedback[1] = Select(feedback[1], newOutput, LAST_ELEMENT_MASK);
1289 #endif
1290
1291 __m128d temp = _mm_shuffle_pd(feedback[1], feedback[1], _MM_SHUFFLE2(0, 1));
1292 out[y][x] = _mm_cvtsd_f64(temp);
1293
1294 feedback[1] = _mm_shuffle_pd(feedback[0], feedback[1], _MM_SHUFFLE2(1, 1));
1295 feedback[0] = _mm_shuffle_pd(feedback[1], feedback[0], _MM_SHUFFLE2(0, 1));
1296 }
1297 }
1298 ++y;
1299 } while (y < height);
1300 }
1301}
1302
1303
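// Same software-pipelined IIR step as the double version above, but for 4-channel float data with
// two image rows packed into one __m256 (row y in the low 128 bits, row y+1 in the high 128 bits).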
1304template <int channels, bool transposeOut, ssize_t xStep, int i0, int i1, int i2, typename OutType, typename InType>
1305FORCE_INLINE void DoOneIIR(SimpleImage<OutType> out, SimpleImage<InType> in, __m256 &vSum, __m256 &vIn, ssize_t x, ssize_t y, __m256 vCoefficients[N + 1], __m256 prevOut[N])
1306{
1307 vSum = vIn * vCoefficients[0];
1308
1309 // load data for next iteration early to hide latency (software pipelining)
1310 vIn = Load4x2Floats(&in[y][(x + xStep) * channels],
1311 &in[y + 1][(x + xStep) * channels]);
1312
1313 // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
1314#ifdef __FMA__
1315 // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster
1316 vSum = MultiplyAdd(prevOut[i2], vCoefficients[3], vSum) + MultiplyAdd(prevOut[i1], vCoefficients[2], prevOut[i0] * vCoefficients[1]);
1317#else
1318 vSum = prevOut[i0] * vCoefficients[1]
1319 + prevOut[i1] * vCoefficients[2]
1320 + prevOut[i2] * vCoefficients[3]
1321 + vSum; // vSum already holds the current input * vCoefficients[0]; vIn was overwritten above with the next pixel
1322#endif
1323 if (transposeOut)
1324 {
1325 StoreFloats(&out[x][y * channels], _mm256_castps256_ps128(vSum));
1326 StoreFloats(&out[x][(y + 1) * channels], _mm256_extractf128_ps(vSum, 1));
1327 }
1328 else
1329 {
1330 StoreFloats(&out[y][x * channels], _mm256_castps256_ps128(vSum));
1331 StoreFloats(&out[y + 1][x * channels], _mm256_extractf128_ps(vSum, 1));
1332 }
1333}
1334
1335// hack: GCC fails to compile when FORCE_INLINE on, most likely because OpenMP doesn't generate code using the target defined in #pragma, but the default (SSE2 only), creating 2 incompatible functions that can't be inlined
1336template <bool transposeOut, bool isForwardPass, bool isBorder, int channels, typename OutType, typename InType, typename SIMD_Type>
1337static /*FORCE_INLINE*/void Convolve1DHorizontal(SimpleImage<OutType> out,
1338 SimpleImage<InType> in,
1339 float *borderValues,
1340 ssize_t xStart, ssize_t xEnd, ssize_t width, ssize_t height,
1341 SIMD_Type *vCoefficients, double M[N * N])
1342{
1343#if 0
1344 MyTraits<float>::SIMDtype coefficients2[4];
1345
1346 if (channels == 1)
1347 {
1348 coefficients2[0] = _mm256_set1_ps(((float *)vCoefficients)[0]);
1349 coefficients2[1] = _mm256_set1_ps(((float *)vCoefficients)[3]);
1350 coefficients2[2] = _mm256_set1_ps(((float *)vCoefficients)[2]);
1351 coefficients2[3] = _mm256_set1_ps(((float *)vCoefficients)[1]);
1352 vCoefficients = coefficients2;
1353 }
1354 Convolve1DHorizontalRef<transposeOut, isForwardPass, isBorder, channels>(out,
1355 in,
1356 borderValues,
1357 xStart, xEnd, width, height,
1358 vCoefficients, M);
1359 return;
1360#endif
1361 const ssize_t xStep = isForwardPass ? 1 : -1;
1362
1363 if (channels == 4)
1364 {
1365 ssize_t y = 0;
1366#ifdef __AVX__
1367 for (; y <= height - 2; y += 2) // AVX code processes 2 rows at a time
1368 {
1369 __m256 prevOut[N];
1370
1371 ssize_t x = xStart;
1372
1373 if (isBorder && !isForwardPass)
1374 {
1375 float u[N + 1][2 * channels]; //[x][y][channels]
1376 for (ssize_t i = 0; i < N + 1; ++i)
1377 {
1378 __m128 temp;
1379 _mm_storeu_ps(&u[i][0], LoadFloats(temp, &in[y][(xStart + i * xStep) * channels]));
1380 _mm_storeu_ps(&u[i][channels], LoadFloats(temp, &in[y + 1][(xStart + i * xStep) * channels]));
1381 }
1382 float backwardsInitialState[N][2 * channels];
1383 calcTriggsSdikaInitialization<2 * channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1384 for (ssize_t i = 0; i < N; ++i)
1385 LoadFloats(prevOut[i], backwardsInitialState[i]);
1386
1387 if (transposeOut)
1388 {
1389 StoreFloats(&out[x][y * channels], _mm256_castps256_ps128(prevOut[0]));
1390 StoreFloats(&out[x][(y + 1) * channels], _mm256_extractf128_ps(prevOut[0], 1));
1391 }
1392 else
1393 {
1394 StoreFloats(&out[y][x * channels], _mm256_castps256_ps128(prevOut[0]));
1395 StoreFloats(&out[y + 1][x * channels], _mm256_extractf128_ps(prevOut[0], 1));
1396 }
1397 x += xStep;
1398 if (x == xEnd)
1399 continue;
1400 }
1401 else if (isBorder && isForwardPass)
1402 {
1403 // xStart must be 0
1404 __m256 firstPixel = Load4x2Floats(&in[y][0 * channels], &in[y + 1][0 * channels]);
1405 for (ssize_t i = 0; i < N; ++i)
1406 prevOut[i] = firstPixel;
1407 }
1408 else
1409 {
1410 for (ssize_t i = 0; i < N; ++i)
1411 {
1412 prevOut[i] = transposeOut ? Load4x2Floats(&out[xStart - (i + 1) * xStep][y * channels],
1413 &out[xStart - (i + 1) * xStep][(y + 1) * channels])
1414 : Load4x2Floats(&out[y][(xStart - (i + 1) * xStep) * channels],
1415 &out[y + 1][(xStart - (i + 1) * xStep) * channels]);
1416 }
1417 }
1418
1419#if 0 // 2x slower than no unrolling - too many register spills?
1420 // same as loop below, but unrolled 3 times to increase ||ism, hide latency, and reduce overhead of shifting the sliding window (prevOut)
1421 __m256 vIn = Load4x2Floats(&in[y][xStart * channels],
1422 &in[y + 1][xStart * channels]);
1423
1424 for (; isForwardPass ? (x < xEnd - 3) : (x > xEnd + 3); )
1425 {
1426 __m256 vSum;
1427 DoOneIIR<channels, transposeOut, xStep, 0, 1, 2>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1428 prevOut[2] = vSum;
1429 x += xStep;
1430
1431 DoOneIIR<channels, transposeOut, xStep, 2, 0, 1>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1432 prevOut[1] = vSum;
1433 x += xStep;
1434
1435 DoOneIIR<channels, transposeOut, xStep, 1, 2, 0>(out, in, vSum, vIn, x, y, vCoefficients, prevOut);
1436 prevOut[0] = vSum;
1437 x += xStep;
1438 }
1439#endif
1440 for (; x != xEnd; x += xStep)
1441 {
1442 __m256 vIn = Load4x2Floats(&in[y][x * channels],
1443 &in[y + 1][x * channels]),
1444 vSum;
1445
1446 // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
1447#ifdef __FMA__
1448 // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster
1449 vSum = MultiplyAdd(vIn, vCoefficients[0], prevOut[2] * vCoefficients[3]) + MultiplyAdd(prevOut[1], vCoefficients[2], prevOut[0] * vCoefficients[1]);
1450#else
1451 vSum = prevOut[0] * vCoefficients[1]
1452 + prevOut[1] * vCoefficients[2]
1453 + prevOut[2] * vCoefficients[3]
1454 + vIn * vCoefficients[0];
1455#endif
1456
1457 if (transposeOut)
1458 {
1459 StoreFloats(&out[x][y * channels], _mm256_castps256_ps128(vSum));
1460 StoreFloats(&out[x][(y + 1) * channels], _mm256_extractf128_ps(vSum, 1));
1461 }
1462 else
1463 {
1464 StoreFloats(&out[y][x * channels], _mm256_castps256_ps128(vSum));
1465 StoreFloats(&out[y + 1][x * channels], _mm256_extractf128_ps(vSum, 1));
1466 }
1467 prevOut[2] = prevOut[1];
1468 prevOut[1] = prevOut[0];
1469 prevOut[0] = vSum;
1470 }
1471 }
1472#endif
1473 for (; y < height; ++y)
1474 {
1475 __m128 prevOut[N];
1476 ssize_t x = xStart;
1477
1478 if (isBorder && !isForwardPass)
1479 {
1480 float u[N + 1][channels]; //[x][channels]
1481 for (ssize_t i = 0; i < N + 1; ++i)
1482 {
1483 __m128 temp;
1484 _mm_storeu_ps(u[i], LoadFloats(temp, &in[y][(xStart + i * xStep) * channels]));
1485 }
1486 float backwardsInitialState[N][channels];
1487 calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1488 for (ssize_t i = 0; i < N; ++i)
1489 LoadFloats(prevOut[i], backwardsInitialState[i]);
1490
1491 if (transposeOut)
1492 StoreFloats(&out[x][y * channels], prevOut[0]);
1493 else
1494 StoreFloats(&out[y][x * channels], prevOut[0]);
1495 x += xStep;
1496 if (x == xEnd)
1497 continue;
1498 }
1499 else if (isBorder && isForwardPass)
1500 {
1501 // xStart must be 0
1502 __m128 firstPixel;
1503 LoadFloats(firstPixel, &in[y][0 * channels]);
1504 for (ssize_t i = 0; i < N; ++i)
1505 prevOut[i] = firstPixel;
1506 }
1507 else
1508 {
1509 for (ssize_t i = 0; i < N; ++i)
1510 {
1511 if (transposeOut)
1512 LoadFloats(prevOut[i], &out[xStart - (i + 1) * xStep][y * channels]);
1513 else
1514 LoadFloats(prevOut[i], &out[y][(xStart - (i + 1) * xStep) * channels]);
1515 }
1516 }
1517
1518 do
1519 {
1520 __m128 vIn, vSum;
1521 LoadFloats(vIn, &in[y][x * channels]);
1522 // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3]
1523#ifdef __FMA__
1524 // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster
1525 vSum = MultiplyAdd(vIn, Cast256To128(vCoefficients[0]), prevOut[2] * Cast256To128(vCoefficients[3])) + MultiplyAdd(prevOut[1], Cast256To128(vCoefficients[2]), prevOut[0] * Cast256To128(vCoefficients[1]));
1526#else
1527 vSum = prevOut[0] * Cast256To128(vCoefficients[1])
1528 + prevOut[1] * Cast256To128(vCoefficients[2])
1529 + prevOut[2] * Cast256To128(vCoefficients[3])
1530 + vIn * Cast256To128(vCoefficients[0]);
1531#endif
1532 if (transposeOut)
1533 {
1534 StoreFloats(&out[x][y * channels], vSum);
1535 }
1536 else
1537 {
1538 StoreFloats(&out[y][x * channels], vSum);
1539 }
1540 prevOut[2] = prevOut[1];
1541 prevOut[1] = prevOut[0];
1542 prevOut[0] = vSum;
1543 x += xStep;
1544 } while (x != xEnd);
1545 }
1546 }
1547 else
1548 {
1549 //static_assert(!transposeOut, "transpose not supported");
1550 ssize_t y = 0;
1551
1552 const ssize_t Y_BLOCK_SIZE = 8;
1553#ifdef __AVX__
1554 for (; y <= height - Y_BLOCK_SIZE; y += Y_BLOCK_SIZE)
1555 {
1556 if (isForwardPass)
1557 {
1558 ssize_t x = xStart;
1559 __m256 feedback[4],
1560 k0 = vCoefficients[0],
1561 k1 = vCoefficients[1],
1562 k2 = vCoefficients[2],
1563 k3 = vCoefficients[3];
1564
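// Eight rows are handled per y iteration: feedback[i] packs row y+2*i in its low 128-bit lane and
// row y+2*i+1 in its high lane; on entry element 0 of each lane is a free slot for the next input
// and elements 1..3 hold out[x-3..x-1]. k0..k3 are the four pre-rotated coefficient vectors built
// in ConvolveHorizontal, one per output position of the 4-wide inner loop below.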
1565 if (isBorder && isForwardPass)
1566 {
1567 // xStart must be 0
1568 for (ssize_t i = 0; i < 4; ++i)
1569 {
1570 feedback[i] = _mm256_setr_m128(_mm_set1_ps(in[y + i * 2][0]),
1571 _mm_set1_ps(in[y + i * 2 + 1][0]));
1572 }
1573 }
1574 else
1575 {
1576 for (ssize_t i = 0; i < 4; ++i)
1577 {
1578 feedback[i] = Load4x2Floats(&out[y + i * 2][xStart - 3 * xStep],
1579 &out[y + i * 2 + 1][xStart - 3 * xStep]);
1580 feedback[i] = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(2, 1, 0, 0));
1581 }
1582 }
1583 // feedback0 = [garbage y-3 y-2 y-1]
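// The unrolled body produces 4 consecutive outputs per row pair per iteration: each _mm256_dp_ps
// evaluates the full recurrence, the low nibble of its mask (0x1/0x2/0x4/0x8) picks which element
// of each 128-bit half receives the sum, and the result is blended back immediately so it becomes
// history for the next element's dot product.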
1584 for (; x <= xEnd - 4; x += 4)
1585 {
1586 __m256 _in[4];
1587 for (ssize_t i = 0; i < 4; ++i)
1588 _in[i] = Load4x2Floats(&in[y + i * 2][x], &in[y + i * 2 + 1][x]);
1589
1590 for (int i = 0; i < 4; ++i)
1591 {
1592 feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x11);
1593 feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k0, 0xf1), 0x11); // insert back input
1594 }
1595
1596 for (int i = 0; i < 4; ++i)
1597 {
1598 feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x22);
1599 feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k1, 0xf2), 0x22); // insert back input
1600 }
1601
1602 for (int i = 0; i < 4; ++i)
1603 {
1604 feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x44);
1605 feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k2, 0xf4), 0x44); // insert back input
1606 }
1607
1608 for (ssize_t i = 0; i < 4; ++i)
1609 {
1610 feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x88);
1611 feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k3, 0xf8), 0x88); // insert back input
1612
1613 _mm_storeu_ps((float *)&out[y + i * 2][x], _mm256_castps256_ps128(feedback[i]));
1614 _mm_storeu_ps((float *)&out[y + i * 2 + 1][x], _mm256_extractf128_ps(feedback[i], 1));
1615 }
1616 }
1617 for (; x != xEnd; x += xStep)
1618 {
1619 // todo: make these scalar loads to avoid buffer overflow
1620 __m256 _in[4];
1621 for (ssize_t i = 0; i < 4; ++i)
1622 {
1623 _in[i] = Load4x2Floats(&in[y + i * 2][x],
1624 &in[y + i * 2 + 1][x]);
1625 }
1626
1627 for (int i = 0; i < 4; ++i)
1628 {
1629 feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x11);
1630 feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k0, 0xf1), 0x11); // insert back input
1631 }
1632
1633 for (ssize_t i = 0; i < 4; ++i)
1634 {
1635 out[y + i * 2][x] = _mm_cvtss_f32(_mm256_castps256_ps128(feedback[i]));
1636 out[y + i * 2 + 1][x] = _mm_cvtss_f32(_mm256_extractf128_ps(feedback[i], 1));
1637 }
1638
1639 for (int i = 0; i < 4; ++i)
1640 feedback[i] = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(0, 3, 2, 0));
1641 }
1642 }
1643 else
1644 {
1645 __m256 feedback[4],
1646 k4 = vCoefficients[4],
1647 k5 = vCoefficients[5],
1648 k6 = vCoefficients[6],
1649 k7 = vCoefficients[7];
1650 ssize_t x = xStart;
1651 if (isBorder && !isForwardPass)
1652 {
1653 // xstart must be width - 1
1654 float u[N + 1][8 * channels]; //[x][y][channels]
1655 for (ssize_t i = 0; i < N + 1; ++i)
1656 {
1657 for (ssize_t _y = 0; _y < 8; ++_y)
1658 u[i][_y] = in[y + _y][xStart + i * xStep];
1659 }
1660 float backwardsInitialState[N][8 * channels];
1661 calcTriggsSdikaInitialization<8 * channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1662
1663 for (ssize_t i = 0; i < 4; ++i)
1664 {
1665 float temp[2][N + 1];
1666 for (ssize_t j = 0; j < N; ++j)
1667 {
1668 temp[0][j] = backwardsInitialState[j][i * 2];
1669 temp[1][j] = backwardsInitialState[j][i * 2 + 1];
1670 }
1671 feedback[i] = Load4x2Floats(temp[0], temp[1]);
1672 }
1673
1674 for (ssize_t _y = 0; _y < Y_BLOCK_SIZE; ++_y)
1675 out[y + _y][x] = backwardsInitialState[0][_y];
1676
1677 x += xStep;
1678 if (x == xEnd)
1679 continue;
1680 }
1681 else
1682 {
1683 for (ssize_t i = 0; i < 4; ++i)
1684 {
1685 feedback[i] = Load4x2Floats(&out[y + i * 2][xStart - xStep],
1686 &out[y + i * 2 + 1][xStart - xStep]);
1687 }
1688 }
1689 for (; x - 4 >= xEnd; x -= 4)
1690 {
1691 __m256 _in[4];
1692 for (ssize_t i = 0; i < 4; ++i)
1693 {
1694 _in[i] = Load4x2Floats(&in[y + i * 2][x - 3],
1695 &in[y + i * 2 + 1][x - 3]);
1696 }
1697
1698 for (int i = 0; i < 4; ++i)
1699 {
1700 feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x88);
1701 feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k4, 0xf8), 0x88); // insert back input
1702 }
1703
1704 for (int i = 0; i < 4; ++i)
1705 {
1706 feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x44);
1707 feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k5, 0xf4), 0x44); // insert back input
1708 }
1709
1710 for (int i = 0; i < 4; ++i)
1711 {
1712 feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x22);
1713 feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k6, 0xf2), 0x22); // insert back input
1714 }
1715
1716 for (ssize_t i = 0; i < 4; ++i)
1717 {
1718 feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x11);
1719 feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k7, 0xf1), 0x11); // insert back input
1720
1721 StoreFloats(&out[y + i * 2][x - 3], _mm256_castps256_ps128(feedback[i]));
1722 StoreFloats(&out[y + i * 2 + 1][x - 3], _mm256_extractf128_ps(feedback[i], 1));
1723 }
1724 }
1725
1726 for ( ; x != xEnd; x += xStep)
1727 {
1728 // todo: make these scalar loads to avoid buffer overflow
1729 __m256 _in[4];
1730 for (ssize_t i = 0; i < 4; ++i)
1731 {
1732 _in[i] = Load4x2Floats(&in[y + i * 2][x - 3],
1733 &in[y + i * 2 + 1][x - 3]);
1734 }
1735
1736 for (int i = 0; i < 4; ++i)
1737 {
1738 feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x88);
1739 feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k4, 0xf8), 0x88); // insert back input
1740 }
1741
1742 for (ssize_t i = 0; i < 4; ++i)
1743 {
1744 __m256 temp = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(0, 0, 0, 3));
1745 out[y + i * 2][x] = _mm_cvtss_f32(_mm256_castps256_ps128(temp));
1746 out[y + i * 2 + 1][x] = _mm_cvtss_f32(_mm256_extractf128_ps(temp, 1));
1747 }
1748
1749
1750 for (int i = 0; i < 4; ++i)
1751 feedback[i] = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(2, 1, 0, 3));
1752 }
1753 }
1754 }
1755#endif
1756 for (; y < height; ++y)
1757 {
1758 if (isForwardPass)
1759 {
1760 ssize_t x = xStart;
1761 __m128 feedback0,
1762 k0 = Cast256To128(vCoefficients[0]);
1763
1764 if (isBorder && isForwardPass)
1765 {
1766 // xStart must be 0
1767 feedback0 = _mm_set1_ps(in[y][0]);
1768 }
1769 else
1770 {
1771 LoadFloats(feedback0, &out[y][xStart - 3 * xStep]);
1772 feedback0 = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(2, 1, 0, 0));
1773 }
1774 // feedback0 = [garbage y-3 y-2 y-1]
1775 for (; x != xEnd; x += xStep)
1776 {
1777 __m128 _in0 = _mm_set1_ps(in[y][x]);
1778
1779 #ifdef __SSE4_1__
1780 feedback0 = _mm_blend_ps(feedback0, _in0, 0x1);
1781 feedback0 = _mm_blend_ps(feedback0, _mm_dp_ps(feedback0, k0, 0xf1), 0x1); // insert back input
1782 #else
1783 const __m128 FIRST_ELEMENT_MASK = _mm_castsi128_ps(_mm_set_epi32(0, 0, 0, ~0));
1784 feedback0 = Select(feedback0, _in0, FIRST_ELEMENT_MASK);
1785
1786 __m128 partialDP = _mm_mul_ps(feedback0, k0);
1787 partialDP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(0, 0, 3, 2)));
1788 __m128 DP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(0, 0, 0, 1)));
1789
1790 feedback0 = Select(feedback0, DP, FIRST_ELEMENT_MASK); // insert back input
1791 #endif
1792 out[y][x] = _mm_cvtss_f32(feedback0);
1793 feedback0 = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(0, 3, 2, 0));
1794 }
1795 }
1796 else
1797 {
1798 __m128 feedback0,
1799 k4 = Cast256To128(vCoefficients[4]);
1800
1801 ssize_t x = xStart;
1802 if (isBorder && !isForwardPass)
1803 {
1804 // xstart must be width - 1
1805 float u[N + 1][channels]; //[x][y][channels]
1806 for (ssize_t i = 0; i < N + 1; ++i)
1807 {
1808 u[i][0] = in[y][xStart + i * xStep];
1809 }
1810 float backwardsInitialState[N][channels];
1811 calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1812
1813 float temp[N + 1];
1814 for (ssize_t i = 0; i < N; ++i)
1815 {
1816 temp[i] = backwardsInitialState[i][0];
1817 }
1818 LoadFloats(feedback0, temp);
1819
1820 out[y][x] = backwardsInitialState[0][0];
1821 x += xStep;
1822 if (x == xEnd)
1823 continue;
1824 }
1825 else
1826 {
1827 LoadFloats(feedback0, &out[y][xStart - xStep]);
1828 }
1829
1830 for (; x != xEnd; x += xStep)
1831 {
1832 __m128 _in0 = _mm_set1_ps(in[y][x]);
1833
1834 #ifdef __SSE4_1__
1835 feedback0 = _mm_blend_ps(feedback0, _in0, 0x8);
1836 feedback0 = _mm_blend_ps(feedback0, _mm_dp_ps(feedback0, k4, 0xf8), 0x8); // insert back input
1837 #else
1838 const __m128 LAST_ELEMENT_MASK = _mm_castsi128_ps(_mm_set_epi32(~0, 0, 0, 0));
1839 feedback0 = Select(feedback0, _in0, LAST_ELEMENT_MASK);
1840
1841 __m128 partialDP = _mm_mul_ps(feedback0, k4);
1842 partialDP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(1, 0, 0, 0)));
1843 __m128 DP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(2, 0, 0, 0)));
1844
1845 feedback0 = Select(feedback0, DP, LAST_ELEMENT_MASK); // insert back input
1846 #endif
1847 __m128 temp = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(0, 0, 0, 3));
1848 out[y][x] = _mm_cvtss_f32(temp);
1849 feedback0 = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(2, 1, 0, 3));
1850 }
1851 }
1852 }
1853 }
1854}
1855
1856
1857// does 1D IIR convolution on multiple rows (height) of data
1858// IntermediateType must be float or double
1859template <bool isForwardPass, bool isBorder, typename OutType, typename InType, typename IntermediateType>
1860FORCE_INLINE void Convolve1DVerticalRef(SimpleImage<OutType> out,
1861 SimpleImage<InType> in,
1862 IntermediateType *borderValues, // [y][color]
1863 ssize_t yStart, ssize_t yEnd, ssize_t width, ssize_t height,
1864 typename MyTraits<IntermediateType>::SIMDtype *vCoefficients, double M[N * N])
1865{
1866 ssize_t yStep = isForwardPass ? 1 : -1;
1867
1868 ssize_t x = 0;
1869 do
1870 {
1871 IntermediateType prevOut[N];
1872 ssize_t y = yStart;
1873 if (isBorder && !isForwardPass)
1874 {
1875 IntermediateType u[N + 1][1]; // u[0] = last forward filtered value, u[1] = 2nd last forward filtered value, ...
1876 for (ssize_t i = 0; i < N + 1; ++i)
1877 {
1878 u[i][0] = in[yStart + i * yStep][x];
1879 }
1880 IntermediateType backwardsInitialState[N][1];
1881 calcTriggsSdikaInitialization<1>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1882 for (ssize_t i = 0; i < N; ++i)
1883 prevOut[i] = backwardsInitialState[i][0];
1884
1885 out[y][x] = clip_round_cast<OutType, IntermediateType>(prevOut[0]);
1886 y += yStep;
1887 if (y == yEnd)
1888 goto nextIteration;
1889 }
1890 else if (isBorder && isForwardPass)
1891 {
1892 for (ssize_t i = 0; i < N; ++i)
1893 prevOut[i] = in[0][x];
1894 }
1895 else
1896 {
1897 for (ssize_t i = 0; i < N; ++i)
1898 prevOut[i] = out[yStart - (i + 1) * yStep][x];
1899 }
1900
1901 do
1902 {
1903 IntermediateType sum = prevOut[0] * ExtractElement0(vCoefficients[1])
1904 + prevOut[1] * ExtractElement0(vCoefficients[2])
1905 + prevOut[2] * ExtractElement0(vCoefficients[3])
1906 + in[y][x] * ExtractElement0(vCoefficients[0]); // add last for best accuracy since this term tends to be the smallest
1907
1908
1909 out[y][x] = clip_round_cast<OutType, IntermediateType>(sum);
1910 prevOut[2] = prevOut[1];
1911 prevOut[1] = prevOut[0];
1912 prevOut[0] = sum;
1913 y += yStep;
1914 } while (y != yEnd);
1915 nextIteration:
1916 ++x;
1917 } while (x < width);
1918}
1919
1920
1921
1922// input is always untransposed
1923// for reverse pass, input is output from forward pass
1924// for transposed output, in-place operation isn't possible
1925// hack: GCC fails to compile when FORCE_INLINE on, most likely because OpenMP doesn't generate code using the target defined in #pragma, but the default (SSE2 only), creating 2 incompatible functions that can't be inlined
1926template <bool isForwardPass, bool isBorder, typename OutType, typename InType, typename SIMD_Type>
1927static /*FORCE_INLINE*/ void Convolve1DVertical(SimpleImage<OutType> out,
1928 SimpleImage<InType> in,
1929 float *borderValues,
1930 ssize_t yStart, ssize_t yEnd, ssize_t width, ssize_t height,
1931 SIMD_Type *vCoefficients, double M[N * N])
1932{
1933#if 0
1934 Convolve1DVerticalRef<isForwardPass, isBorder>(out,
1935 in,
1936 borderValues,
1937 yStart, yEnd, width, height,
1938 vCoefficients, M);
1939 return;
1940#endif
1941 const ssize_t yStep = isForwardPass ? 1 : -1;
1942
1943 const int SIMD_WIDTH = 8;
1944 ssize_t x = 0;
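// The vertical pass is the easy direction for SIMD: neighbouring columns are independent, so 8
// columns (AVX) or 4 columns (SSE fallback below) are filtered in lockstep with plain vertical
// vector arithmetic and no per-pixel shuffling.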
1945#ifdef __AVX__
1946 for ( ; x <= width - SIMD_WIDTH; x += SIMD_WIDTH)
1947 {
1948 __m256 prevOut[N];
1949 ssize_t y = yStart;
1950 if (isBorder && !isForwardPass)
1951 {
1952 float u[N + 1][SIMD_WIDTH]; //[x][channels]
1953 for (ssize_t i = 0; i < N + 1; ++i)
1954 {
1955 __m256 temp;
1956 _mm256_storeu_ps(u[i], LoadFloats(temp, &in[yStart + i * yStep][x]));
1957 }
1958 float backwardsInitialState[N][SIMD_WIDTH];
1959 calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState);
1960 for (ssize_t i = 0; i < N; ++i)
1961 LoadFloats(prevOut[i], backwardsInitialState[i]);
1962
1963 StoreFloats(&out[y][x], prevOut[0]);
1964
1965 y += yStep;
1966 if (y == yEnd)
1967 continue;
1968 }
1969 else if (isBorder && isForwardPass)
1970 {
1971 // yStart must be 0
1972 __m256 firstPixel;
1973 LoadFloats(firstPixel, &in[0][x]);
1974 for (ssize_t i = 0; i < N; ++i)
1975 prevOut[i] = firstPixel;
1976 }
1977 else
1978 {
1979 for (ssize_t i = 0; i < N; ++i)
1980 {
1981 LoadFloats(prevOut[i], &out[yStart - (i + 1) * yStep][x]);
1982 }
1983 }
1984
1985 do
1986 {
1987 __m256 vIn;
1988 LoadFloats(vIn, &in[y][x]);
1989 __m256 vSum = vIn * vCoefficients[0];
1990
1991 vSum = prevOut[0] * vCoefficients[1]
1992 + prevOut[1] * vCoefficients[2]
1993 + prevOut[2] * vCoefficients[3]
1994 + vSum;
1995
1996 StoreFloats(&out[y][x], vSum);
1997
1998 prevOut[2] = prevOut[1];
1999 prevOut[1] = prevOut[0];
2000 prevOut[0] = vSum;
2001 y += yStep;
2002 } while (isForwardPass ? (y < yEnd) : (y > yEnd));
2003 }
2004#endif
2005 {
2006 const ssize_t SIMD_WIDTH = 4;
2007 for (; x < width; x += SIMD_WIDTH)
2008 {
2009 __m128 prevOut[N];
2010 ssize_t y = yStart;
2011 if (isBorder && !isForwardPass)
2012 {
2013 float u[N + 1][SIMD_WIDTH]; //[x][channels]
2014 for (ssize_t i = 0; i < N + 1; ++i)
2015 {
2016 __m128 temp;
2017 _mm_storeu_ps(u[i], LoadFloats(temp, &in[yStart + i * yStep][x]));
2018 }
2019 float backwardsInitialState[N][SIMD_WIDTH];
2020 calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState);
2021 for (ssize_t i = 0; i < N; ++i)
2022 LoadFloats(prevOut[i], backwardsInitialState[i]);
2023
2024 StoreFloats<true>(&out[y][x], prevOut[0], min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores
2025
2026 y += yStep;
2027 if (y == yEnd)
2028 continue;
2029 }
2030 else if (isBorder && isForwardPass)
2031 {
2032 // yStart must be 0
2033 __m128 firstPixel;
2034 LoadFloats(firstPixel, &in[0][x]);
2035 for (ssize_t i = 0; i < N; ++i)
2036 prevOut[i] = firstPixel;
2037 }
2038 else
2039 {
2040 for (ssize_t i = 0; i < N; ++i)
2041 {
2042 LoadFloats(prevOut[i], &out[yStart - (i + 1) * yStep][x]);
2043 }
2044 }
2045
2046 do
2047 {
2048 __m128 vIn;
2049 LoadFloats(vIn, &in[y][x]);
2050 __m128 vSum = vIn * Cast256To128(vCoefficients[0]);
2051
2052 vSum = prevOut[0] * Cast256To128(vCoefficients[1])
2053 + prevOut[1] * Cast256To128(vCoefficients[2])
2054 + prevOut[2] * Cast256To128(vCoefficients[3])
2055 + vSum;
2056
2057 StoreFloats<true>(&out[y][x], vSum, min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores
2058
2059 prevOut[2] = prevOut[1];
2060 prevOut[1] = prevOut[0];
2061 prevOut[0] = vSum;
2062 y += yStep;
2063 } while (isForwardPass ? (y < yEnd) : (y > yEnd));
2064 }
2065 }
2066}
2067
2068// input is always untransposed
2069// for reverse pass, input is output from forward pass
2070// for transposed output, in-place operation isn't possible
2071// hack: GCC fails to compile when FORCE_INLINE on, most likely because OpenMP doesn't generate code using the target defined in #pragma, but the default (SSE2 only), creating 2 incompatible functions that can't be inlined
2072template <bool isForwardPass, bool isBorder, typename OutType, typename InType, typename SIMD_Type>
2073static /*FORCE_INLINE*/ void Convolve1DVertical(SimpleImage<OutType> out,
2074 SimpleImage<InType> in,
2075 double *borderValues,
2076 ssize_t yStart, ssize_t yEnd, ssize_t width, ssize_t height,
2077 SIMD_Type *vCoefficients, double M[N * N])
2078{
2079#if 0
2080 Convolve1DVerticalRef<isForwardPass, isBorder>(out,
2081 in,
2082 borderValues,
2083 yStart, yEnd, width, height,
2084 vCoefficients, M);
2085 return;
2086#endif
2087
2088 const ssize_t yStep = isForwardPass ? 1 : -1,
2089 SIMD_WIDTH = 4;
2090 ssize_t x = 0;
2091#ifdef __AVX__
2092 for ( ; x <= width - SIMD_WIDTH; x += SIMD_WIDTH)
2093 {
2094 __m256d prevOut[N];
2095 ssize_t y = yStart;
2096
2097 if (isBorder && !isForwardPass)
2098 {
2099 // condition: yStart must be height - 1
2100 double u[N + 1][SIMD_WIDTH]; //[x][channels]
2101 for (ssize_t i = 0; i < N + 1; ++i)
2102 {
2103 __m256d temp;
2104 _mm256_storeu_pd(u[i], LoadDoubles(temp, &in[yStart + i * yStep][x]));
2105 }
2106 double backwardsInitialState[N][SIMD_WIDTH];
2107 calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState);
2108 for (ssize_t i = 0; i < N; ++i)
2109 LoadDoubles(prevOut[i], backwardsInitialState[i]);
2110
2111 StoreDoubles(&out[y][x], prevOut[0]);
2112
2113 y += yStep;
2114 if (y == yEnd)
2115 continue;
2116 }
2117 else if (isBorder && isForwardPass)
2118 {
2119 // condition: yStart must be 0
2120 __m256d firstPixel;
2121 LoadDoubles(firstPixel, &in[0][x]);
2122 for (ssize_t i = 0; i < N; ++i)
2123 prevOut[i] = firstPixel;
2124 }
2125 else
2126 {
2127 for (ssize_t i = 0; i < N; ++i)
2128 LoadDoubles(prevOut[i], &out[yStart - (i + 1) * yStep][x]);
2129 }
2130
2131 do
2132 {
2133 __m256d vIn;
2134 LoadDoubles(vIn, &in[y][x]);
2135 __m256d vSum = vIn * vCoefficients[0];
2136
2137 vSum = prevOut[0] * vCoefficients[1]
2138 + prevOut[1] * vCoefficients[2]
2139 + prevOut[2] * vCoefficients[3]
2140 + vSum;
2141
2142 StoreDoubles(&out[y][x], vSum);
2143
2144 prevOut[2] = prevOut[1];
2145 prevOut[1] = prevOut[0];
2146 prevOut[0] = vSum;
2147 y += yStep;
2148 } while (y != yEnd);
2149 }
2150#endif
2151 {
2152 const ssize_t SIMD_WIDTH = 2;
2153 for (; x < width; x += SIMD_WIDTH)
2154 {
2155 __m128d prevOut[N];
2156 ssize_t y = yStart;
2157
2158 if (isBorder && !isForwardPass)
2159 {
2160 // condition: yStart must be height - 1
2161 double u[N + 1][SIMD_WIDTH]; //[x][channels]
2162 for (ssize_t i = 0; i < N + 1; ++i)
2163 {
2164 __m128d temp;
2165 _mm_storeu_pd(u[i], LoadDoubles(temp, &in[yStart + i * yStep][x]));
2166 }
2167 double backwardsInitialState[N][SIMD_WIDTH];
2168 calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState);
2169 for (ssize_t i = 0; i < N; ++i)
2170 LoadDoubles(prevOut[i], backwardsInitialState[i]);
2171
2172 StoreDoubles<true>(&out[y][x], prevOut[0], min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores
2173
2174 y += yStep;
2175 if (y == yEnd)
2176 continue;
2177 }
2178 else if (isBorder && isForwardPass)
2179 {
2180 // condition: yStart must be 0
2181 __m128d firstPixel;
2182 LoadDoubles(firstPixel, &in[0][x]);
2183 for (ssize_t i = 0; i < N; ++i)
2184 prevOut[i] = firstPixel;
2185 }
2186 else
2187 {
2188 for (ssize_t i = 0; i < N; ++i)
2189 LoadDoubles(prevOut[i], &out[yStart - (i + 1) * yStep][x]);
2190 }
2191
2192 do
2193 {
2194 __m128d vIn;
2195 LoadDoubles(vIn, &in[y][x]);
2196 __m128d vSum = vIn * Cast256To128(vCoefficients[0]);
2197
2198 vSum = prevOut[0] * Cast256To128(vCoefficients[1])
2199 + prevOut[1] * Cast256To128(vCoefficients[2])
2200 + prevOut[2] * Cast256To128(vCoefficients[3])
2201 + vSum;
2202
2203 StoreDoubles<true>(&out[y][x], vSum, min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores
2204
2205 prevOut[2] = prevOut[1];
2206 prevOut[1] = prevOut[0];
2207 prevOut[0] = vSum;
2208 y += yStep;
2209 } while (y != yEnd);
2210 }
2211 }
2212}
2213
2214// hack: GCC fails to compile when FORCE_INLINE on, most likely because OpenMP doesn't generate code using the target defined in #pragma, but the default (SSE2 only), creating 2 incompatible functions that can't be inlined
2215template <bool transposeOut, int channels>
2216static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint8_t> out, SimpleImage<float> in, ssize_t width, ssize_t height)
2217{
2218 ssize_t y = 0;
2219 do
2220 {
2221 ssize_t x = 0;
2222 if (channels == 4)
2223 {
2224 #ifdef __AVX2__
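// Two RGBA float pixels per iteration: convert to int32, then narrow to uint16 and uint8 with
// saturating packs, leaving both packed pixels in the low 8 bytes of vRGBA.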
2225 for (; x <= width - 2; x += 2)
2226 {
2227 __m256i vInt = _mm256_cvtps_epi32(_mm256_loadu_ps(&in[y][x * channels]));
2228 __m256i vInt2 = _mm256_permute2x128_si256(vInt, vInt, 1);
2229
2230 __m128i u16 = _mm256_castsi256_si128(_mm256_packus_epi32(vInt, vInt2));
2231 __m128i vRGBA = _mm_packus_epi16(u16, u16);
2232
2233 if (transposeOut)
2234 {
2235 *(uint32_t *)&out[x][y * channels] = _mm_extract_epi32(vRGBA, 0);
2236 *(uint32_t *)&out[x + 1][y * channels] = _mm_extract_epi32(vRGBA, 1);
2237 }
2238 else
2239 {
2240 _mm_storel_epi64((__m128i *)&out[y][x * channels], vRGBA);
2241 }
2242 }
2243 #endif
2244 while (x < width)
2245 {
2246 __m128 data = _mm_loadu_ps(&in[y][x * channels]);
2247 if (transposeOut)
2248 StoreFloats(&out[x][y * channels], data);
2249 else
2250 StoreFloats(&out[y][x * channels], data);
2251 ++x;
2252 }
2253 }
2254 else if (channels == 1)
2255 {
2256#ifdef __AVX__
2257 for (; x <= width - 8; x += 8)
2258 {
2259 StoreFloats(&out[y][x], _mm256_loadu_ps(&in[y][x]));
2260 }
2261 if (x < width)
2262 StoreFloats<true>(&out[y][x], _mm256_loadu_ps(&in[y][x]), width - x);
2263#else
2264 for (; x <= width - 4; x += 4)
2265 {
2266 StoreFloats(&out[y][x], _mm_loadu_ps(&in[y][x]));
2267 }
2268 if (x < width)
2269 StoreFloats<true>(&out[y][x], _mm_loadu_ps(&in[y][x]), width - x);
2270#endif
2271 }
2272 ++y;
2273 } while (y < height);
2274}
2275
2276// hack: GCC fails to compile when FORCE_INLINE on, most likely because OpenMP doesn't generate code using the target defined in #pragma, but the default (SSE2 only), creating 2 incompatible functions that can't be inlined
2277template <bool transposeOut, int channels>
2278static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint16_t> out, SimpleImage<float> in, ssize_t width, ssize_t height)
2279{
2280 ssize_t y = 0;
2281 do
2282 {
2283 ssize_t x = 0;
2284 if (channels == 4)
2285 {
2286 #ifdef __AVX2__
2287 for (; x <= width - 2; x += 2)
2288 {
2289 __m256i vInt = _mm256_cvtps_epi32(_mm256_loadu_ps(&in[y][x * channels]));
2290 __m256i vInt2 = _mm256_permute2x128_si256(vInt, vInt, 1);
2291
2292 __m128i vRGBA = _mm256_castsi256_si128(_mm256_packus_epi32(vInt, vInt2));
2293
2294 if (transposeOut)
2295 {
2296 _mm_storel_epi64((__m128i *)&out[x][y * channels], vRGBA);
2297 _mm_storel_epi64((__m128i *)&out[x + 1][y * channels], _mm_shuffle_epi32(vRGBA, _MM_SHUFFLE(0, 0, 3, 2)));
2298 }
2299 else
2300 {
2301 _mm_storeu_si128((__m128i *)&out[y][x * channels], vRGBA);
2302 }
2303 }
2304 #endif
2305 while (x < width)
2306 {
2307 __m128 data = _mm_loadu_ps(&in[y][x * channels]);
2308 if (transposeOut)
2309 StoreFloats(&out[x][y * channels], data);
2310 else
2311 StoreFloats(&out[y][x * channels], data);
2312 ++x;
2313 }
2314 }
2315 else if (channels == 1)
2316 {
2317 #ifdef __AVX__
2318 for (; x <= width - 8; x += 8)
2319 {
2320 StoreFloats(&out[y][x], _mm256_loadu_ps(&in[y][x]));
2321 }
2322 if (x < width)
2323 StoreFloats<true>(&out[y][x], _mm256_loadu_ps(&in[y][x]), width - x);
2324 #else
2325 for (; x <= width - 4; x += 4)
2326 {
2327 StoreFloats(&out[y][x], _mm_loadu_ps(&in[y][x]));
2328 }
2329 if (x < width)
2330 StoreFloats<true>(&out[y][x], _mm_loadu_ps(&in[y][x]), width - x);
2331 #endif
2332 }
2333 ++y;
2334 } while (y < height);
2335}
2336
2337// hack: GCC fails to compile when FORCE_INLINE on, most likely because OpenMP doesn't generate code using the target defined in #pragma, but the default (SSE2 only), creating 2 incompatible functions that can't be inlined
2338template <bool transposeOut, int channels>
2339static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint16_t> out, SimpleImage<double> in, ssize_t width, ssize_t height)
2340{
2341 ssize_t y = 0;
2342 do
2343 {
2344 ssize_t x = 0;
2345 if (channels == 4)
2346 {
2347 for ( ; x < width; x++)
2348 {
2349 #ifdef __AVX__
2350 __m128i i32 = _mm256_cvtpd_epi32(_mm256_loadu_pd(&in[y][x * channels]));
2351 #else
2352 __m128d in0 = _mm_load_pd(&in[y][x * channels]),
2353 in1 = _mm_load_pd(&in[y][x * channels + 2]);
2354 __m128i i32 = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(_mm_cvtpd_epi32(in0)), _mm_castsi128_ps(_mm_cvtpd_epi32(in1)), _MM_SHUFFLE(1, 0, 1, 0)));
2355 #endif
2356 #ifdef __SSE4_1__
2357 __m128i vRGBA = _mm_packus_epi32(i32, i32);
2358 #else
2359 __m128i vRGBA = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()); // hack: can get away with i16 for now
2360 #endif
2361
2362 if (transposeOut)
2363 {
2364 _mm_storel_epi64((__m128i *)&out[x][y * channels], vRGBA);
2365 }
2366 else
2367 {
2368 _mm_storel_epi64((__m128i *)&out[y][x * channels], vRGBA);
2369 }
2370 }
2371 }
2372 else if (channels == 1)
2373 {
2374 #ifdef __AVX__
2375 for (; x <= width - 4; x += 4)
2376 {
2377 StoreDoubles(&out[y][x], _mm256_loadu_pd(&in[y][x]));
2378 }
2379 if (x < width)
2380 {
2381 StoreDoubles<true>(&out[y][x], _mm256_loadu_pd(&in[y][x]), width - x);
2382 }
2383 #else
2384 for (; x <= width - 2; x += 2)
2385 {
2386 StoreDoubles(&out[y][x], _mm_loadu_pd(&in[y][x]));
2387 }
2388 if (x < width)
2389 {
2390 StoreDoubles<true>(&out[y][x], _mm_loadu_pd(&in[y][x]), width - x);
2391 }
2392 #endif
2393 }
2394 ++y;
2395 } while (y < height);
2396}
2397
2398template <bool transposeOut, int channels>
2399FORCE_INLINE void Copy2D(SimpleImage<float> out, SimpleImage<double> in, ssize_t width, ssize_t height)
2400{
2401 ssize_t y = 0;
2402 do
2403 {
2404 if (channels == 4)
2405 {
2406 ssize_t x = 0;
2407 do
2408 {
2409 #ifdef __AVX__
2410 __m128 v4f_data = _mm256_cvtpd_ps(_mm256_loadu_pd(&in[y][x * channels]));
2411 #else
2412 __m128 v4f_data = _mm_shuffle_ps(_mm_cvtpd_ps(_mm_loadu_pd(&in[y][x * channels])),
2413 _mm_cvtpd_ps(_mm_loadu_pd(&in[y][x * channels + 2])), _MM_SHUFFLE(1, 0, 1, 0));
2414 #endif
2415 if (transposeOut)
2416 _mm_store_ps(&out[x][y * channels], v4f_data);
2417 else
2418 _mm_store_ps(&out[y][x * channels], v4f_data);
2419 ++x;
2420 } while (x < width);
2421 }
2422 else
2423 {
2424 // 1 channel
2425 ssize_t x;
2426 #ifdef __AVX__
2427 for (x = 0; x <= width - 4; x += 4)
2428 StoreDoubles(&out[y][x], _mm256_loadu_pd(&in[y][x]));
2429 if (x < width)
2430 StoreDoubles<true>(&out[y][x], _mm256_loadu_pd(&in[y][x]), width - x);
2431 #else
2432 for (x = 0; x <= width - 2; x += 2)
2433 StoreDoubles(&out[y][x], _mm_loadu_pd(&in[y][x]));
2434 if (x < width)
2435 StoreDoubles<true>(&out[y][x], _mm_loadu_pd(&in[y][x]), width - x);
2436 #endif
2437 }
2438 ++y;
2439 } while (y < height);
2440}
2441
2442// hack: GCC fails to compile when FORCE_INLINE on, most likely because OpenMP doesn't generate code using the target defined in #pragma, but the default (SSE2 only), creating 2 incompatible functions that can't be inlined
2443template <bool transposeOut, int channels>
2444static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint8_t> out, SimpleImage <double> in, ssize_t width, ssize_t height)
2445{
2446 ssize_t y = 0;
2447 do
2448 {
2449 if (channels == 4)
2450 {
2451 ssize_t x = 0;
2452 do
2453 {
2454 #ifdef __AVX__
2455 __m256d _in = _mm256_load_pd(&in[y][x * channels]);
2456 __m128i i32 = _mm256_cvtpd_epi32(_in),
2457 #else
2458 __m128d in0 = _mm_load_pd(&in[y][x * channels]),
2459 in1 = _mm_load_pd(&in[y][x * channels + 2]);
2460 __m128i i32 = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(_mm_cvtpd_epi32(in0)), _mm_castsi128_ps(_mm_cvtpd_epi32(in1)), _MM_SHUFFLE(1, 0, 1, 0))),
2461 #endif
2462 #ifdef __SSE4_1__
2463 u16 = _mm_packus_epi32(i32, i32),
2464 #else
2465 u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()),
2466 #endif
2467 u8 = _mm_packus_epi16(u16, u16);
2468 if (transposeOut)
2469 *(int32_t *)&out[x][y * channels] = _mm_cvtsi128_si32(u8);
2470 else
2471 *(int32_t *)&out[y][x * channels] = _mm_cvtsi128_si32(u8);
2472 ++x;
2473 } while (x < width);
2474 }
2475 else
2476 {
2477 // 1 channel
2478 ssize_t x;
2479 #ifdef __AVX__
2480 for (x = 0; x <= width - 4; x += 4)
2481 StoreDoubles(&out[y][x], _mm256_load_pd(&in[y][x * channels]));
2482 if (x < width)
2483 StoreDoubles<true>(&out[y][x], _mm256_load_pd(&in[y][x * channels]), width - x);
2484 #else
2485 for (x = 0; x <= width - 2; x += 2)
2486 StoreDoubles(&out[y][x], _mm_load_pd(&in[y][x * channels]));
2487 if (x < width)
2488 StoreDoubles<true>(&out[y][x], _mm_load_pd(&in[y][x * channels]), width - x);
2489 #endif
2490 }
2491 ++y;
2492 } while (y < height);
2493}
2494
2495#if 0 // comment this function out to ensure everything is vectorized
2496template <bool transposeOut, int channels, typename OutType, typename InType>
2497FORCE_INLINE void Copy2D(SimpleImage<OutType> out, SimpleImage <InType> in, ssize_t width, ssize_t height)
2498{
2499 ssize_t y = 0;
2500 do
2501 {
2502 ssize_t x = 0;
2503 do
2504 {
2505 ssize_t c = 0;
2506 do
2507 {
2508 if (transposeOut)
2509 out[x][y * channels + c] = clip_round_cast<OutType, InType>(in[y][x * channels + c]);
2510 else
2511 out[y][x * channels + c] = clip_round_cast<OutType, InType>(in[y][x * channels + c]);
2512 ++c;
2513 } while (c < channels);
2514 ++x;
2515 } while (x < width);
2516 ++y;
2517 } while (y < height);
2518}
2519#endif
2520
2521template <typename AnyType>
2522void StoreFloatsTransposed(SimpleImage<AnyType> out, __m128 x)
2523{
2524 out[0][0] = _mm_cvtss_f32(x);
2525 out[1][0] = _mm_cvtss_f32(_mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 0, 0, 1)));
2526 out[2][0] = _mm_cvtss_f32(_mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 0, 0, 2)));
2527 out[3][0] = _mm_cvtss_f32(_mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 0, 0, 3)));
2528}
2529
2530// input & output are color interleaved
2531template <bool transposeOut, int channels, typename IntermediateType, typename OutType, typename InType>
2532void ConvolveHorizontal(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput = false)
2533{
2534 const bool convertOutput = typeid(OutType) != typeid(IntermediateType);
2535
2536 typedef typename MyTraits<IntermediateType>::SIMDtype SIMDtype;
2537
2538 double bf[N];
2539 double M[N*N]; // matrix used for initialization procedure (has to be double)
2540 double b[N + 1];
2541
2542 calcFilter(sigmaX, bf);
2543
2544 for (size_t i = 0; i<N; i++)
2545 bf[i] = -bf[i];
2546
2547 b[0] = 1; // b[0] == alpha (scaling coefficient)
2548
2549 for (size_t i = 0; i<N; i++)
2550 {
2551 b[i + 1] = bf[i];
2552 b[0] -= b[i + 1];
2553 }
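// By construction b[0] + b[1] + b[2] + b[3] == 1, i.e. the recursive filter has unit DC gain, so
// flat regions keep their value after both passes.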
2554
2555 calcTriggsSdikaM(bf, M); // Compute initialization matrix
2556
2557 SIMDtype *vCoefficients;
2558
2559 if (channels == 4)
2560 {
2561 vCoefficients = (SIMDtype *)ALIGNED_ALLOCA(sizeof(SIMDtype) * (N + 1), sizeof(SIMDtype));
2562 for (ssize_t i = 0; i <= N; ++i)
2563 BroadcastSIMD(vCoefficients[i], (IntermediateType)b[i]);
2564 }
2565 else
2566 {
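// Single-channel path: the kernels above evaluate the recurrence as a dot product against a
// register laid out as [in, out[x-3], out[x-2], out[x-1]] (forward) or
// [out[x+1], out[x+2], out[x+3], in] (backward), so the coefficients are stored pre-permuted:
// four rotations for the forward pass and four for the backward pass.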
2567 if (typeid(IntermediateType) == typeid(double))
2568 {
2569 #ifdef __AVX2__
2570 __m256d *_vCoefficients = (__m256d *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 2 * (N + 1), sizeof(SIMDtype));
2571 vCoefficients = (SIMDtype *)_vCoefficients;
2572
2573 __m256d temp = _mm256_loadu_pd(b);
2574 _vCoefficients[0] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(1, 2, 3, 0));
2575 _vCoefficients[1] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(2, 3, 0, 1));
2576 _vCoefficients[2] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(3, 0, 1, 2));
2577 _vCoefficients[3] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(0, 1, 2, 3));
2578
2579 // permutations for backward pass
2580 _vCoefficients[4] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(0, 3, 2, 1));
2581 _vCoefficients[5] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(1, 0, 3, 2));
2582 _vCoefficients[6] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(2, 1, 0, 3));
2583 _vCoefficients[7] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(3, 2, 1, 0));
2584 #else
2585 double *coefficients = (double *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 4 * (N + 1), sizeof(SIMDtype));
2586 vCoefficients = (SIMDtype *)coefficients;
2587
2588 coefficients[0] = b[0];
2589 coefficients[1] = b[3];
2590 coefficients[2] = b[2];
2591 coefficients[3] = b[1];
2592
2593 coefficients[4] = b[1];
2594 coefficients[5] = b[0];
2595 coefficients[6] = b[3];
2596 coefficients[7] = b[2];
2597
2598 coefficients[8] = b[2];
2599 coefficients[9] = b[1];
2600 coefficients[10] = b[0];
2601 coefficients[11] = b[3];
2602
2603 coefficients[12] = b[3];
2604 coefficients[13] = b[2];
2605 coefficients[14] = b[1];
2606 coefficients[15] = b[0];
2607
2608 // permutations for backward pass
2609 coefficients[16] = b[1];
2610 coefficients[17] = b[2];
2611 coefficients[18] = b[3];
2612 coefficients[19] = b[0];
2613
2614 coefficients[20] = b[0];
2615 coefficients[21] = b[1];
2616 coefficients[22] = b[2];
2617 coefficients[23] = b[3];
2618
2619 coefficients[24] = b[3];
2620 coefficients[25] = b[0];
2621 coefficients[26] = b[1];
2622 coefficients[27] = b[2];
2623
2624 coefficients[28] = b[2];
2625 coefficients[29] = b[3];
2626 coefficients[30] = b[0];
2627 coefficients[31] = b[1];
2628 #endif
2629 }
2630 else
2631 {
2632 #ifdef __AVX__
2633 __m256 *_vCoefficients = (__m256 *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 2 * (N + 1), sizeof(SIMDtype));
2634 vCoefficients = (SIMDtype *)_vCoefficients;
2635
2636 __m256 temp = _mm256_castps128_ps256(_mm256_cvtpd_ps(_mm256_loadu_pd(b)));
2637 temp = _mm256_permute2f128_ps(temp, temp, 0);
2638 _vCoefficients[0] = _mm256_permute_ps(temp, _MM_SHUFFLE(1, 2, 3, 0));
2639 _vCoefficients[1] = _mm256_permute_ps(temp, _MM_SHUFFLE(2, 3, 0, 1));
2640 _vCoefficients[2] = _mm256_permute_ps(temp, _MM_SHUFFLE(3, 0, 1, 2));
2641 _vCoefficients[3] = _mm256_permute_ps(temp, _MM_SHUFFLE(0, 1, 2, 3));
2642
2643 // permutations for backward pass
2644 _vCoefficients[4] = _mm256_permute_ps(temp, _MM_SHUFFLE(0, 3, 2, 1));
2645 _vCoefficients[5] = _mm256_permute_ps(temp, _MM_SHUFFLE(1, 0, 3, 2));
2646 _vCoefficients[6] = _mm256_permute_ps(temp, _MM_SHUFFLE(2, 1, 0, 3));
2647 _vCoefficients[7] = _mm256_permute_ps(temp, _MM_SHUFFLE(3, 2, 1, 0));
2648 #else
2649 __m128 *_vCoefficients = (__m128 *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 2 * (N + 1), sizeof(SIMDtype));
2650 vCoefficients = (SIMDtype *)_vCoefficients;
2651
2652 __m128 temp = _mm_shuffle_ps
2653 (
2654 _mm_cvtpd_ps(_mm_loadu_pd(b)),
2655 _mm_cvtpd_ps(_mm_loadu_pd(&b[2])),
2656 _MM_SHUFFLE(1, 0, 1, 0)
2657 );
2658 _vCoefficients[0] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(1, 2, 3, 0));
2659 _vCoefficients[1] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(2, 3, 0, 1));
2660 _vCoefficients[2] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(3, 0, 1, 2));
2661 _vCoefficients[3] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(0, 1, 2, 3));
2662
2663 // permutations for backward pass
2664 _vCoefficients[4] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(0, 3, 2, 1));
2665 _vCoefficients[5] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(1, 0, 3, 2));
2666 _vCoefficients[6] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(2, 1, 0, 3));
2667 _vCoefficients[7] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(3, 2, 1, 0));
2668 #endif
2669 }
2670 }
2671 const ssize_t Y_BLOCK_SIZE = 8;
2672
2673 // X_BLOCK_SIZE * channels * sizeof(InType) had better be SIMD aligned
2674 const ssize_t X_BLOCK_SIZE = transposeOut ? 8
2675 : INT32_MAX / 2;
2676
2677 #pragma omp parallel
2678 {
2679 AlignedImage<IntermediateType, sizeof(SIMDtype)> forwardFilteredTemp; // TODO: can directly output to output buffer if !transposeOut & output type is float
2680 SimpleImage<IntermediateType> forwardFiltered;
2681 if (!convertOutput && !transposeOut)
2682 {
2683 forwardFiltered = SimpleImage<IntermediateType>((IntermediateType *)out.buffer, out.pitch);
2684 }
2685 /*
2686 else if (canOverwriteInput && typeid(InType) == typeid(IntermediateType))
2687 {
2688 forwardFiltered = SimpleImage<IntermediateType>((IntermediateType *)in.buffer, in.pitch);
2689 }*/
2690 else
2691 {
2692 forwardFilteredTemp.Resize(width * channels, Y_BLOCK_SIZE);
2693 forwardFiltered = forwardFilteredTemp;
2694 }
2695
2696 IntermediateType *borderValues = (IntermediateType *)alloca(channels * Y_BLOCK_SIZE * sizeof(IntermediateType));
2697
2698 #pragma omp for
2699 for (ssize_t y0 = 0; y0 < height; y0 += Y_BLOCK_SIZE)
2700 {
2701 ssize_t x = 0;
2702 ssize_t yBlockSize = min(height - y0, Y_BLOCK_SIZE);
2703
2704 ssize_t i = 0;
2705 do
2706 {
2707 ssize_t color = 0;
2708 do
2709 {
2710 borderValues[i * channels + color] = in[y0 + i][(width - 1) * channels + color];
2711 ++color;
2712 } while (color < channels);
2713 ++i;
2714 } while (i < yBlockSize);
2715
2716 ssize_t xBlockSize = min(max(X_BLOCK_SIZE, ssize_t(N)), width); // try to process at least X_BLOCK_SIZE or else later, data won't be SIMD aligned
2717 // convolve pixels[0:FILTER_SIZE - 1]
2718 Convolve1DHorizontal<false, true, true, channels>(forwardFiltered,
2719 in.SubImage(0, y0),
2720 (IntermediateType *)NULL,
2721 x, x + xBlockSize, width, yBlockSize,
2722 vCoefficients,
2723 M);
2724
2725 x += xBlockSize;
2726 while (x < width)
2727 {
2728 xBlockSize = min(width - x, X_BLOCK_SIZE);
2729
2730 Convolve1DHorizontal<false, true, false, channels>(forwardFiltered,
2731 in.SubImage(0, y0),
2732 (IntermediateType *)NULL,
2733 x, x + xBlockSize, width, yBlockSize,
2734 vCoefficients,
2735 M);
2736 x += xBlockSize;
2737 }
2738
2739 //--------------- backward pass--------------------------
2740 SimpleImage <IntermediateType> floatOut;
2741 // if output type is fixed point, we still compute an intermediate result as float for better precision
2742 if (convertOutput)
2743 {
2744 floatOut = forwardFiltered;
2745 }
2746 else
2747 {
2748 floatOut = SimpleImage<IntermediateType>((IntermediateType *)&out[transposeOut ? 0 : y0][(transposeOut ? y0 : 0) * channels], out.pitch);
2749 }
2750 x = width - 1;
2751
2752 ssize_t lastAligned = RoundDown(width, X_BLOCK_SIZE);
2753 // todo: check whether this is really vector aligned
2754 xBlockSize = min(max(width - lastAligned, ssize_t(N)), width); // try to process more than N pixels so that later, data is SIMD aligned
2755
2756 // in-place operation (use forwardFiltered as both input & output) is possible due to internal register buffering
2757 if (transposeOut && !convertOutput)
2758 Convolve1DHorizontal<true, false, true, channels>(floatOut,
2759 forwardFiltered,
2760 borderValues,
2761 x, x - xBlockSize, width, yBlockSize,
2762 vCoefficients,
2763 M);
2764 else
2765 Convolve1DHorizontal<false, false, true, channels>(floatOut,
2766 forwardFiltered,
2767 borderValues,
2768 x, x - xBlockSize, width, yBlockSize,
2769 vCoefficients,
2770 M);
2771
2772 if (convertOutput)
2773 {
2774 ssize_t outCornerX = x + 1 - xBlockSize;
2775 Copy2D<transposeOut, channels>(out.SubImage((transposeOut ? y0 : outCornerX) * channels, transposeOut ? outCornerX : y0), floatOut.SubImage(outCornerX * channels, 0), xBlockSize, yBlockSize);
2776 }
2777 x -= xBlockSize;
2778 while (x >= 0)
2779 {
2780 xBlockSize = min(X_BLOCK_SIZE, x + 1);
2781
2782 if (transposeOut && !convertOutput)
2783 Convolve1DHorizontal<true, false, false, channels>(floatOut,
2784 forwardFiltered,
2785 borderValues,
2786 x, x - xBlockSize, width, yBlockSize,
2787 vCoefficients,
2788 M);
2789 else
2790 Convolve1DHorizontal<false, false, false, channels>(floatOut,
2791 forwardFiltered,
2792 borderValues,
2793 x, x - xBlockSize, width, yBlockSize,
2794 vCoefficients,
2795 M);
2796
2797 if (convertOutput)
2798 {
2799 ssize_t outCornerX = x + 1 - xBlockSize;
2800 Copy2D<transposeOut, channels>(out.SubImage((transposeOut ? y0 : outCornerX) * channels, transposeOut ? outCornerX : y0), floatOut.SubImage(outCornerX * channels, 0), xBlockSize, yBlockSize);
2801 }
2802 x -= xBlockSize;
2803 }
2804 }
2805 }
2806}
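// Illustrative sketch (not part of the patch): the pass above implements the van Vliet/Young/Verbeek
// recursive Gaussian with N = 3 feedback taps as a causal forward pass followed by an anti-causal
// backward pass; per channel it is roughly
//     forward:  w[n] = b0*x[n] + b1*w[n-1] + b2*w[n-2] + b3*w[n-3]
//     backward: y[n] = b0*w[n] + b1*y[n+1] + b2*y[n+2] + b3*y[n+3]
// with the backward pass seeded at the right border from the Triggs-Sdika matrix M and the saved
// border samples (borderValues). The permuted coefficient vectors built above let the SIMD code
// advance this recurrence for several lanes (channels/pixels) per iteration.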
2807
2808template <bool transposeOut, int channels, typename OutType, typename InType>
2809void ConvolveHorizontal(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput = false)
2810{
2811 if (sigmaX > MAX_SIZE_FOR_SINGLE_PRECISION)
2812 ConvolveHorizontal<transposeOut, channels, double, OutType, InType>(out, in, width, height, sigmaX, canOverwriteInput);
2813 else
2814 ConvolveHorizontal<transposeOut, channels, float, OutType, InType>(out, in, width, height, sigmaX, canOverwriteInput);
2815}
2816
2817// handles blocking
2818// input & output are color interleaved
2819template <typename IntermediateType, typename OutType, typename InType>
2820void ConvolveVertical(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, double sigmaY)
2821{
2822 const bool convertOutput = typeid(OutType) != typeid(IntermediateType);
2823
2824 const ssize_t Y_BLOCK_SIZE = 8,
2825 X_BLOCK_SIZE = 40; // must be multiple of SIMD width or else say goodbye to throughput
2826
2827 typedef typename MyTraits<IntermediateType>::SIMDtype SIMDtype;
2828
2829 double bf[N];
2830 double M[N*N]; // matrix used for initialization procedure (has to be double)
2831 double b[N + 1];
2832
2833 calcFilter(sigmaY, bf);
2834
2835 for (size_t i = 0; i<N; i++)
2836 bf[i] = -bf[i];
2837
2838 b[0] = 1; // b[0] == alpha (scaling coefficient)
2839
2840 for (size_t i = 0; i<N; i++)
2841 {
2842 b[i + 1] = bf[i];
2843 b[0] -= b[i + 1];
2844 }
2845 b[3] = 1 - (b[0] + b[1] + b[2]);
2846 // Compute initialization matrix
2847 calcTriggsSdikaM(bf, M);
2848
2849 SIMDtype *vCoefficients = (SIMDtype *)ALIGNED_ALLOCA(sizeof(SIMDtype) * (N + 1), sizeof(SIMDtype));
2850
2851 for (ssize_t i = 0; i <= N; ++i)
2852 {
2853 BroadcastSIMD(vCoefficients[i], (IntermediateType)b[i]);
2854 }
2855
2856 #pragma omp parallel
2857 {
2858 AlignedImage<IntermediateType, sizeof(SIMDtype)> forwardFiltered; // TODO: can directly output to output buffer if output type is float
2859 forwardFiltered.Resize(X_BLOCK_SIZE, height);
2860
2861 IntermediateType *borderValues = (IntermediateType *)alloca(X_BLOCK_SIZE * sizeof(IntermediateType));
2862
2863 #pragma omp for
2864 for (ssize_t x0 = 0; x0 < width; x0 += X_BLOCK_SIZE)
2865 {
2866 ssize_t y = 0;
2867 ssize_t xBlockSize = min(width - x0, X_BLOCK_SIZE);
2868
2869 ssize_t i = 0;
2870 do
2871 {
2872 borderValues[i] = in[height - 1][x0 + i];
2873 ++i;
2874 } while (i < xBlockSize);
2875
2876 ssize_t yBlockSize = min(ssize_t(N), height);
2877 // convolve pixels[0:filterSize - 1]
2878 Convolve1DVertical<true, true>(forwardFiltered,
2879 in.SubImage(x0, 0),
2880 (IntermediateType *)NULL,
2881 y, y + yBlockSize, xBlockSize, height,
2882 vCoefficients,
2883 M);
2884
2885 y += yBlockSize;
2886 while (y < height)
2887 {
2888 yBlockSize = min(height - y, Y_BLOCK_SIZE);
2889
2890 Convolve1DVertical<true, false>(forwardFiltered,
2891 in.SubImage(x0, 0),
2892 (IntermediateType *)NULL,
2893 y, y + yBlockSize, xBlockSize, height,
2894 vCoefficients,
2895 M);
2896 y += yBlockSize;
2897 }
2898
2899 //--------------- backward pass--------------------------
2900 SimpleImage<IntermediateType> floatOut;
2901 // if output type is fixed point, we still compute an intermediate result as float for better precision
2902 if (convertOutput)
2903 {
2904 floatOut = forwardFiltered;
2905 }
2906 else
2907 {
2908 floatOut = SimpleImage<IntermediateType>((IntermediateType *)&out[0][x0], out.pitch);
2909 }
2910 y = height - 1;
2911 yBlockSize = min(ssize_t(N), height);
2912
2913 // in-place operation (use forwardFiltered as both input & output) is possible due to internal register buffering
2914 Convolve1DVertical<false, true>(floatOut,
2915 forwardFiltered,
2916 borderValues,
2917 y, y - yBlockSize, xBlockSize, height,
2918 vCoefficients,
2919 M);
2920
2921 if (convertOutput)
2922 {
2923 ssize_t outCornerY = y + 1 - yBlockSize;
2924 Copy2D<false, 1>(out.SubImage(x0, outCornerY), floatOut.SubImage(0, outCornerY), xBlockSize, yBlockSize);
2925 }
2926 y -= yBlockSize;
2927 while (y >= 0)
2928 {
2929 yBlockSize = min(Y_BLOCK_SIZE, y + 1);
2930
2931 Convolve1DVertical<false, false>(floatOut,
2932 forwardFiltered,
2933 borderValues,
2934 y, y - yBlockSize, xBlockSize, height,
2935 vCoefficients,
2936 M);
2937
2938 if (convertOutput)
2939 {
2940 ssize_t outCornerY = y + 1 - yBlockSize;
2941 Copy2D<false, 1>(out.SubImage(x0, outCornerY), floatOut.SubImage(0, outCornerY), xBlockSize, yBlockSize);
2942 }
2943 y -= yBlockSize;
2944 }
2945 }
2946 }
2947}
2948
2949template <typename OutType, typename InType>
2950void ConvolveVertical(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaY)
2951{
2952 if (sigmaY > MAX_SIZE_FOR_SINGLE_PRECISION)
2953 ConvolveVertical<double>(out, in, width, height, sigmaY);
2954 else
2955 ConvolveVertical<float>(out, in, width, height, sigmaY);
2956}
2957
2958// 2D
2959template <int channels, typename OutType, typename InType>
2960void Convolve(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY)
2961{
2962 using namespace std::chrono;
2963 typedef uint16_t HorizontalFilteredType;
2964 AlignedImage<HorizontalFilteredType, sizeof(__m256)> horizontalFiltered;
2965
2966 const bool DO_TIMING = false;
2967 high_resolution_clock::time_point t0;
2968 if (DO_TIMING)
2969 t0 = high_resolution_clock::now();
2970
2971 const bool TRANSPOSE = channels != 1; // when true, both the 1st and 2nd passes transpose their output
2972
2973 if (TRANSPOSE)
2974 {
2975 horizontalFiltered.Resize(height * channels, width);
2976 }
2977 else
2978 {
2979 horizontalFiltered.Resize(width * channels, height);
2980 }
2981 ConvolveHorizontal<TRANSPOSE, channels>(horizontalFiltered, in, width, height, sigmaX);
2982
2983#if 0
2984 // save intermediate image
2985 float scale;
2986 if (typeid(InType) == typeid(uint8_t))
2987 scale = 1.0f;
2988 else if (typeid(InType) == typeid(uint16_t))
2989 scale = 1.0f;
2990 else
2991 scale = 1.0f;
2992 SaveImage("horizontal_filtered.png", horizontalFiltered, TRANSPOSE ? height : width, TRANSPOSE ? width : height, channels, scale);
2993#endif
2994
2995 if (DO_TIMING)
2996 cout << "Thoriz=" << duration_cast<milliseconds>(high_resolution_clock::now() - t0).count() << " ms" << endl;
2997
2998 //---------------------------------------------------
2999
3000 if (DO_TIMING)
3001 t0 = high_resolution_clock::now();
3002 if (TRANSPOSE)
3003 {
3004 ConvolveHorizontal<true, channels>(out, horizontalFiltered, height, width, sigmaY, true);
3005 }
3006 else
3007 {
3008 ConvolveVertical(out, horizontalFiltered, width * channels, height, sigmaY);
3009 }
3010 if (DO_TIMING)
3011 cout << "Tvert=" << duration_cast<milliseconds>(high_resolution_clock::now() - t0).count() << " ms" << endl;
3012}
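// Illustrative sketch (not part of the patch): for multi-channel images the 2D blur above avoids a
// strided vertical pass entirely; both passes call the horizontal filter and write their output
// transposed, roughly
//     tmp = blur_rows(in);    // tmp holds the transposed result (height x width)
//     out = blur_rows(tmp);   // rows of tmp are the columns of in; the result is transposed back
// so every load and store streams along contiguous rows, which is what the SIMD kernels want.
// Grayscale keeps a true vertical pass, whose row-at-a-time access is already contiguous.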
3013
3014#ifndef __SSSE3__
3015 #define DO_FIR_IN_FLOAT // without mulhrs_epi16, int16 not competitive
3016#else
3017 #undef DO_FIR_IN_FLOAT
3018#endif
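// Illustrative sketch (not part of the patch): when DO_FIR_IN_FLOAT is not defined, the FIR path
// runs in int16 fixed point: coefficients are quantized to Q15 (filter[i] * 32768 below) and 8-bit
// samples are pre-shifted left by 6 bits, so each tap is a rounded Q15 multiply. Per lane,
// _mm_mulhrs_epi16 behaves like this scalar model:
static inline int16_t mulhrs_scalar_model(int16_t a, int16_t b)
{
    // round((a * b) / 2^15), computed the way the SSSE3 instruction does: shift by 14, add 1, shift by 1
    return (int16_t)((((int32_t)a * (int32_t)b >> 14) + 1) >> 1);
}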
3019
3020#ifdef DO_FIR_IN_FLOAT
3021
3022// in-place (out = in) operation not allowed
3023template <int channels, bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type>
3024void ConvolveHorizontalFIR(SimpleImage<OutType> out, SimpleImage<InType> in,
3025 ssize_t width, ssize_t height,
3026 ssize_t xStart, ssize_t xEnd,
3027 SIMD_Type *vFilter, int filterSize) // filterSize assumed to be odd
3028{
3029 if (channels == 4)
3030 {
3031 ssize_t y = 0;
3032 do
3033 {
3034 ssize_t x = xStart;
3035 #ifdef __AVX__
3036 const ssize_t SIMD_WIDTH = 8,
3037 PIXELS_PER_ITERATION = SIMD_WIDTH / channels;
3038 __m256 vSum,
3039 leftBorderValue, rightBorderValue;
3040 if (onBorder)
3041 {
3042 __m128 temp;
3043 LoadFloats(temp, &in[y][0 * channels]);
3044 leftBorderValue = _mm256_setr_m128(temp, temp);
3045
3046 LoadFloats(temp, &in[y][(width - 1) * channels]);
3047 rightBorderValue = _mm256_setr_m128(temp, temp);
3048 }
3049 goto middle;
3050
3051 do
3052 {
3053 StoreFloats(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum);
3054 middle:
3055 if (symmetric)
3056 {
3057 __m256 vIn;
3058 vSum = vFilter[0] * LoadFloats(vIn, &in[y][x * channels]);
3059 for (ssize_t i = 1; i <= filterSize; ++i)
3060 {
3061 __m256 filter = vFilter[i];
3062 ssize_t srcX = x - i;
3063 if (onBorder)
3064 srcX = max(-(PIXELS_PER_ITERATION - 1), srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3065
3066 __m256 leftNeighbor, rightNeighbor;
3067 LoadFloats(leftNeighbor, &in[y][srcX * channels]);
3068 srcX = x + i;
3069 if (onBorder)
3070 srcX = min(width - 1, srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3071
3072 LoadFloats(rightNeighbor, &in[y][srcX * channels]);
3073 if (onBorder)
3074 {
3075 __m256i notPastEnd = PartialVectorMask32(min(PIXELS_PER_ITERATION, max(ssize_t(0), width - i - x)) * channels * sizeof(float)),
3076 beforeBeginning = PartialVectorMask32(min(PIXELS_PER_ITERATION, max(ssize_t(0), i - x)) * channels * sizeof(float));
3077 leftNeighbor = _mm256_blendv_ps(leftNeighbor, leftBorderValue, _mm256_castsi256_ps(beforeBeginning));
3078 rightNeighbor = _mm256_blendv_ps(rightBorderValue, rightNeighbor, _mm256_castsi256_ps(notPastEnd));
3079 }
3080 vSum = vSum + filter * (leftNeighbor + rightNeighbor);
3081 }
3082 }
3083 else
3084 {
3085 vSum = _mm256_setzero_ps();
3086 ssize_t i = 0;
3087 // the smaller & simpler machine code for do-while probably outweighs the cost of the extra add
3088 do
3089 {
3090 ssize_t srcX = x - filterSize / 2 + i;
3091 // todo: border not handled
3092 __m256 vIn;
3093 LoadFloats(vIn, &in[y][srcX * channels]);
3094 vSum = vSum + vFilter[i] * vIn;
3095 ++i;
3096 } while (i < filterSize);
3097 }
3098 x += PIXELS_PER_ITERATION;
3099 } while (x < xEnd);
3100 StoreFloats<true>(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum, (xEnd - (x - PIXELS_PER_ITERATION)) * channels);
3101 #else
3102 do
3103 {
3104 __m128 vSum;
3105 if (symmetric)
3106 {
3107 __m128 vIn;
3108 LoadFloats(vIn, &in[y][x * channels]);
3109 vSum = vFilter[0] * vIn;
3110 for (ssize_t i = 1; i <= filterSize; ++i)
3111 {
3112 ssize_t srcX = x - i;
3113 if (onBorder)
3114 srcX = max(ssize_t(0), srcX);
3115
3116 __m128 leftNeighbor, rightNeighbor;
3117 LoadFloats(leftNeighbor, &in[y][srcX * channels]);
3118
3119 srcX = x + i;
3120 if (onBorder)
3121 srcX = min(width - 1, srcX);
3122 LoadFloats(rightNeighbor, &in[y][srcX * channels]);
3123
3124 vSum = vSum + vFilter[i] * (leftNeighbor + rightNeighbor);
3125 }
3126 }
3127 else
3128 {
3129 vSum = _mm_setzero_ps();
3130 ssize_t i = 0;
3131 do
3132 {
3133 ssize_t srcX = x - filterSize / 2 + i;
3134 if (onBorder)
3135 srcX = min(width - 1, max(ssize_t(0), srcX));
3136
3137 __m128 vIn;
3138 LoadFloats(vIn, &in[y][srcX * channels]);
3139 vSum = vSum + vFilter[i] * vIn;
3140 ++i;
3141 } while (i < filterSize);
3142 }
3143 StoreFloats(&out[y][x * channels], vSum);
3144 ++x;
3145 } while (x < xEnd);
3146
3147 #endif
3148 ++y;
3149 } while (y < height);
3150 }
3151 else
3152 {
3153 // 1 channel
3154 // todo: can merge with 4 channel?
3155 ssize_t y = 0;
3156 do
3157 {
3158 ssize_t x = xStart;
3159 #ifdef __AVX__
3160 const ssize_t SIMD_WIDTH = 8;
3161
3162 __m256 leftBorderValue, rightBorderValue;
3163 if (onBorder)
3164 {
3165 leftBorderValue = _mm256_set1_ps(in[y][0 * channels]);
3166 rightBorderValue = _mm256_set1_ps(in[y][(width - 1) * channels]);
3167 }
3168 __m256 vSum;
3169 // note: the goto below can trash performance through basic block reordering
3170 goto middle2;
3171 do
3172 {
3173 // write out values from previous iteration
3174 StoreFloats(&out[y][(x - SIMD_WIDTH) * channels], vSum);
3175 middle2:
3176 if (symmetric)
3177 {
3178 __m256 vIn;
3179 vSum = vFilter[0] * LoadFloats(vIn, &in[y][x * channels]);
3180 for (ssize_t i = 1; i <= filterSize; ++i)
3181 {
3182 __m256 filter = vFilter[i];
3183
3184 ssize_t srcX = x - i;
3185 if (onBorder)
3186 srcX = max(-(SIMD_WIDTH - 1), srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3187
3188 __m256 leftNeighbor, rightNeighbor;
3189 LoadFloats(leftNeighbor, &in[y][srcX * channels]);
3190 srcX = x + i;
3191 if (onBorder)
3192 srcX = min(width - 1, srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3193 LoadFloats(rightNeighbor, &in[y][srcX * channels]);
3194
3195 if (onBorder)
3196 {
3197 __m256i notPastEnd = PartialVectorMask32(min(SIMD_WIDTH, max(ssize_t(0), width - i - x)) * sizeof(float)),
3198 beforeBeginning = PartialVectorMask32(min(SIMD_WIDTH, max(ssize_t(0), i - x)) * sizeof(float));
3199 leftNeighbor = _mm256_blendv_ps(leftNeighbor, leftBorderValue, _mm256_castsi256_ps(beforeBeginning));
3200 rightNeighbor = _mm256_blendv_ps(rightBorderValue, rightNeighbor, _mm256_castsi256_ps(notPastEnd));
3201 }
3202 vSum = vSum + filter * (leftNeighbor + rightNeighbor);
3203 }
3204 }
3205 else
3206 {
3207 vSum = _mm256_setzero_ps();
3208 ssize_t i = 0;
3209 // the smaller & simpler machine code for do-while probably outweighs the cost of the extra add
3210 do
3211 {
3212 ssize_t srcX = x - filterSize / 2 + i;
3213 // todo: border not handled
3214 __m256 vIn;
3215 LoadFloats(vIn, &in[y][srcX * channels]);
3216 vSum = vSum + vFilter[i] * vIn;
3217 ++i;
3218 } while (i < filterSize);
3219 }
3220 x += SIMD_WIDTH;
3221 } while (x < xEnd);
3222 StoreFloats<true>(&out[y][(x - SIMD_WIDTH) * channels], vSum, xEnd - (x - SIMD_WIDTH));
3223 #else
3224 // SSE only
3225 const ssize_t SIMD_WIDTH = 4;
3226
3227 __m128 leftBorderValue, rightBorderValue;
3228 if (onBorder)
3229 {
3230 leftBorderValue = _mm_set1_ps(in[y][0 * channels]);
3231 rightBorderValue = _mm_set1_ps(in[y][(width - 1) * channels]);
3232 }
3233 __m128 vSum;
3234 // note: the goto below can trash performance through basic block reordering
3235 goto middle2;
3236 do
3237 {
3238 // write out values from previous iteration
3239 StoreFloats(&out[y][(x - SIMD_WIDTH) * channels], vSum);
3240 middle2:
3241 if (symmetric)
3242 {
3243 __m128 vIn;
3244 vSum = vFilter[0] * LoadFloats(vIn, &in[y][x * channels]);
3245 for (ssize_t i = 1; i <= filterSize; ++i)
3246 {
3247 __m128 filter = vFilter[i];
3248
3249 ssize_t srcX = x - i;
3250 if (onBorder)
3251 srcX = max(-(SIMD_WIDTH - 1), srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3252
3253 __m128 leftNeighbor, rightNeighbor;
3254 LoadFloats(leftNeighbor, &in[y][srcX * channels]);
3255 srcX = x + i;
3256 if (onBorder)
3257 srcX = min(width - 1, srcX); // hack: do this for now until LoadFloats is modified to support partial loads
3258 LoadFloats(rightNeighbor, &in[y][srcX * channels]);
3259
3260 if (onBorder)
3261 {
3262 __m128i notPastEnd = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), width - i - x)) * sizeof(float)),
3263 beforeBeginning = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), i - x)) * sizeof(float));
3264 leftNeighbor = Select(leftNeighbor, leftBorderValue, _mm_castsi128_ps(beforeBeginning));
3265 rightNeighbor = Select(rightBorderValue, rightNeighbor, _mm_castsi128_ps(notPastEnd));
3266 }
3267 vSum = vSum + filter * (leftNeighbor + rightNeighbor);
3268 }
3269 }
3270 else
3271 {
3272 vSum = _mm_setzero_ps();
3273 ssize_t i = 0;
3274 // the smaller & simpler machine code for do-while probably outweighs the cost of the extra add
3275 do
3276 {
3277 ssize_t srcX = x - filterSize / 2 + i;
3278 // todo: border not handled
3279 __m128 vIn;
3280 LoadFloats(vIn, &in[y][srcX * channels]);
3281 vSum = vSum + vFilter[i] * vIn;
3282 ++i;
3283 } while (i < filterSize);
3284 }
3285 x += SIMD_WIDTH;
3286 } while (x < xEnd);
3287 StoreFloats<true>(&out[y][(x - SIMD_WIDTH) * channels], vSum, xEnd - (x - SIMD_WIDTH));
3288 #endif
3289 ++y;
3290 } while (y < height);
3291 }
3292}
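// Illustrative note (not part of the patch): the 'symmetric' template paths above exploit the
// evenness of the Gaussian kernel and fold the convolution as
//     sum = f[0]*x[i] + sum_{k=1..filterSize} f[k] * (x[i-k] + x[i+k])
// which roughly halves the number of multiplies compared with the generic (non-symmetric) loop.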
3293
3294#else
3295
3296// DO_FIR_IN_INT16
3297
3298// in-place (out = in) operation not allowed
3299template <int channels, bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type>
3300void ConvolveHorizontalFIR(SimpleImage<OutType> out, SimpleImage<InType> in,
3301 ssize_t width, ssize_t height,
3302 ssize_t xStart, ssize_t xEnd,
3303 SIMD_Type *vFilter, int filterSize)
3304{
3305 if (channels == 4)
3306 {
3307 ssize_t y = 0;
3308 do
3309 {
3310 #ifdef __AVX2__
3311 int16_t *convertedIn;
3312 if (typeid(InType) == typeid(int16_t))
3313 {
3314 convertedIn = (int16_t *)&in[y][0];
3315 }
3316 else
3317 {
3318 convertedIn = (int16_t *)ALIGNED_ALLOCA(RoundUp(width * channels, ssize_t(sizeof(__m256i) / sizeof(int16_t))) * sizeof(int16_t), sizeof(__m256i));
3319 for (ssize_t x = 0; x < width * channels; x += 16)
3320 {
3321 __m128i u8 = _mm_loadu_si128((__m128i *)&in[y][x]);
3322 __m256i i16 = _mm256_slli_epi16(_mm256_cvtepu8_epi16(u8), 6);
3323 _mm256_store_si256((__m256i *)&convertedIn[x], i16);
3324 }
3325 }
3326 ssize_t x = xStart;
3327 const ssize_t SIMD_WIDTH = 16,
3328 PIXELS_PER_ITERATION = SIMD_WIDTH / channels;
3329 __m256i vSum;
3330 __m256i leftBorderValue, rightBorderValue;
3331 if (onBorder)
3332 {
3333 __m128i temp = _mm_set1_epi64x(*(int64_t *)&convertedIn[0 * channels]);
3334 leftBorderValue = _mm256_setr_m128i(temp, temp);
3335 temp = _mm_set1_epi64x(*(int64_t *)&convertedIn[(width - 1) * channels]);
3336 rightBorderValue = _mm256_setr_m128i(temp, temp);
3337 }
3338 goto middle2;
3339 do
3340 {
3341 ScaleAndStoreInt16(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum);
3342 middle2:
3343 if (symmetric)
3344 {
3345 __m256i center = _mm256_loadu_si256((__m256i *)&convertedIn[x * channels]);
3346 vSum = _mm256_mulhrs_epi16(vFilter[0], center);
3347 for (ssize_t i = 1; i <= filterSize; ++i)
3348 {
3349 __m256i filter = vFilter[i];
3350
3351 ssize_t srcX = x - i;
3352 if (onBorder)
3353 srcX = max(-(PIXELS_PER_ITERATION - 1), srcX); // todo: use this for now until LoadAndScaleToInt16() supports partial loads
3354
3355 __m256i leftNeighbor, rightNeighbor;
3356 leftNeighbor = _mm256_loadu_si256((__m256i *)&convertedIn[srcX * channels]);
3357
3358 srcX = x + i;
3359 if (onBorder)
3360 srcX = min(width - 1, srcX);
3361 rightNeighbor = _mm256_loadu_si256((__m256i *)&convertedIn[srcX * channels]);
3362
3363 if (onBorder)
3364 {
3365 __m256i leftMask = PartialVectorMask32(min(PIXELS_PER_ITERATION, max(ssize_t(0), i - x)) * channels * sizeof(int16_t)),
3366 rightMask = PartialVectorMask32(min(PIXELS_PER_ITERATION, width - (x + i)) * channels * sizeof(int16_t));
3367 leftNeighbor = _mm256_blendv_epi8(leftNeighbor, leftBorderValue, leftMask);
3368 rightNeighbor = _mm256_blendv_epi8(rightBorderValue, rightNeighbor, rightMask);
3369 }
3370 vSum = _mm256_adds_epi16(vSum, _mm256_mulhrs_epi16(filter, _mm256_adds_epi16(leftNeighbor, rightNeighbor)));
3371 }
3372 }
3373 else
3374 {
3375 throw 0;
3376 }
3377 x += PIXELS_PER_ITERATION;
3378 } while (x < xEnd);
3379 ScaleAndStoreInt16<true>(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum, (xEnd - (x - PIXELS_PER_ITERATION)) * channels);
3380 #else
3381 // SSSE3 only
3382 int16_t *convertedIn;
3383 if (typeid(InType) == typeid(int16_t))
3384 {
3385 convertedIn = (int16_t *)&in[y][0];
3386 }
3387 else
3388 {
3389 convertedIn = (int16_t *)ALIGNED_ALLOCA(RoundUp(width * channels, ssize_t(sizeof(__m128i) / sizeof(int16_t))) * sizeof(int16_t), sizeof(__m128i));
3390 for (ssize_t x = 0; x < width * channels; x += 8)
3391 {
3392 __m128i u8 = _mm_loadl_epi64((__m128i *)&in[y][x]);
3393 __m128i i16 = _mm_slli_epi16(_mm_cvtepu8_epi16(u8), 6);
3394 _mm_store_si128((__m128i *)&convertedIn[x], i16);
3395 }
3396 }
3397 ssize_t x = xStart;
3398 const ssize_t SIMD_WIDTH = 8,
3399 PIXELS_PER_ITERATION = SIMD_WIDTH / channels;
3400 __m128i vSum;
3401 __m128i leftBorderValue, rightBorderValue;
3402 if (onBorder)
3403 {
3404 leftBorderValue = _mm_set1_epi64x(*(int64_t *)&convertedIn[0 * channels]);
3405 rightBorderValue = _mm_set1_epi64x(*(int64_t *)&convertedIn[(width - 1) * channels]);
3406 }
3407 goto middle3;
3408 do
3409 {
3410 ScaleAndStoreInt16(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum);
3411 middle3:
3412 if (symmetric)
3413 {
3414 __m128i center;
3415 vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), LoadAndScaleToInt16(center, &convertedIn[x * channels]));
3416 for (ssize_t i = 1; i <= filterSize; ++i)
3417 {
3418 __m128i filter = Cast256To128(vFilter[i]);
3419
3420 ssize_t srcX = x - i;
3421 if (onBorder)
3422 srcX = max(-(PIXELS_PER_ITERATION - 1), srcX); // todo: use this for now until LoadAndScaleToInt16() supports partial loads
3423
3424 __m128i leftNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]);
3425
3426 srcX = x + i;
3427 if (onBorder)
3428 srcX = min(width - 1, srcX);
3429
3430 __m128i rightNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]);
3431
3432 if (onBorder)
3433 {
3434 __m128i leftMask = PartialVectorMask(min(PIXELS_PER_ITERATION, max(ssize_t(0), i - x)) * channels * sizeof(int16_t)),
3435 rightMask = PartialVectorMask(min(PIXELS_PER_ITERATION, width - (x + i)) * channels * sizeof(int16_t));
3436 leftNeighbor = _mm_blendv_epi8(leftNeighbor, leftBorderValue, leftMask);
3437 rightNeighbor = _mm_blendv_epi8(rightBorderValue, rightNeighbor, rightMask);
3438 }
3439 vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(filter, _mm_adds_epi16(leftNeighbor, rightNeighbor)));
3440 }
3441 }
3442 else
3443 {
3444 throw 0;
3445 }
3446 x += PIXELS_PER_ITERATION;
3447 } while (x < xEnd);
3448 ScaleAndStoreInt16<true>(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum, (xEnd - (x - PIXELS_PER_ITERATION)) * channels);
3449 #endif
3450
3451 ++y;
3452 } while (y < height);
3453 }
3454 else
3455 {
3456 #ifdef __GNUC__
3457 const static void *labels[] =
3458 {
3459 NULL,
3460 &&remainder1,
3461 &&remainder2,
3462 &&remainder3,
3463 &&remainder4,
3464
3465 &&remainder5,
3466 &&remainder6,
3467 &&remainder7,
3468 &&remainder8
3469 };
3470 #endif
3471 // 1 channel
3472 // todo: can merge with 4 channel?
3473 ssize_t y = 0;
3474 do
3475 {
3476 int16_t *convertedIn;
3477 if (typeid(InType) == typeid(int16_t))
3478 {
3479 convertedIn = (int16_t *)&in[y][0];
3480 }
3481 else
3482 {
3483 convertedIn = (int16_t *)ALIGNED_ALLOCA(RoundUp(width, ssize_t(sizeof(__m256i) / sizeof(int16_t))) * sizeof(int16_t), sizeof(__m256i));
3484 #ifdef __AVX2__
3485 for (ssize_t x = 0; x < width; x += 16)
3486 {
3487 __m128i u8 = _mm_loadu_si128((__m128i *)&in[y][x]);
3488 __m256i i16 = _mm256_slli_epi16(_mm256_cvtepu8_epi16(u8), 6);
3489 _mm256_store_si256((__m256i *)&convertedIn[x], i16);
3490 }
3491 #else
3492 for (ssize_t x = 0; x < width; x += 8)
3493 {
3494 __m128i u8 = _mm_loadl_epi64((__m128i *)&in[y][x]);
3495 __m128i i16 = _mm_slli_epi16(_mm_cvtepu8_epi16(u8), 6);
3496 _mm_store_si128((__m128i *)&convertedIn[x], i16);
3497 }
3498 #endif
3499 }
3500 ssize_t x = xStart;
3501 const ssize_t SIMD_WIDTH = 8;
3502 __m128i vSum;
3503 __m128i leftBorderValue, rightBorderValue;
3504 if (onBorder)
3505 {
3506 leftBorderValue = _mm_set1_epi16(convertedIn[0 * channels]);
3507 rightBorderValue = _mm_set1_epi16(convertedIn[(width - 1) * channels]);
3508 }
3509 goto middle;
3510 do
3511 {
3512 ScaleAndStoreInt16(&out[y][x - SIMD_WIDTH], vSum);
3513 middle:
3514 if (symmetric)
3515 {
3516 #ifdef __GNUC__
3517 // up to 1.2x faster by using palignr instead of unaligned loads!
3518 // the greatest difficulty with using palignr is that the offset must be a compile time constant
3519 // solution? Duff's device
3520 __m128i leftHalf[2],
3521 rightHalf[2],
3522 center = _mm_loadu_si128((__m128i *)&convertedIn[x]);
3523
3524 if (onBorder)
3525 {
3526 __m128i mask = PartialVectorMask(min(SIMD_WIDTH, width - x) * sizeof(int16_t));
3527 center = _mm_blendv_epi8(rightBorderValue, center, mask);
3528 }
3529 rightHalf[0] = leftHalf[1] = center;
3530
3531 vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), rightHalf[0]);
3532
3533 ssize_t base = 0;
3534 while (base < filterSize)
3535 {
3536 leftHalf[0] = rightHalf[0];
3537 rightHalf[1] = leftHalf[1];
3538 rightHalf[0] = _mm_loadu_si128((__m128i *)&convertedIn[x + base + 8]);
3539 leftHalf[1] = _mm_loadu_si128((__m128i *)&convertedIn[x - base - 8]);
3540
3541 if (onBorder)
3542 {
3543 __m128i leftMask = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), (base + 8) - x)) * sizeof(int16_t)),
3544 rightMask = PartialVectorMask(min(SIMD_WIDTH, width - (x + base + 8)) * sizeof(int16_t));
3545 leftHalf[1] = _mm_blendv_epi8(leftHalf[1], leftBorderValue, leftMask);
3546 rightHalf[0] = _mm_blendv_epi8(rightBorderValue, rightHalf[0], rightMask);
3547 }
3548
3549 goto *labels[min(ssize_t(8), filterSize - base)];
3550 __m128i v, v2;
3551 remainder8:
3552 v = rightHalf[0]; // same as palignr(right, left, 16)
3553 v2 = leftHalf[1];
3554 vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 8]), _mm_adds_epi16(v, v2)));
3555 remainder7:
3556 v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 14);
3557 v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 2);
3558 vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 7]), _mm_adds_epi16(v, v2)));
3559 remainder6:
3560 v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 12);
3561 v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 4);
3562 vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 6]), _mm_adds_epi16(v, v2)));
3563 remainder5:
3564 v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 10);
3565 v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 6);
3566 vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 5]), _mm_adds_epi16(v, v2)));
3567 remainder4:
3568 v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 8);
3569 v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 8);
3570 vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 4]), _mm_adds_epi16(v, v2)));
3571 remainder3:
3572 v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 6);
3573 v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 10);
3574 vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 3]), _mm_adds_epi16(v, v2)));
3575 remainder2:
3576 v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 4);
3577 v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 12);
3578 vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 2]), _mm_adds_epi16(v, v2)));
3579 remainder1:
3580 v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 2);
3581 v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 14);
3582 vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 1]), _mm_adds_epi16(v, v2)));
3583 base += 8;
3584 }
3585 #else
3586 __m128i center;
3587 vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), LoadAndScaleToInt16(center, &convertedIn[x * channels]));
3588 for (ssize_t i = 1; i <= filterSize; ++i)
3589 {
3590 __m128i filter = Cast256To128(vFilter[i]);
3591
3592 ssize_t srcX = x - i;
3593 if (onBorder)
3594 srcX = max(-(SIMD_WIDTH - 1), srcX); // todo: use this for now until LoadAndScaleToInt16() supports partial loads
3595
3596 __m128i leftNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]);
3597
3598 srcX = x + i;
3599 if (onBorder)
3600 srcX = min(width - 1, srcX);
3601
3602 __m128i rightNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]);
3603
3604 if (onBorder)
3605 {
3606 __m128i leftMask = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), i - x)) * sizeof(int16_t)),
3607 rightMask = PartialVectorMask(min(SIMD_WIDTH, width - (x + i)) * sizeof(int16_t));
3608 leftNeighbor = _mm_blendv_epi8(leftNeighbor, leftBorderValue, leftMask);
3609 rightNeighbor = _mm_blendv_epi8(rightBorderValue, rightNeighbor, rightMask);
3610 }
3611 vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(filter, _mm_adds_epi16(leftNeighbor, rightNeighbor)));
3612 }
3613 #endif
3614 }
3615 else
3616 {
3617 throw 0;
3618 }
3619 x += SIMD_WIDTH;
3620 } while (x < xEnd);
3621 ScaleAndStoreInt16<true>(&out[y][x - SIMD_WIDTH], vSum, xEnd - (x - SIMD_WIDTH));
3622 ++y;
3623 } while (y < height);
3624 }
3625}
3626#endif
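// Illustrative sketch (hypothetical helper, not part of the patch): the 1-channel int16 path above
// uses _mm_alignr_epi8, whose byte offset must be a compile-time constant, while the filter length
// is only known at run time. The labels-as-values table and 'goto *labels[...]' therefore jump into
// an unrolled body at the right tap, Duff's-device style; a scalar analogue of that dispatch
// (GCC computed-goto extension) looks like:
#ifdef __GNUC__
static inline int SumLastTaps(const int *taps, int n) // n in [1, 4]
{
    static const void *entry[] = { NULL, &&tap1, &&tap2, &&tap3, &&tap4 };
    int sum = 0;
    goto *entry[n]; // dispatch once, then fall through the remaining taps
tap4: sum += taps[3];
tap3: sum += taps[2];
tap2: sum += taps[1];
tap1: sum += taps[0];
    return sum;
}
#endif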
3627
3628// handles blocking
3629// in-place (out = in) operation not allowed
3630template <int channels, typename OutType, typename InType>
3631void ConvolveHorizontalFIR(SimpleImage<OutType> out,
3632 SimpleImage<InType> in,
3633 ssize_t width, ssize_t height, float sigmaX)
3634{
3635#ifdef DO_FIR_IN_FLOAT
3636 typedef MyTraits<float>::SIMDtype SIMDtype;
3637#else
3638 typedef MyTraits<int16_t>::SIMDtype SIMDtype;
3639#endif
3640
3641 ssize_t halfFilterSize = _effect_area_scr(sigmaX);
3642 float *filter = (float *)alloca((halfFilterSize + 1) * sizeof(float));
3643 _make_kernel(filter, sigmaX);
3644
3645 SIMDtype *vFilter = (SIMDtype *)ALIGNED_ALLOCA((halfFilterSize + 1) * sizeof(SIMDtype), sizeof(SIMDtype));
3646
3647 for (ssize_t i = 0; i <= halfFilterSize; ++i)
3648 {
3649 #ifdef DO_FIR_IN_FLOAT
3650 BroadcastSIMD(vFilter[i], filter[i]);
3651 #else
3652 BroadcastSIMD(vFilter[i], clip_round_cast<int16_t>(filter[i] * 32768));
3653 #endif
3654 }
3655
3656 const ssize_t IDEAL_Y_BLOCK_SIZE = 1; // pointless for now, but might be needed in the future when SIMD code processes 2 rows at a time
3657
3658 #pragma omp parallel
3659 {
3660 #pragma omp for
3661 for (ssize_t y = 0; y < height; y += IDEAL_Y_BLOCK_SIZE)
3662 {
3663 ssize_t yBlockSize = min(height - y, IDEAL_Y_BLOCK_SIZE);
3664
3665 ssize_t nonBorderStart = min(width, RoundUp(halfFilterSize, ssize_t(sizeof(__m256) / channels / sizeof(InType)))); // so that data for non-border region is vector aligned
3666 if (nonBorderStart < width - halfFilterSize)
3667 ConvolveHorizontalFIR<channels, true, false>(out.SubImage(0, y),
3668 in.SubImage(0, y),
3669 width, yBlockSize,
3670 nonBorderStart, width - halfFilterSize,
3671 vFilter, halfFilterSize);
3672 ssize_t xStart = 0,
3673 xEnd = nonBorderStart;
3674 processEnd:
3675 ConvolveHorizontalFIR<channels, true, true>(out.SubImage(0, y),
3676 in.SubImage(0, y),
3677 width, yBlockSize,
3678 xStart, xEnd,
3679 vFilter, halfFilterSize);
3680 if (xStart == 0)
3681 {
3682 // keep an inline-happy compiler from inlining yet another copy of ConvolveHorizontalFIR()
3683 xStart = max(nonBorderStart, width - halfFilterSize); // don't refilter anything in case the 2 border regions overlap
3684 xEnd = width;
3685 goto processEnd;
3686 }
3687 }
3688 } // omp parallel
3689}
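// Illustrative note (not part of the patch): the wrapper above filters each row in three regions:
// an aligned interior [nonBorderStart, width - halfFilterSize) that needs no clamping, plus the
// left and right borders where out-of-range neighbours are clamped to the edge pixels.
// nonBorderStart is rounded up so the interior loads stay vector aligned, and the goto reuses a
// single instantiation of the border kernel for both edges.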
3690
3691#ifdef DO_FIR_IN_FLOAT
3692// in-place (out = in) operation not allowed
3693template <bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type>
3694void ConvolveVerticalFIR(SimpleImage<OutType> out, SimpleImage<InType> in,
3695 ssize_t width, ssize_t height,
3696 ssize_t yStart, ssize_t yEnd,
3697 SIMD_Type *vFilter, int filterSize)
3698{
3699 ssize_t y = yStart;
3700 do
3701 {
3702 ssize_t x = 0;
3703#ifdef __AVX__
3704 const ssize_t SIMD_WIDTH = 8;
3705 __m256 vSum;
3706 goto middle;
3707 do
3708 {
3709 StoreFloats(&out[y][x - SIMD_WIDTH], vSum); // write out data from previous iteration
3710 middle:
3711 if (symmetric)
3712 {
3713 __m256 vIn;
3714 LoadFloats(vIn, &in[y][x]);
3715 vSum = vFilter[0] * vIn;
3716 for (ssize_t i = 1; i <= filterSize; ++i)
3717 {
3718 ssize_t srcY = y - i;
3719 if (onBorder)
3720 srcY = max(ssize_t(0), srcY);
3721 __m256 bottom, top;
3722 LoadFloats(bottom, &in[srcY][x]);
3723
3724 srcY = y + i;
3725 if (onBorder)
3726 srcY = min(height - 1, srcY);
3727 LoadFloats(top, &in[srcY][x]);
3728
3729 vSum = vSum + vFilter[i] * (bottom + top);
3730 }
3731 }
3732 else
3733 {
3734 // wouldn't be surprised if the smaller & simpler do-while code outweighs the cost of the extra add
3735 vSum = _mm256_setzero_ps();
3736 ssize_t i = 0;
3737 do
3738 {
3739 ssize_t srcY = y - filterSize / 2 + i;
3740 if (onBorder)
3741 srcY = min(height - 1, max(ssize_t(0), srcY));
3742 __m256 vIn;
3743 LoadFloats(vIn, &in[srcY][x]);
3744 vSum = vSum + vFilter[i] * vIn;
3745 ++i;
3746 } while (i < filterSize);
3747 }
3748 x += SIMD_WIDTH;
3749 } while (x < width);
3750 StoreFloats<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH));
3751#else
3752 // for SSE only
3753 const ssize_t SIMD_WIDTH = 4;
3754 __m128 vSum;
3755 goto middle;
3756 do
3757 {
3758 StoreFloats(&out[y][x - SIMD_WIDTH], vSum); // write out data from previous iteration
3759 middle:
3760 if (symmetric)
3761 {
3762 __m128 vIn;
3763 LoadFloats(vIn, &in[y][x]);
3764 vSum = Cast256To128(vFilter[0]) * vIn;
3765 for (ssize_t i = 1; i <= filterSize; ++i)
3766 {
3767 ssize_t srcY = y - i;
3768 if (onBorder)
3769 srcY = max(ssize_t(0), srcY);
3770 __m128 bottom, top;
3771 LoadFloats(bottom, &in[srcY][x]);
3772
3773 srcY = y + i;
3774 if (onBorder)
3775 srcY = min(height - 1, srcY);
3776 LoadFloats(top, &in[srcY][x]);
3777
3778 vSum = vSum + Cast256To128(vFilter[i]) * (bottom + top);
3779 }
3780 }
3781 else
3782 {
3783 vSum = _mm_setzero_ps();
3784 ssize_t i = 0;
3785 do
3786 {
3787 ssize_t srcY = y - filterSize / 2 + i;
3788 if (onBorder)
3789 srcY = min(height - 1, max(ssize_t(0), srcY));
3790 __m128 _vFilter = Cast256To128(vFilter[i]);
3791
3792 __m128 vIn;
3793 LoadFloats(vIn, &in[srcY][x]);
3794 vSum = vSum + _vFilter * vIn;
3795 ++i;
3796 } while (i < filterSize);
3797 }
3798 x += SIMD_WIDTH;
3799 } while (x < width);
3800 StoreFloats<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH));
3801#endif
3802 ++y;
3803 } while (y < yEnd);
3804}
3805
3806#else // DO_FIR_IN_FLOAT
3807
3808// in-place (out = in) operation not allowed
3809template <bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type>
3810void ConvolveVerticalFIR(SimpleImage<OutType> out, SimpleImage<InType> in,
3811 ssize_t width, ssize_t height,
3812 ssize_t yStart, ssize_t yEnd,
3813 SIMD_Type *vFilter, int filterSize)
3814{
3815 ssize_t y = yStart;
3816 do
3817 {
3818 ssize_t x = 0;
3819 #ifdef __AVX2__
3820 const ssize_t SIMD_WIDTH = 16;
3821 __m256i vSum;
3822 goto middle;
3823 do
3824 {
3825 ScaleAndStoreInt16(&out[y][x - SIMD_WIDTH], vSum); // store data from previous iteration
3826 middle:
3827 if (symmetric)
3828 {
3829 __m256i center;
3830 vSum = _mm256_mulhrs_epi16(vFilter[0], LoadAndScaleToInt16(center, &in[y][x]));
3831 for (ssize_t i = 1; i <= filterSize; ++i)
3832 {
3833 __m256i filter = vFilter[i];
3834 ssize_t srcY = y + i;
3835 if (onBorder)
3836 srcY = min(srcY, height - 1);
3837 __m256i topNeighbor;
3838 LoadAndScaleToInt16(topNeighbor, &in[srcY][x]);
3839
3840 srcY = y - i;
3841 if (onBorder)
3842 srcY = max(srcY, ssize_t(0));
3843 __m256i bottomNeighbor;
3844 LoadAndScaleToInt16(bottomNeighbor, &in[srcY][x]);
3845 vSum = _mm256_adds_epi16
3846 (
3847 vSum,
3848 _mm256_mulhrs_epi16
3849 (
3850 filter,
3851 _mm256_adds_epi16(bottomNeighbor, topNeighbor)
3852 )
3853 );
3854 }
3855 }
3856 else
3857 {
3858 throw 0;
3859 }
3860 x += SIMD_WIDTH;
3861 } while (x < width);
3862 ScaleAndStoreInt16<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH));
3863 #else
3864 const ssize_t SIMD_WIDTH = 8;
3865 __m128i vSum;
3866 goto middle;
3867 do
3868 {
3869 ScaleAndStoreInt16(&out[y][x - SIMD_WIDTH], vSum); // store data from previous iteration
3870 middle:
3871 if (symmetric)
3872 {
3873 __m128i center;
3874 vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), LoadAndScaleToInt16(center, &in[y][x]));
3875 for (ssize_t i = 1; i <= filterSize; ++i)
3876 {
3877 __m128i filter = Cast256To128(vFilter[i]);
3878 ssize_t srcY = y + i;
3879 if (onBorder)
3880 srcY = min(srcY, height - 1);
3881 __m128i topNeighbor;
3882 LoadAndScaleToInt16(topNeighbor, &in[srcY][x]);
3883
3884 srcY = y - i;
3885 if (onBorder)
3886 srcY = max(srcY, ssize_t(0));
3887 __m128i bottomNeighbor;
3888 LoadAndScaleToInt16(bottomNeighbor, &in[srcY][x]);
3889 vSum = _mm_adds_epi16
3890 (
3891 vSum,
3892 _mm_mulhrs_epi16
3893 (
3894 filter,
3895 _mm_adds_epi16(bottomNeighbor, topNeighbor)
3896 )
3897 );
3898 }
3899 }
3900 else
3901 {
3902 throw 0;
3903 }
3904 x += SIMD_WIDTH;
3905 } while (x < width);
3906 ScaleAndStoreInt16<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH));
3907 #endif
3908 ++y;
3909 } while (y < yEnd);
3910}
3911#endif
3912
3913// in-place (out = in) operation not allowed
3914template <typename OutType, typename InType>
3915void ConvolveVerticalFIR(SimpleImage<OutType> out,
3916 SimpleImage<InType> in,
3917 ssize_t width, ssize_t height,
3918 float sigmaY)
3919{
3920#ifdef DO_FIR_IN_FLOAT
3921 typedef MyTraits<float>::SIMDtype SIMDtype;
3922#else
3923 typedef MyTraits<int16_t>::SIMDtype SIMDtype;
3924#endif
3925 int halfFilterSize = _effect_area_scr(sigmaY);
3926
3927 float *filter = (float *)alloca((halfFilterSize + 1) * sizeof(float));
3928
3929 _make_kernel(filter, sigmaY);
3930
3931 SIMDtype *vFilter = (SIMDtype *)ALIGNED_ALLOCA((halfFilterSize + 1) * sizeof(SIMDtype), sizeof(SIMDtype));
3932
3933 for (ssize_t i = 0; i <= halfFilterSize; ++i)
3934 {
3935 #ifdef DO_FIR_IN_FLOAT
3936 BroadcastSIMD(vFilter[i], filter[i]);
3937 #else
3938 BroadcastSIMD(vFilter[i], clip_round_cast<int16_t>(filter[i] * 32768));
3939 #endif
3940 }
3941
3942 const ssize_t IDEAL_Y_BLOCK_SIZE = 2; // currently, no advantage to making > 1
3943
3944 #pragma omp parallel
3945 {
3946 #pragma omp for
3947 for (ssize_t y = 0; y < height; y += IDEAL_Y_BLOCK_SIZE)
3948 {
3949 ssize_t yBlockSize = min(height - y, IDEAL_Y_BLOCK_SIZE);
3950 bool onBorder = y < halfFilterSize || y + IDEAL_Y_BLOCK_SIZE + halfFilterSize > height;
3951 if (onBorder)
3952 {
3953 ConvolveVerticalFIR<true, true>(out, in,
3954 width, height,
3955 y, y + yBlockSize,
3956 vFilter, halfFilterSize);
3957
3958 }
3959 else
3960 {
3961 ConvolveVerticalFIR<true, false>(out, in,
3962 width, height,
3963 y, y + yBlockSize,
3964 vFilter, halfFilterSize);
3965 }
3966 }
3967 } // omp parallel
3968}
3969
3970template <int channels, typename OutType, typename InType>
3971void ConvolveFIR(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY)
3972{
3973 using namespace std::chrono;
3974#ifdef DO_FIR_IN_FLOAT
3975 AlignedImage<float, sizeof(__m256)> horizontalFiltered;
3976#else
3977 AlignedImage<int16_t, sizeof(__m256)> horizontalFiltered;
3978#endif
3979 horizontalFiltered.Resize(width * channels, height);
3980
3981 const bool DO_TIMING = false;
3982
3983 high_resolution_clock::time_point t0;
3984 if (DO_TIMING)
3985 t0 = high_resolution_clock::now();
3986
3987 ConvolveHorizontalFIR<channels>(horizontalFiltered, in, width, height, sigmaX);
3988
3989 if (DO_TIMING)
3990 {
3991 auto t1 = high_resolution_clock::now();
3992 cout << "T_horiz=" << duration_cast<milliseconds>(t1 - t0).count() << " ms" << endl;
3993 t0 = t1;
3994 }
3995
3996 // todo: use sliding window to reduce cache pollution
3997 float scale = 1.0f;
3998#ifndef DO_FIR_IN_FLOAT
3999 scale = 1.0f / 64;
4000#endif
4001
4002 //SaveImage("horizontal_filtered.png", horizontalFiltered, width, height, channels, scale);
4003 ConvolveVerticalFIR(out, horizontalFiltered, width * channels, height, sigmaY);
4004 if (DO_TIMING)
4005 cout << "T_v=" << duration_cast<milliseconds>(high_resolution_clock::now() - t0).count() << " ms" << endl;
4006}
4007
=== modified file 'src/display/nr-filter-gaussian.cpp'
--- src/display/nr-filter-gaussian.cpp 2014-03-27 01:33:44 +0000
+++ src/display/nr-filter-gaussian.cpp 2016-12-22 06:18:34 +0000
@@ -12,13 +12,13 @@
  */
 
 #include "config.h" // Needed for HAVE_OPENMP
-
 #include <algorithm>
 #include <cmath>
 #include <complex>
 #include <cstdlib>
 #include <glib.h>
 #include <limits>
+#include <typeinfo>
 #if HAVE_OPENMP
 #include <omp.h>
 #endif //HAVE_OPENMP
@@ -32,11 +32,18 @@
 #include <2geom/affine.h>
 #include "util/fixed_point.h"
 #include "preferences.h"
+#include <fstream>
+#include <iomanip>
+#include <cpuid.h>
+#include <chrono>
+#include "SimpleImage.h"
 
 #ifndef INK_UNUSED
 #define INK_UNUSED(x) ((void)(x))
 #endif
 
+using namespace std;
+
 // IIR filtering method based on:
 // L.J. van Vliet, I.T. Young, and P.W. Verbeek, Recursive Gaussian Derivative Filters,
 // in: A.K. Jain, S. Venkatesh, B.C. Lovell (eds.),
@@ -54,10 +61,12 @@
 // filters are used).
 static size_t const N = 3;
 
+#if __cplusplus < 201103
 template<typename InIt, typename OutIt, typename Size>
 inline void copy_n(InIt beg_in, Size N, OutIt beg_out) {
     std::copy(beg_in, beg_in+N, beg_out);
 }
+#endif
 
 // Type used for IIR filter coefficients (can be 10.21 signed fixed point, see Anisotropic Gaussian Filtering Using Fixed Point Arithmetic, Christoph H. Lampert & Oliver Wirjadi, 2006)
 typedef double IIRValue;
@@ -123,6 +132,11 @@
 
 FilterGaussian::FilterGaussian()
 {
+    extern void (*GaussianBlurIIR_Y8)(SimpleImage<uint8_t>, SimpleImage<uint8_t>, ssize_t, ssize_t, float, float);
+    void InitializeSIMDFunctions();
+    if (GaussianBlurIIR_Y8 == NULL)
+        InitializeSIMDFunctions();
+
     _deviation_x = _deviation_y = 0.0;
 }
 
@@ -142,8 +156,8 @@
     return (int)std::ceil(std::fabs(deviation) * 3.0);
 }
 
-static void
-_make_kernel(FIRValue *const kernel, double const deviation)
+template <typename FIRValue>
+static void _make_kernel(FIRValue *const kernel, double const deviation)
 {
     int const scr_len = _effect_area_scr(deviation);
     g_assert(scr_len >= 0);
@@ -546,6 +560,559 @@
546 };
547}
548
563#ifdef _MSC_VER
564 #define FORCE_INLINE __forceinline
565 #define ALIGN(x) __declspec(align(x))
566#else
567 #define FORCE_INLINE inline __attribute__((always_inline))
568 #define ALIGN(x) __attribute__((aligned(x)))
569#endif
570
571void (*GaussianBlurIIR_Y8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY);
572void (*GaussianBlurIIR_R8G8B8A8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY);
573
574void (*GaussianBlurFIR_Y8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY);
575void (*GaussianBlurFIR_R8G8B8A8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY);
576
577void(*GaussianBlurHorizontalIIR_Y8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput);
578void(*GaussianBlurHorizontalIIR_R8G8B8A8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput);
579void(*GaussianBlurVerticalIIR)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaY); // works for grayscale & RGBA
580
581template <typename AnyType>
582void SaveImage(const char *path, SimpleImage<AnyType> in, int width, int height, int channels, float scale = 1.0f)
583{
584 cairo_surface_t *scaled = cairo_image_surface_create(channels == 1 ? CAIRO_FORMAT_A8 : CAIRO_FORMAT_ARGB32, width, height);
585
586 uint8_t *_scaled = cairo_image_surface_get_data(scaled);
587 ssize_t scaledPitch = cairo_image_surface_get_stride(scaled);
588 for (int y = 0; y < height; ++y)
589 {
590 for (int x = 0; x < width * channels; ++x)
591 {
592 _scaled[y * scaledPitch + x] = min(in[y][x] * scale, 255.0f);
593 }
594 }
595 cairo_surface_mark_dirty(scaled);
596 if (cairo_surface_write_to_png(scaled, path) != CAIRO_STATUS_SUCCESS)
597 throw 0;
598}
599
600#if defined(__x86_64__) || defined(__i386__) || defined (_M_X64) || defined(_M_IX86) // if x86 processor
601
602#include <immintrin.h>
603
604#ifndef __GNUC__
605 #define __SSE__
606 #define __SSE2__
607#endif
608
609const float MAX_SIZE_FOR_SINGLE_PRECISION = 30.0f; // switch to double when sigma exceeds this threshold; otherwise round-off error grows so large that the output barely changes and visible Mach bands appear
610
611const size_t GUARANTEED_ALIGNMENT = 16;
612
613#define ALIGNED_ALLOCA(size, alignment) RoundUp((size_t)alloca(size + (((alignment) / GUARANTEED_ALIGNMENT) - 1) * GUARANTEED_ALIGNMENT), alignment)
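// Illustrative note (not part of the patch): ALIGNED_ALLOCA relies on alloca() returning storage
// aligned to GUARANTEED_ALIGNMENT (16 bytes here); it over-allocates by the worst-case padding and
// RoundUp() then advances the pointer to the requested alignment, e.g. for a 32-byte alignment it
// reserves 16 extra bytes so the rounded-up pointer still lies inside the allocation.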
614
615const ALIGN(32) uint8_t PARTIAL_VECTOR_MASK[64] =
616{
617 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
618 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
619};
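// Illustrative note (assumption: PartialVectorMask()/PartialVectorMask32(), defined in
// gaussian_blur_templates.h, index into this table): it is laid out as 32 bytes of 0xff followed by
// 32 bytes of 0x00, so loading a 16- or 32-byte vector starting at offset (32 - n) yields a mask
// whose first n bytes are set. Those masks drive the blends and partial stores that handle SIMD
// remainders at the ends of rows.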
620
621#ifdef _MSC_VER
622#define __FMA__
623#define __AVX2__
624#define __AVX__
625#define __SSE4_1__
626#define __SSSE3__
627#include "gaussian_blur_templates.h"
628
629#else
630
631#pragma GCC push_options
632
633namespace AVX2
634{
635#pragma GCC target("fma,avx2")
636#define __FMA__
637#define __AVX2__
638#define __AVX__
639#define __SSE4_1__
640#define __SSSE3__
641#include "gaussian_blur_templates.h"
642}
643
644namespace AVX
645{
646#pragma GCC target("avx")
647#undef __AVX2__
648#undef __FMA__
649
650#include "gaussian_blur_templates.h"
651}
652
653namespace SSE2
654{
655//#ifdef __x86_64__
656// #pragma GCC target("default") // base x86_64 ISA already has SSE2
657//#else
658 #pragma GCC target("sse2")
659//#endif
660#undef __AVX__
661#undef __SSE4_1__
662#undef __SSSE3__
663#include "gaussian_blur_templates.h"
664}
665
666#pragma GCC pop_options
667
668#endif
669
670void InitializeSIMDFunctions()
671{
672#ifdef __GNUC__
673 enum SIMDarch { SSE2, AVX, AVX2 };
674 const char *SIMDarchNames[] = { "SSE2", "AVX", "AVX2" };
675 SIMDarch arch;
676 if (getenv("FORCE_SIMD") != NULL)
677 {
678 arch = (SIMDarch)atoi(getenv("FORCE_SIMD"));
679 }
680 else
681 {
682 unsigned cpuInfo[4];
683 // version in cpuid.h is buggy - forgets to clear ecx
684 #define __cpuid(level, a, b, c, d) \
685 __asm__("xor %%ecx, %%ecx\n" \
686 "cpuid\n" \
687 : "=a"(a), "=b"(b), "=c"(c), "=d"(d) \
688 : "0"(level))
689 int maxLevel = __get_cpuid_max(0, 0);
690
691 cpuInfo[1] = 0;
692 if (maxLevel >= 7)
693 __cpuid(7, cpuInfo[0], cpuInfo[1], cpuInfo[2], cpuInfo[3]);
694
695 if (cpuInfo[1] & bit_AVX2)
696 {
697 arch = AVX2;
698 }
699 else
700 {
701 __cpuid(1, cpuInfo[0], cpuInfo[1], cpuInfo[2], cpuInfo[3]);
702 if (cpuInfo[2] & bit_AVX)
703 arch = AVX;
704 else if (cpuInfo[3] & bit_SSE2)
705 arch = SSE2;
706 }
707 }
708 cout << "using " << SIMDarchNames[arch] << " functions" << endl;
709 switch (arch)
710 {
711 case SSE2:
712 GaussianBlurIIR_Y8 = SSE2::Convolve<1>;
713 GaussianBlurIIR_R8G8B8A8 = SSE2::Convolve<4>;
714 GaussianBlurFIR_Y8 = SSE2::ConvolveFIR<1>;
715 GaussianBlurFIR_R8G8B8A8 = SSE2::ConvolveFIR<4>;
716 GaussianBlurHorizontalIIR_Y8 = SSE2::ConvolveHorizontal<false, 1>;
717 GaussianBlurHorizontalIIR_R8G8B8A8 = SSE2::ConvolveHorizontal<false, 4>;
718 GaussianBlurVerticalIIR = SSE2::ConvolveVertical;
719 break;
720 case AVX:
721 GaussianBlurIIR_Y8 = AVX::Convolve<1>;
722 GaussianBlurIIR_R8G8B8A8 = AVX::Convolve<4>;
723 GaussianBlurFIR_Y8 = AVX::ConvolveFIR<1>;
724 GaussianBlurFIR_R8G8B8A8 = AVX::ConvolveFIR<4>;
725 GaussianBlurHorizontalIIR_Y8 = AVX::ConvolveHorizontal<false, 1>;
726 GaussianBlurHorizontalIIR_R8G8B8A8 = AVX::ConvolveHorizontal<false, 4>;
727 GaussianBlurVerticalIIR = AVX::ConvolveVertical;
728 break;
729 case AVX2:
730 GaussianBlurIIR_Y8 = AVX2::Convolve<1>;
731 GaussianBlurIIR_R8G8B8A8 = AVX2::Convolve<4>;
732 GaussianBlurFIR_Y8 = AVX2::ConvolveFIR<1>;
733 GaussianBlurFIR_R8G8B8A8 = AVX2::ConvolveFIR<4>;
734 GaussianBlurHorizontalIIR_Y8 = AVX2::ConvolveHorizontal<false, 1>;
735 GaussianBlurHorizontalIIR_R8G8B8A8 = AVX2::ConvolveHorizontal<false, 4>;
736 GaussianBlurVerticalIIR = AVX2::ConvolveVertical;
737 break;
738 }
739#else
740 GaussianBlurIIR_Y8 = Convolve<1>;
741 GaussianBlurIIR_R8G8B8A8 = Convolve<4>;
742 GaussianBlurFIR_Y8 = ConvolveFIR<1>;
743 GaussianBlurFIR_R8G8B8A8 = ConvolveFIR<4>;
744 GaussianBlurHorizontalIIR_Y8 = ConvolveHorizontal<false, 1>;
745 GaussianBlurHorizontalIIR_R8G8B8A8 = ConvolveHorizontal<false, 4>;
746 GaussianBlurVerticalIIR = ConvolveVertical;
747#endif
748}
749#else
750void InitializeSIMDFunctions()
751{
752}
753#endif
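// Usage note (illustrative, not part of the patch): InitializeSIMDFunctions() normally picks the
// widest ISA reported by CPUID, but for debugging the choice can be overridden with the FORCE_SIMD
// environment variable; following the SIMDarch enum above, FORCE_SIMD=0 selects the SSE2 kernels,
// 1 selects AVX and 2 selects AVX2.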
754
755
756#ifdef UNIT_TEST
757
758template <typename AnyType>
759void CompareImages(SimpleImage<AnyType> ref, SimpleImage<AnyType> actual, int w, int h)
760{
761 double avgDiff = 0,
762 maxDiff = 0,
763 maxErrorFrac = 0;
764 for (int y = 0; y < h; ++y)
765 {
766 for (int x = 0; x < w; ++x)
767 {
768 double diff = abs((double)actual[y][x] - (double)ref[y][x]);
769 maxDiff = max(diff, maxDiff);
770 avgDiff += diff;
771 if (ref[y][x] != 0)
772 {
773 double errorFrac = abs(diff / ref[y][x]);
774 maxErrorFrac = max(errorFrac, maxErrorFrac);
775 }
776 }
777 }
778 avgDiff /= (w * h);
779 cout << "avgDiff=" << setprecision(4) << setw(9) << avgDiff << " maxDiff=" << setw(4) << maxDiff << " maxErrorFrac=" << setprecision(4) << setw(8) << maxErrorFrac << endl;
780}
781
782void RefFilterIIR(cairo_surface_t *out, cairo_surface_t *in,
783 float deviation_x, float deviation_y)
784{
785 int threads = omp_get_max_threads();
786 const int MAX_THREADS = 16;
787
788 int h = cairo_image_surface_get_height(in),
789 w = cairo_image_surface_get_width(in);
790
791 IIRValue * tmpdata[MAX_THREADS];
792 for (int i = 0; i < threads; ++i)
793 tmpdata[i] = new IIRValue[std::max(w, h)*4];
794
795 gaussian_pass_IIR(Geom::X, deviation_x, in, out, tmpdata, threads);
796 gaussian_pass_IIR(Geom::Y, deviation_y, out, out, tmpdata, threads);
797
798 for (int i = 0; i < threads; ++i)
799 delete[] tmpdata[i];
800}
801
802void RefFilterFIR(cairo_surface_t *out, cairo_surface_t *in,
803 float deviation_x, float deviation_y)
804{
805 int threads = omp_get_max_threads();
806 gaussian_pass_FIR(Geom::X, deviation_x, in, out, threads);
807 gaussian_pass_FIR(Geom::Y, deviation_y, out, out, threads);
808}
809
810
811cairo_surface_t *ConvertToGrayscale(cairo_surface_t *s)
812{
813 int w = cairo_image_surface_get_width(s),
814 h = cairo_image_surface_get_height(s);
815 cairo_surface_t *temp = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, w, h),
816 *grayScale = cairo_image_surface_create(CAIRO_FORMAT_A8, w, h);
817
818 cairo_t *r = cairo_create(temp);
819 // set to 0
820 cairo_set_source_rgb(r, 0, 0, 0);
821 cairo_paint(r);
822
823 // convert to gray scale
824 cairo_set_operator(r, CAIRO_OPERATOR_HSL_LUMINOSITY);
825 cairo_set_source_surface(r, s, 0, 0);
826 cairo_paint(r);
827 cairo_destroy(r);
828
829 ssize_t inPitch = cairo_image_surface_get_stride(temp),
830 outPitch = cairo_image_surface_get_stride(grayScale);
831 uint8_t *in = cairo_image_surface_get_data(temp),
832 *out = cairo_image_surface_get_data(grayScale);
833 for (int y = 0; y < h; ++y)
834 {
835 for (int x = 0; x < w; ++x)
836 {
837 out[y * outPitch + x] = in[y * inPitch + x * 4];
838 }
839 }
840 cairo_surface_destroy(temp);
841 cairo_surface_mark_dirty(grayScale);
842 return grayScale;
843 }
844
845 void CopySurface(cairo_surface_t *out, cairo_surface_t *in)
846 {
847 int pitch = cairo_image_surface_get_stride(in),
848 h = cairo_image_surface_get_height(in);
849
850 memcpy(cairo_image_surface_get_data(out), cairo_image_surface_get_data(in), pitch * h);
851 cairo_surface_mark_dirty(out);
852 }
853
854 extern "C" int main(int argc, char **argv)
855 {
856 using namespace boost::chrono;
857 bool compareOrBenchmark = false;
858 const char *imagePath = "../drmixx/rasterized/99_showdown_carcass.png";
859 _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // flush denormals to zero; subnormal values otherwise trigger very slow microcode assists
860
861 cairo_surface_t *in;
862 bool result;
863 for (int i = 1; i < argc; ++i)
864 {
865 if (strcmp(argv[i], "-b") == 0)
866 compareOrBenchmark = true;
867 else
868 imagePath = argv[i];
869 }
870 in = cairo_image_surface_create_from_png(imagePath);
871
872 // note: image rows that are not 16-byte aligned can be almost 2x slower
873 //iluScale(RoundUp(ilGetInteger(IL_IMAGE_WIDTH), 4), RoundUp(ilGetInteger(IL_IMAGE_HEIGHT), 4), 0); // leftover from the old DevIL-based loader
874
875 if (cairo_surface_status(in) != CAIRO_STATUS_SUCCESS)
876 {
877 cerr << "error loading" << endl;
878 return -1;
879 }
880
881 //if (!iluScale(38, 44, 1))
882 // cerr << "error scaling" << endl;
883 int originalHeight = cairo_image_surface_get_height(in),
884 originalWidth = cairo_image_surface_get_width(in);
885
886 cairo_surface_t *grayScaleIn = ConvertToGrayscale(in);
887 if (GaussianBlurIIR_Y8 == NULL)
888 InitializeSIMDFunctions();
889
890 auto IterateCombinations = [&](int adjustedWidth0, int adjustedWidth1, int adjustedHeight0, int adjustedHeight1, auto callback)
891 {
892 for (int adjustedHeight = adjustedHeight0; adjustedHeight <= adjustedHeight1; ++adjustedHeight)
893 {
894 for (int adjustedWidth = adjustedWidth0; adjustedWidth <= adjustedWidth1; ++adjustedWidth)
895 {
896 for (int channels = 1; channels <= 4; channels += 3)
897 {
898 cairo_surface_t *modifiedIn = cairo_surface_create_similar_image(in, channels == 1 ? CAIRO_FORMAT_A8 : CAIRO_FORMAT_ARGB32, adjustedWidth, adjustedHeight);
899 if (cairo_surface_status(modifiedIn) != CAIRO_STATUS_SUCCESS)
900 {
901 cerr << "error creating surface" << endl;
902 continue;
903 }
904 cairo_t *ct = cairo_create(modifiedIn);
905 cairo_scale(ct, double(adjustedWidth) / originalWidth, double(adjustedHeight) / originalHeight);
906 // scale/convert to given size/format
907 if (channels == 1)
908 {
909 cairo_set_source_rgb(ct, 1, 1, 1);
910 cairo_mask_surface(ct, grayScaleIn, 0, 0);
911 }
912 else
913 {
914 cairo_set_source_surface(ct, in, 0, 0);
915 cairo_paint(ct);
916 }
917 cairo_destroy(ct);
918
919 cairo_surface_t *out = cairo_surface_create_similar_image(modifiedIn, cairo_image_surface_get_format(modifiedIn), adjustedWidth, adjustedHeight);
920
921 for (int FIRorIIR = 0; FIRorIIR < 2; ++FIRorIIR)
922 {
923 for (int inPlace = 0; inPlace < 2; ++inPlace)
924 {
925 if (inPlace)
926 CopySurface(out, modifiedIn); // restore overwritten input image
927
928 callback(out, inPlace ? out : modifiedIn, modifiedIn, inPlace, FIRorIIR);
929 }
930 }
931 cairo_surface_destroy(modifiedIn);
932 cairo_surface_destroy(out);
933 }
934 }
935 }
936 };
937 if (!compareOrBenchmark)
938 {
939 auto CompareFunction = [&](cairo_surface_t *out, cairo_surface_t *in, cairo_surface_t *backupIn, bool inPlace, bool FIRorIIR)
940 {
941 // here, we assume input & output have the same format
942 int channels = cairo_image_surface_get_format(in) == CAIRO_FORMAT_ARGB32 ? 4 : 1;
943
944 SimpleImage <uint8_t> _in(cairo_image_surface_get_data(in), cairo_image_surface_get_stride(in)),
945 _out(cairo_image_surface_get_data(out), cairo_image_surface_get_stride(out));
946 int width = cairo_image_surface_get_width(in),
947 height = cairo_image_surface_get_height(in);
948
949 cairo_surface_t *refOut = cairo_surface_create_similar_image(in, channels == 1 ? CAIRO_FORMAT_A8 : CAIRO_FORMAT_ARGB32, width, height);
950
951 SimpleImage <uint8_t> _refOut(cairo_image_surface_get_data(refOut), cairo_image_surface_get_stride(refOut));
952 bool originalSize = width == originalWidth && height == originalHeight;
953
954 // test the correctness of different sigmas only for the original image size
955 // testing multiple sigmas for scaled image sizes is a waste
956 float DEFAULT_SIGMA = 5.0f;
957
958 float sigmaX0 = originalSize ? 0.5f : DEFAULT_SIGMA,
959 sigmaX1 = originalSize ? (FIRorIIR ? 64 : 4) : DEFAULT_SIGMA;
960 for (float sigmaX = sigmaX0; sigmaX <= sigmaX1; sigmaX *= 2)
961 {
962 float sigmaY = 1.2f * sigmaX;
963
964 cout << width << "x" << height
965 << setw(10) << (channels == 4 ? " RGBA" : " grayscale")
966 << (FIRorIIR ? " IIR" : " FIR")
967 << setw(15) << (inPlace ? " in-place" : " out-of-place")
968 << " sigmaX=" << setw(3) << sigmaX << " ";
969 double dt = 0;
970
971 if (inPlace)
972 {
973 CopySurface(out, backupIn);
974 }
975
976
977 if (FIRorIIR)
978 {
979 if (channels == 1)
980 GaussianBlurIIR_Y8(_out, _in, width, height, sigmaX, sigmaY);
981 else
982 GaussianBlurIIR_R8G8B8A8(_out, _in, width, height, sigmaX, sigmaY);
983 }
984 else
985 {
986 if (channels == 1)
987 GaussianBlurFIR_Y8(_out, _in, width, height, sigmaX, sigmaY);
988 else
989 GaussianBlurFIR_R8G8B8A8(_out, _in, width, height, sigmaX, sigmaY);
990 }
991
992 // ---------------------reference
993 cairo_surface_t *refIn;
994 if (inPlace)
995 {
996 refIn = refOut;
997 CopySurface(refIn, backupIn);
998 }
999 else
1000 {
1001 refIn = in;
1002 }
1003
1004 if (FIRorIIR)
1005 RefFilterIIR(refOut, refIn, sigmaX, sigmaY);
1006 else
1007 RefFilterFIR(refOut, refIn, sigmaX, sigmaY);
1008
1009 if (0) // debug dump, e.g. enable for FIRorIIR && width == 1466 && sigmaX == 0.5f
1010 {
1011 cout << " dumping ";
1012 cairo_surface_write_to_png(refOut, "filtered_ref.png");
1013 cairo_surface_write_to_png(out, "filtered_opt.png");
1014 exit(1);
1015 }
1016
1017 CompareImages(_refOut, _out, width, height);
1018 }
1019 cairo_surface_destroy(refOut);
1020 };
1021
1022 const int SIMD_Y_BLOCK_SIZE = 2, // for checking SIMD remainder handling issues
1023 SIMD_X_BLOCK_SIZE = 4;
1024 int h0 = RoundDown(originalHeight, SIMD_Y_BLOCK_SIZE),
1025 w0 = RoundDown(originalWidth, SIMD_X_BLOCK_SIZE);
1026 IterateCombinations(w0, w0 + SIMD_X_BLOCK_SIZE, h0, h0 + SIMD_Y_BLOCK_SIZE, CompareFunction);
1027 //IterateCombinations(4, 8, 4, 8, CompareFunction);
1028
1029 }
1030 else
1031 {
1032 // benchmark
1033 #if defined(_WIN32) && defined(_DEBUG)
1034 const int REPEATS = 1;
1035 #else
1036 const int REPEATS = 50 * max(1.0f, sqrtf(omp_get_max_threads())); // assume speedup is sqrt(#cores)
1037 #endif
1038 auto BenchmarkFunction = [&](cairo_surface_t *out, cairo_surface_t *in, cairo_surface_t *backupIn, bool inPlace, bool FIRorIIR)
1039 {
1040 // here, we assume input & output have the same format
1041 int channels = cairo_image_surface_get_format(in) == CAIRO_FORMAT_ARGB32 ? 4 : 1;
1042
1043 SimpleImage <uint8_t> _in(cairo_image_surface_get_data(in), cairo_image_surface_get_stride(in)),
1044 _out(cairo_image_surface_get_data(out), cairo_image_surface_get_stride(out));
1045 int width = cairo_image_surface_get_width(in),
1046 height = cairo_image_surface_get_height(in);
1047 const bool useRefCode = false;
1048
1049 float sigma0, sigma1;
1050 if (FIRorIIR)
1051 {
1052 // test single precision & double precision throughput
1053 sigma0 = MAX_SIZE_FOR_SINGLE_PRECISION * 0.75f;
1054 sigma1 = sigma0 * 2.0f;
1055 }
1056 else
1057 {
1058 sigma0 = 0.5f;
1059 sigma1 = 4;
1060 }
1061
1062 for (float sigma = sigma0; sigma <= sigma1; sigma *= 2)
1063 {
1064 cout << width << "x" << height
1065 << setw(10) << (channels == 4 ? " RGBA" : " grayscale")
1066 << (FIRorIIR ? " IIR" : " FIR")
1067 << setw(15) << (inPlace ? " in-place" : " out-of-place")
1068 << " sigma=" << setw(3) << sigma;
1069 high_resolution_clock::duration dt(0);
1070 for (int i = 0; i < REPEATS; ++i)
1071 {
1072 if (inPlace)
1073 CopySurface(out, backupIn); // copy backup to input/output
1074
1075 auto t0 = high_resolution_clock::now();
1076 if (useRefCode)
1077 {
1078 if (FIRorIIR)
1079 RefFilterIIR(out, in, sigma, sigma);
1080 else
1081 RefFilterFIR(out, in, sigma, sigma);
1082 }
1083 else
1084 {
1085 if (FIRorIIR)
1086 {
1087 if (channels == 1)
1088 GaussianBlurIIR_Y8(_out, _in, width, height, sigma, sigma);
1089 else
1090 GaussianBlurIIR_R8G8B8A8(_out, _in, width, height, sigma, sigma);
1091 }
1092 else
1093 {
1094 if (channels == 1)
1095 GaussianBlurFIR_Y8(_out, _in, width, height, sigma, sigma);
1096 else
1097 GaussianBlurFIR_R8G8B8A8(_out, _in, width, height, sigma, sigma);
1098 }
1099 }
1100 dt += high_resolution_clock::now() - t0;
1101 }
1102 cout << setw(9) << setprecision(3) << double(width * height * REPEATS) / duration_cast<microseconds>(dt).count() << " Mpix/s" << endl; // pixels per microsecond == Mpix/s
1103 }
1104 };
1105 int roundedWidth = RoundUp(originalWidth, 4),
1106 roundedHeight = RoundUp(originalHeight, 4);
1107 IterateCombinations(roundedWidth, roundedWidth, roundedHeight, roundedHeight, BenchmarkFunction);
1108 }
1109 cairo_surface_destroy(in);
1110 cairo_surface_destroy(grayScaleIn);
1111 return 0;
1112 }
1113
1114 #endif
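The comparison harness above sweeps the width and height from a multiple of SIMD_X_BLOCK_SIZE / SIMD_Y_BLOCK_SIZE up to a few pixels past it, so the SIMD remainder paths are exercised as well as the full-vector path. As a rough illustration of how a row tail can be written without storing past the end of the row — a sketch assuming AVX2, not code taken from the patch:

    #include <immintrin.h>

    // Sketch (AVX2 assumed): store only the first n floats of an 8-lane vector.
    // Lanes with index < n get an all-ones mask and are written; the others are
    // skipped, so nothing is stored beyond the end of the row.
    static inline void StoreRowTail(float *dst, __m256 v, int n /* 1..7 */)
    {
        const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);
        _mm256_maskstore_ps(dst, mask, v);
    }

A masked load of the same tail would follow the identical pattern with _mm256_maskload_ps.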
1115
1116 void FilterGaussian::render_cairo(FilterSlot &slot)
1117 {
1118 cairo_surface_t *in = slot.getcairo(_input);
@@ -645,22 +1212,97 @@
1212 }
1213 cairo_surface_flush(downsampled);
1214
1215 SimpleImage<uint8_t> im((uint8_t *)cairo_image_surface_get_data(downsampled), cairo_image_surface_get_stride(downsampled));
1216
1217 // 2D filter benefits
1218 // 1. intermediate image may have higher precision than uint8
1219 // 2. reduced cache pollution from useless flushing of intermediate image to memory
1220 if (scr_len_x > 0 && scr_len_y > 0 && use_IIR_x == use_IIR_y && GaussianBlurIIR_Y8 != NULL)
1221 {
1222 if (fmt == CAIRO_FORMAT_ARGB32) {
1223 if (use_IIR_x) {
1224 GaussianBlurIIR_R8G8B8A8(im, // out
1225 im, // in
1226 cairo_image_surface_get_width(downsampled),
1227 cairo_image_surface_get_height(downsampled),
1228 deviation_x, deviation_y);
1229 }
1230 else {
1231 GaussianBlurFIR_R8G8B8A8(im, // out
1232 im, // in
1233 cairo_image_surface_get_width(downsampled),
1234 cairo_image_surface_get_height(downsampled),
1235 deviation_x, deviation_y);
1236 }
1237 }
1238 else {
1239 if (use_IIR_x) {
1240 GaussianBlurIIR_Y8(im,
1241 im,
1242 cairo_image_surface_get_width(downsampled),
1243 cairo_image_surface_get_height(downsampled),
1244 deviation_x, deviation_y);
1245 }
1246 else {
1247 GaussianBlurFIR_Y8(im,
1248 im,
1249 cairo_image_surface_get_width(downsampled),
1250 cairo_image_surface_get_height(downsampled),
1251 deviation_x, deviation_y);
1252 }
1253 }
1254 }
1255 else
1256 {
1257 if (scr_len_x > 0) {
-    if (use_IIR_x) {
1258 if (use_IIR_x && GaussianBlurIIR_Y8 != NULL) {
-    gaussian_pass_IIR(Geom::X, deviation_x, downsampled, downsampled, tmpdata, threads);
1259 if (fmt == CAIRO_FORMAT_ARGB32)
1260 {
1261 GaussianBlurHorizontalIIR_R8G8B8A8(im, // out
1262 im, // in
1263 cairo_image_surface_get_width(downsampled),
1264 cairo_image_surface_get_height(downsampled),
1265 deviation_x, false);
1266 }
1267 else {
1268 GaussianBlurHorizontalIIR_Y8(im, // out
1269 im, // in
1270 cairo_image_surface_get_width(downsampled),
1271 cairo_image_surface_get_height(downsampled),
1272 deviation_x, false);
1273 //gaussian_pass_IIR(Geom::X, deviation_x, downsampled, downsampled, tmpdata, threads);
1274 }
1275 } else {
1276 // optimized 1D FIR filter can't work in-place
1277 gaussian_pass_FIR(Geom::X, deviation_x, downsampled, downsampled, threads);
1278 }
1279 }
1280
1281 if (scr_len_y > 0) {
-    if (use_IIR_y) {
1282 if (use_IIR_y && GaussianBlurIIR_Y8 != NULL) {
-    gaussian_pass_IIR(Geom::Y, deviation_y, downsampled, downsampled, tmpdata, threads);
1283 if (fmt == CAIRO_FORMAT_ARGB32)
1284 {
1285 GaussianBlurVerticalIIR(im, // out
1286 im, // in
1287 cairo_image_surface_get_width(downsampled) * 4,
1288 cairo_image_surface_get_height(downsampled),
1289 deviation_y);
1290 }
1291 else
1292 {
1293 GaussianBlurVerticalIIR(im, // out
1294 im, // in
1295 cairo_image_surface_get_width(downsampled),
1296 cairo_image_surface_get_height(downsampled),
1297 deviation_y);
1298 //gaussian_pass_IIR(Geom::Y, deviation_y, downsampled, downsampled, tmpdata, threads);
1299 }
1300 } else {
1301 // optimized 1D FIR filter can't work in-place
1302 gaussian_pass_FIR(Geom::Y, deviation_y, downsampled, downsampled, threads);
1303 }
1304 }
-
1305 }
1306 // free the temporary data
1307 if ( use_IIR_x || use_IIR_y ) {
1308 for(int i = 0; i < threads; ++i) {