Merge lp:~simdgenius/inkscape/inkscape into lp:~inkscape.dev/inkscape/trunk
Status: Needs review
Proposed branch: lp:~simdgenius/inkscape/inkscape
Merge into: lp:~inkscape.dev/inkscape/trunk
Diff against target: 4832 lines (+4736/-8), 3 files modified:
src/display/SimpleImage.h (+80/-0)
src/display/gaussian_blur_templates.h (+4006/-0)
src/display/nr-filter-gaussian.cpp (+650/-8)
To merge this branch: bzr merge lp:~simdgenius/inkscape/inkscape
Related bugs:
Reviewer | Review Type | Date Requested | Status
---|---|---|---
Mc | Abstain | |
Review via email: mp+307251@code.launchpad.net
Commit message
Description of the change
This vectorized version of Gaussian blur gives a speedup of about 5x for IIR filtering (more is expected) and ~20x for FIR on a modern processor supporting AVX2.
Searching the mailing lists, users have been complaining about blurs being very slow since the beginning of Inkscape. I think Jasper's use of recursive filters (IIR) was a big help, and so was OpenMP multithreading. But more is better, especially for artists who can't afford to be interrupted by slow screen refreshes.
The code is a monstrosity in terms of size, but it kind of has to be, given all the different cases ({FIR, IIR} x {int16, float, double} x {grayscale, RGBA}). Plus, it needs to support SSE2 (any x86_64 CPU has SSE2), AVX, and AVX2 through runtime dispatch. So I worry about maintainability.
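For illustration, a minimal sketch of the kind of runtime dispatch this implies, using the GCC/clang builtin __builtin_cpu_supports(); the kernel names here are hypothetical stand-ins, not the functions in this branch:

#include <cstdint>

// Hypothetical kernels (illustrative only; the real ones are the templated
// blur routines in this branch, compiled per instruction set).
void blur_sse2(uint8_t *img, int w, int h);  // baseline: every x86_64 CPU has SSE2
void blur_avx (uint8_t *img, int w, int h);
void blur_avx2(uint8_t *img, int w, int h);

typedef void (*BlurKernel)(uint8_t *, int, int);

// Pick the widest instruction set the CPU supports, once, at startup.
BlurKernel selectBlurKernel()
{
    if (__builtin_cpu_supports("avx2")) return blur_avx2;
    if (__builtin_cpu_supports("avx"))  return blur_avx;
    return blur_sse2;
}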
I think I can rewrite maybe 1/2 of the code to use GCC vector extensions instead of direct intrinsics. There was a suggestion from Boudewijn Rempt to use the Vc library that Krita uses, but I don't see much benefit: you can't expect to efficiently abstract all the SIMD operations, especially the ones that operate horizontally across vector lanes (e.g. dot products, horizontal adds, permutes/shuffles).
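To make that trade-off concrete, a small sketch (mine, not code from the branch): lane-wise arithmetic gets ordinary operators with GCC vector extensions, but a horizontal reduction still needs explicit intrinsics.

#include <immintrin.h>

// Lane-wise math with GCC/clang vector extensions: plain operators, no intrinsics.
typedef float v4sf __attribute__((vector_size(16)));  // four floats = one SSE register

v4sf axpy(v4sf a, v4sf x, v4sf y)
{
    return a * x + y;  // applied per lane
}

// A horizontal operation (sum across lanes) has no portable operator,
// so the intrinsics come right back.
float horizontalSum(__m128 v)
{
    __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));  // lanes (0+2, 1+3, ...)
    t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));     // + lane 1
    return _mm_cvtss_f32(t);
}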
The other concern is how SIMD remainders are handled. The code currently uses partial vector stores so that it never writes past the end of an image, so it should never crash. But it does whole-vector loads that go past the end all the time. This will render memory-checking tools like AddressSanitizer and Valgrind useless from all the false positives. Ideally, I'd like all image rows to be padded to 16 or 32 bytes.
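For comparison, a fully masked remainder looks something like the following sketch (AVX, illustrative names; not the code in this branch), which neither reads nor writes past the end of a row, at the cost of building a mask:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Scale one image row by k without ever touching memory beyond row[n-1].
void scaleRow(float *row, size_t n, float k)
{
    __m256 vk = _mm256_set1_ps(k);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)  // full 8-float vectors
        _mm256_storeu_ps(row + i, _mm256_mul_ps(_mm256_loadu_ps(row + i), vk));
    if (i < n) {                // remainder of 1..7 floats
        alignas(32) int32_t lanes[8] = {0};
        for (size_t j = 0; j < n - i; j++)
            lanes[j] = -1;      // enable only the valid lanes
        __m256i mask = _mm256_load_si256((const __m256i *)lanes);
        __m256 x = _mm256_maskload_ps(row + i, mask);  // masked load: no over-read
        _mm256_maskstore_ps(row + i, mask, _mm256_mul_ps(x, vk));
    }
}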
-------
*further speed up IIR - the current cost is ~44 cycles/pixel for RGBA. This is much slower than the lower bound of 7 cycles/pixel = (4 multiplies + 3 adds) x 2 passes / (2 pixels/iteration for AVX), assuming 1 instruction/cycle; see the worked bound after this list.
*rewrite the stand-alone unit test/benchmark as a Google Test (gtest).
*document the functions with an SVG animation
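For reference, the lower bound quoted in the first item works out as:

$$\frac{(4\ \text{multiplies} + 3\ \text{adds})\ \text{per pass} \times 2\ \text{passes}}{2\ \text{RGBA pixels per AVX iteration}} = 7\ \text{instructions/pixel} \approx 7\ \text{cycles/pixel at 1 instruction/cycle.}$$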
- 15139. By Martin Owens
-
Remove the reset on the glyph tangent, it breaks text on path (bug lp:1627523)
- 15140. By Tavmjong Bah
-
Update attributes list for rename of 'mesh' to 'meshgradient'.
- 15141. By Martin Owens
-
Add a prune method to saving svg files that removes Adobe's i:pgf tag.
- 15142. By Martin Owens
-
Adjust dock size to minimum width during canvas table size allocation signal.
- 15143. By Martin Owens
-
Merge in the new eraser mode (2) which uses clip instead of cut.
- 15144. By Tavmjong Bah
-
Prevent a crash if a mesh is defined in bounding box coordinates.
- 15145. By Jabiertxof
-
Fix a bug on eraser mode when previous clip are shapes not paths
- 15146. By Martin Owens
-
Merge in jabiertxof's hover information for measure tool
- 15147. By Mc
-
xverbs branch merge by dmitry-zhulanov
Add somes "xverbs" that are like verbs but can take arguments, and yaml file parsing for batch commandline operations. - 15148. By Martin Owens
-
Merge jabier's Measure line LPE
- 15149. By Jabiertxof
-
Fix bug:1630821 on LPE selected nodes
- 15150. By Jabiertxof
-
Fix bug:1630796 on flatten button
- 15151. By Jabiertxof
-
Fix bug:1605334 FeImage X and Y position
- 15152. By Jabiertxof
-
Fix bug:1630796 on flatten button. Attempt 2
- 15153. By Jabiertxof
-
Remove unnecessary header from last push
- 15154. By Jabiertxof
-
Fix bug:1622321 on powerstroke
- 15155. By Sandra Snan
-
[Bug #770681] KEY MAPPING: Comma and period hijacked by scaling.
- 15156. By Mc
-
merge trunk-refactoring
- 15157. By Tavmjong Bah
-
Provide simple "preview" for mesh gradients.
- 15158. By Liam P. White
-
Fix palette flickering, probably.
- 15159. By Mc
-
merge a fix and refactor
- 15160. By victor.westmann
-
[Packaging] NSIS translation update for pt_br
- 15161. By victor.westmann
-
[Bug #1627166] Brazilian Portuguese translation for 0.92.
- 15162. By Sveinn í Felli
-
[Bug #1426423] Updated Icelandic translation for 0.92.
- 15163. By Tavmjong Bah
-
Render mesh gradients that reference other mesh gradients.
- 15164. By marenhachmann
-
[Bug #1630635] Wrong tool tip for new text line height setting.
- 15165. By Tavmjong Bah
-
Better handling of mesh gradients in Paint Selector dialog.
- 15166. By Tavmjong Bah
-
Remove unused/undefined function.
- 15167. By Tavmjong Bah
-
Ensure newly created meshes have correct 'gradientUnits'.
- 15168. By Tavmjong Bah
-
Do not create unused "vector" gradient when creating mesh gradient.
- 15169. By Tavmjong Bah
-
Code cleanup: simplify initial mesh color calculation.
- 15170. By Jabiertxof
-
Fix bug:1633521 on powerstroke
- 15171. By Tavmjong Bah
-
Implement copying of objects with mesh gradients.
- 15172. By victor.westmann
-
[Bug #1627166] Brazilian Portuguese translation for 0.92.
- 15173. By Tavmjong Bah
-
Add option to scale mesh to fit in bounding box.
- 15174. By jazzynico
-
[Bug #1633999] xcf export fails if layer names contain non-ASCII characters.
- 15175. By Tavmjong Bah
-
Use geometric bounding box for fill, visual for stroke in creating mesh.
- 15176. By Mc
-
update author list in about dialog from AUTHORS file
- 15177. By Tavmjong Bah
-
Implement 'vector-effect' value 'non-scaling-stroke'. No GUI yet.
- 15178. By FirasH
-
[Bug #1574561] Italian translation update.
- 15179. By Jabiertxof
-
Fix bug:1634641 crash on delete
- 15180. By Mc
-
Fix CMake dependency order
- 15181. By Tavmjong Bah
-
Add 'vector-effect' to attributes test.
- 15182. By Kris
-
cosmetic change
- 15183. By Mc
-
Fix gradient comparison.
- 15184. By Mc
-
update translators
- 15185. By marenhachmann
-
[Bug #1635332] Update for German translation.
- 15186. By Jabiertxof <email address hidden>
-
Fix bug#1635442
- 15187. By Patrick Storz
-
CMake: inkscape.com needs the "-mconsole" linker flag to be useful
- 15188. By Jordi Mas
-
[Bug #1636086] Update Catalan translation for Inkscape 0.92.
- 15189. By Mc
-
CPPification: almost all sp_object_set_whatever and sp_selection_whatever global functions are now methods of ObjectSet*, with these additional benefits:
- They can now act on any SelectionSet, not just the current selection;
- Whenever possible, they don't need a desktop anymore and can run even when not called from the GUI.
I hope I did not break too many things in the process.
*: So instead of calling sp_selection_move(desktop, x, y), you call myobjectset->move(x, y).
- 15190. By Mc
-
Fix test
- 15191. By Mc
-
Fix signals
- 15192. By Mc
-
Prevent image drag/drop from grouping
- 15193. By Mc
-
Fix regression in loop prevention
- 15194. By Jordi Mas
-
[Bug #1636086] Update Catalan translation for Inkscape 0.92.
- 15195. By Mc
-
allows for denser screens in zoom correction factors
- 15196. By FirasH
-
[Bug #1574561] Italian translation update.
- 15197. By Jabiertxof
-
Close the bounding box path LPE
- 15198. By Jabiertxof
-
Fix fill between many LPE to start up with current path
- 15199. By Jabiertxof
-
Update branding folder
- 15200. By Jabiertxof
-
Fix bug:1013141 crash deleting LPE
- 15201. By houz
-
fix none color in palettes with scrollbars
- 15202. By houz
-
fix prefs icon
- 15203. By Mc
-
Add some unit tests for object-set cppification
- 15204. By Mc
-
Revert two changes from r15177
- 15205. By Mc
-
Fix crash in some commandline usage
- 15206. By Jabiertxof
-
Execution of update_po_files.sh
- 15207. By Tavmjong Bah
-
Render meshes with old syntax using camelCase.
- 15208. By Jabiertxof
-
Reformat branding folder
- 15209. By Jabiertxof
-
Update branding folder, remove fonts
- 15210. By Jabiertxof
-
Fix bug:1639083 crash closing segment with shortcut LPE
- 15211. By Jabiertxof
-
Fix bug:1639098
- 15212. By Jabiertxof
-
Fix change between multiple LPEs in the same item
- 15213. By Jabiertxof
-
Fix a bug when duplicating an item with multiple LPEs on it; previously the LPEs became "clones" if there was more than one LPE on the item.
We also need to discuss what happens to LPEs copied inside a group: fork them or clone them? Currently they are cloned.
This can be a feature or a bug for the same user in different workflows. My proposal is to fork, and to add an item to "Paste LPEs" to allow cloned LPEs on paste.
- 15214. By Jabiertxof
-
Fix last commit not working, LPE are cloned on copies
- 15215. By Jabiertxof
-
Move a header place
- 15216. By Jabiertxof
-
Fix bug on applying bend LPE from pen/cil without clipboard; previously nothing happened
- 15217. By Jabiertxof
-
Minor tweak
- 15218. By Mc
-
further cppification
- 15219. By Jabiertxof
-
Fix some bugs on pen/cil dropdown shapes
- 15220. By Mc
-
merge recursive unlink clones branch
- 15221. By mpasteven
-
fix cursor on big endian systems
- 15222. By Jabiertxof
-
1639832 Blend and blur unexpected results
- 15223. By Mc
-
annotate custom builds, and add correct revno into make dist tarballs
- 15224. By Mattia Rizzolo
-
reproducible builds
- 15225. By jazzynico
-
[Bug #262341] Tooltips for LPE tool modes do not show up as translated.
- 15226. By suv-lp
-
[Bug #1638472] Quadrant points of ellipse/circle fail to snap (as source or target).
- 15227. By gordcaswell
-
[Bug #1639081] recently-used.xbel remaining when run portably.
- 15228. By FirasH
-
[Bug #1574561] Italian translation update.
- 15229. By Tavmjong Bah
-
Improve mesh handling in Fill and Stroke dialog.
Create new meshes with alternating color/white pattern
(makes it more obvious a mesh has been created).
- 15230. By Tavmjong Bah
-
Enable swapping of fill and stroke when one is a mesh.
- 15231. By Tavmjong Bah
-
Click-drag selects nodes rather than creates new mesh if mesh already exists.
- 15232. By Mc
-
merge boolop branch: Move boolop functions from sp_selected_path_<op> to ObjectSet::path<op>
- 15233. By Mc
-
resizable undocked dialogs
- 15234. By Jordi Mas
-
[Bug #1636086] Update Catalan translation for Inkscape 0.92.
- 15235. By mathog
-
patch for bug 1405292, start clipping with COPY instead of OR so GDI clipping works
- 15236. By Mc
-
fix test
- 15237. By Mc
-
fix automatic dockbar resizing
- 15238. By Mc
-
Fix filter editor update
- 15239. By Mc
-
Add a make inkscape_pot to regen potfile
- 15240. By Mc
-
Fix selection toolbar icons missing on start
- 15241. By Mc
-
Fix rare crash on undo break apart
- 15242. By Mc
-
update potfile
- 15243. By Tavmjong Bah
-
Fit to bounding box: correct transform when mesh has a non-identity gradient transform.
- 15244. By Mc
-
fix build
- 15245. By Patrick Storz
-
Packaging: Merge all fixes from 0.92.x branch for NSIS and WiX installers (Windows .exe and .msi)
- 15246. By Patrick Storz
-
Tutorials: Rename image files to follow "name.lang_id.ext" scheme
(Allows NSIS installer to autodetect localized files per locale and adjust installation components accordingly)
- 15247. By Yuri Chornoivan
-
[Bug #1407331] Ukrainian translation update for 0.92.
- 15248. By Sylvain Chiron
-
Translations. French translation update.
- 15249. By jazzynico
-
[Bug #1590529] Italian Updates for inkscape docs (0.92.x)
- 15250. By Tavmjong Bah
-
Implement tweaking of mesh handle colors.
- 15251. By Tavmjong Bah
-
Split selected rows/columns in half using Insert key.
- 15252. By Jordi Mas
-
[Bug #1636086] Update Catalan translation for Inkscape 0.92.
- 15253. By Tavmjong Bah
-
Ensure getVector() and getArray() return a valid gradient pointer.
- 15254. By Tavmjong Bah
-
Do not return invalid vector gradient when switching from mesh to linear/radial gradient.
- 15255. By Tavmjong Bah
-
Fix path outline function for meshes with nrow != ncolumn.
- 15256. By Tavmjong Bah
-
Fix status bar messages for meshes and gradients.
- 15257. By Tavmjong Bah
-
Remove debug line from last commit.
- 15258. By Tavmjong Bah
-
Another fix for the status bar with mesh gradients.
- 15259. By Jabiertxof
-
Fix #1627817. Bug in knot LPE
- 15260. By Tavmjong Bah
-
Improve mouse handling for mesh:
* Double clicking an object will create a new mesh if one does not exist,
otherwise clicking a line should now reliably divide the row/column.
* Click-dragging will create a new mesh if one does not exist,
otherwise it will do a rubberband selection of corner nodes.
With Shift will add nodes, without will replace selected nodes.
- 15261. By su_v
-
Add Shift-I shortcut for insert node.
- 15262. By Tavmjong Bah
-
Preserve selection of corner nodes for some corner operations.
- 15263. By Jabiertxof
-
Fix #1643408. Bug in pap LPE
- 15264. By Tavmjong Bah
-
Keep corner nodes selected when possible for corner operations.
- 15265. By gordcaswell
-
[Bug #1643730] Inkscape Portable language selection not maintained.
- 15266. By helix84
-
Fix a typo in inkscape-preferences.cpp.
- 15267. By Tavmjong Bah
-
Select mesh nodes by clicking on control lines.
- 15268. By helix84
-
* [INTL:zh_TW] Traditional Chinese translation update
- 15269. By Lucas Vieites
-
[Bug #1643818] Updated es.po for 0.92.
- 15270. By FirasH
-
[Bug #1574561] Italian translation update.
- 15271. By Tavmjong Bah
-
Remove deprecated GtkWidget:wide-separators, which is ignored as of 3.20.
- 15272. By Tavmjong Bah
-
Remove deprecated GtkWidget-separator-height, ignored as of 3.20.
- 15273. By Tavmjong Bah
-
Provide a way to update a legacy document to account for the 90 to 96 dpi change.
This method relies on setting the 'viewBox'.
- 15274. By Jabiertxof <jtx@jtx>
-
Add stroke dash empty to allow render only fills and markers. Tested in FF and Chromium
- 15275. By scootergrisen
-
[Bug #1644934] Translation to danish.
- 15276. By Patrick Storz
-
Translations/Packaging: Convert Danish translation to UTF8
- 15277. By jazzynico
-
[Bug #1644886] Color profiles not loaded on Windows (partial fix).
- 15278. By Patrick Storz
-
CMake: Add ${INKSCAPE_SHARE_INSTALL}
This is set to "share/inkscape" by default; on Windows, however, we need to be able to install directly into "share".
- 15279. By Patrick Storz
-
CMake: Explicitly call python
At least on Windows this breaks if Python is not associated with .py files (and even if it is, an arbitrary Python version that might be installed on the system is used)
- 15280. By Patrick Storz
-
Remove unneeded "#include <arpa/inet.h>" in "cairo-utils.cpp"
- 15281. By jazzynico
-
[Bug #1641111] extension Visualize Path/Measure path... fails
- 15282. By scootergrisen
-
[Bug #1471443] Updated danish translation for 0.92.
- 15283. By Mc
-
update filter list when pasting and on import
- 15284. By Jabiertxof
-
Fixes transforms bug in measure line LPE pointed out on IRC by CR and suv
- 15285. By Jabiertxof
-
Reorganize SVG structure to have a clean measure line structure
- 15286. By Tavmjong Bah
-
Give mesh corner nodes a different color from handle nodes (following node tool coloring).
- 15287. By FirasH
-
[Bug #1574561] Italian translation update.
- 15288. By Tavmjong Bah
-
Fix bug with mesh handle update when corner moved via keys.
- 15289. By Tavmjong Bah
-
Add toggles for handle visibility, editing fill, and editing stroke.
- 15290. By Tavmjong Bah
-
Ensure new mesh is immediately editable.
- 15291. By Alexander Brock <email address hidden>
-
Improve precision of offset_cubic
- 15292. By Mc
-
prevent use of string concat for compatibility with old cmake
- 15293. By Jabiertxof
-
Add triangle knot.
- 15294. By Jabiertxof <jtx@jtx>
-
Improvements and fixes for bugs pointed out by suv on measure line LPE
- 15295. By Jabiertxof <jtx@jtx>
-
Fix a typo
- 15296. By Jabiertxof
-
Enable node resizing in mesh tool.
- 15297. By Jabiertxof <jtx@jtx>
-
Fix names in measure line LPE
- 15298. By Jabiertxof
-
Highlight mesh handles when corner or handle selected.
Highlight mesh control lines when corner/handle hovered over.
- 15299. By FirasH
-
[Bug #1574561] Italian translation update.
- 15300. By Tavmjong Bah
-
Fix memory leak (incomplete clear).
- 15301. By Jabiertxof
-
Add dpiswitcher extension and option to scale legacy documents with it.
- 15302. By Jabiertxof <jtx@jtx>
-
Fixes for measure LPE and speed path based LPE operations
- 15303. By Jabiertxof <jtx@jtx>
-
Remove obsolete comment
- 15304. By Jabiertxof <jtx@jtx>
-
Fix measure LPE to fit future extra objects based LPE
- 15305. By Jabiertxof
-
fix bug #1644621 on show handles
- 15306. By Jabiertxof
-
fix bug #1644621 on show handles. Fix start knot on closed paths
- 15307. By Tavmjong Bah
-
Add option to save a backup when updating file for dpi change.
- 15308. By Jabiertxof
-
Fix for reopened bug #1643408
- 15309. By Jabiertxof
-
Fix for reopened bug #1643408 - minor typo fix
- 15310. By Jabiertxof <jtx@jtx>
-
Improve measure line to allow similar LPE
- 15311. By Tavmjong Bah
-
Correct error messages.
- 15312. By Tavmjong Bah
-
Improve working of Type (Smoothing) menu.
- 15313. By Jabiertxof <jtx@jtx>
-
'upport' changes to LPE's rotate copies and mirror symmetry
- 15314. By Tavmjong Bah
-
Don't add fortify source flag for debug builds. Avoids tons of warnings.
- 15315. By Tavmjong Bah
-
Add button to access outer text style ('font-size', 'line-height'). These determine the minimum line spacing.
- 15316. By Tavmjong Bah
-
Correct outer text style input for non-px based files.
- 15317. By Tavmjong Bah
-
Fix a bug where initially text has no fill but has a stroke.
- 15318. By Jabiertxof <jtx@jtx>
-
Fix headers on LPE's
- 15319. By Tavmjong Bah
-
Fix line-height when converting between different units for flowed text.
- 15320. By Jabiertxof <jtx@jtx>
-
Apply suv patch to handle containers https://bugs.launchpad.net/inkscape/+bug/1389723/comments/95
- 15321. By Tavmjong Bah
-
Add option to unset 'line-height' (as well as determine where it is set).
- 15322. By Tavmjong Bah
-
Add missing 'pt' unit to test of legacy files.
- 15323. By Jabiertxof
-
Apply su_v patch to DPISwitcher: https://launchpadlibrarian.net/297886893/0000-fix-dpiswitcher-scaling-v1.diff
- 15324. By Tavmjong Bah
-
Add test internal scaling to account for DPI change.
- 15325. By Tavmjong Bah
-
Fixes for internal document scaling and add a second test option.
- 15326. By Tavmjong Bah
-
Fix crash from last commit due to bad preference path.
- 15327. By Tavmjong Bah
-
Save state of backup button.
- 15328. By Tavmjong Bah
-
Prevent crash when iterator invalidated after converting shape to path.
- 15329. By Tavmjong Bah
-
Fix bug where conical gradient drawn in wrong arc.
- 15330. By Jabiertxof <jtx@jtx>
-
Fix a bug on transforms in mirror symmetry
- 15331. By Jabiertxof <jtx@jtx>
-
Use Geom::Reflection instead of a custom method in copy rotate and mirror LPE
- 15332. By Jabiertxof <jtx@jtx>
-
Some coding style fixes
- 15333. By Jabiertxof
-
Remove 'desktop' usage on measure line LPE
- 15334. By Jabiertxof
-
Add translatable strings to trunk
- 15335. By Jabiertxof
-
Add update_helperpaths as a non-member of the node tool class for easy calling from outside
- 15336. By Jabiertxof
-
Remove unneeded static var from previous commit
- 15337. By Jabiertxof <jtx@jtx>
-
Remove some occurrences of desktop in knot functions
- 15338. By Jabiertxof <jtx@jtx>
-
Fix strings in mirror symmetry and copy rotate LPE
- 15339. By Jabiertxof <jtx@jtx>
-
Remove string from tip
- 15340. By Jabiertxof
-
Fix undo inconsistencies in mirror LPE and in copy rotate LPE
- 15341. By Jabiertxof
-
Regenerate PO files for new translations
- 15342. By Yuri Chornoivan
-
Translation update for Ukranian
- 15343. By Yuri Chornoivan
-
[Bug #1407331] Ukrainian translation update for 0.92.
- 15344. By Jabiertxof <jtx@jtx>
-
Remove more SPDesktop from LPE's
- 15345. By Jabiertxof <jtx@jtx>
-
Add string translatable pointed by Maren
- 15346. By Jabiertxof <jtx@jtx>
-
Update po and pot
- 15347. By Jabiertxof
-
Update pot files. For some strange reason intltool generates the output as untitled.pot instead of inkscape.pot; not sure how to fix.
- 15348. By Jabiertxof
-
Update .pot file generated with cmake and the resulting po. add info to update_po_files.sh added by Mc
- 15349. By Yale Zhang
-
q
Jabiertxof (jabiertxof) wrote:
Applies the diff cleanly. Gives compile errors: https:/
Yale Zhang (simdgenius) wrote:
Jabier, thanks for reviewing. The compile error is because _mm_cvtsi64_m64() isn't supported for i686. I can fix it and replace as much of the SSE intrinsics code as possible with GCC vector extensions over the next few days.
For now, can you test on x86-64?
-yale
Jabiertxof (jabiertxof) wrote:
Hi Yale, sorry for the delay. I was able to compile this weekend at home on 64-bit Debian.
In terms of usage I only did a quick try, because I needed to revert the changes to fix a bug. In that little experience blurs go very fast; some portions sometimes don't render, but that's not your problem, it's usual in trunk. I confirm that at high zoom levels it's still very slow.
Thanks for your great work; ping me if you want any help or a specific test.
Yale Zhang (simdgenius) wrote:
Appreciate your testing. Anything for me to do before it can be accepted?
On my side, I still need to do 2 things:
1. rewrite as much of the SSE intrinsics code as possible with GCC vector extensions
2. change the way vector remainders are handled - currently it loads
past the end of arrays, but never writes past the end. But even this
can crash, which I've seen a few times.
"some portions not render sometimes but not your problem Is usual in trunk"
Yes, I've noticed it too. It happens when the filter quality isn't set
to maximum.
Jabiertxof (jabiertxof) wrote:
Hi Yale.
You also need to handle 32-bit platforms, and re-read Tav's message to the list about adding more comments to your code; he is a much more advanced dev than me.
Also, "some portions not render sometimes but not your problem Is usual in trunk" is only my opinion; maybe I'm not right and it could be part of your code, so please make sure. My intention with that comment was to point out that this is not the only time white gaps in rendering have happened, but your comment that they disappear at high quality makes me want to recheck it properly.
Certainly I'm not the best man to review your code, and if I approved it I would be speaking in the dark, which is not good. In my opinion you need to reply to Tav's message and try to get approval from him.
Cheers and thanks for your hard work, Jabier.
Jabiertxof (jabiertxof) wrote:
Also, you can drop a line to check whether 0.93 has support for 32-bit OSes. I think XP is dropped, but there are others.
Mc (mc...) wrote:
I had a look at the code and asked a friend more familiar than me with those low-level calls, and it looks like it's doing sane things.
-> If it can compile on {linux,
Actually, my main fear about this merge is that if you ever leave and something ever goes wrong in those parts of the code, there may be no one who knows enough about how this works to actually fix stuff...
Marking as "Abstain", but consider it as an "Approve" iff the "If it can compile on all platforms/
Yale Zhang (simdgenius) wrote:
"If it can compile on {linux, {gcc,clang} (and maybe icc, visualstudio)."
How serious are we about supporting Visual Studio? If it's to be supported (possible now that we're using CMake), then I can't rewrite the intrinsics code with GCC vector extensions, since those are only supported by GCC and clang. In theory you could make a C++ class to abstract the vectors, like the Vc library that Krita uses, but I think that's not worth the effort, and you can't expect to abstract all vector operations, especially the ones that operate horizontally across vector lanes (e.g. shuffles).
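For what it's worth, the kind of wrapper class meant here can be sketched in a few lines (illustrative only; this is not Vc's actual API):

#include <immintrin.h>

// Thin operator-overloading wrapper in the spirit of Vc.
struct Vec8f {
    __m256 v;
    Vec8f(__m256 x) : v(x) {}
    friend Vec8f operator+(Vec8f a, Vec8f b) { return Vec8f(_mm256_add_ps(a.v, b.v)); }
    friend Vec8f operator*(Vec8f a, Vec8f b) { return Vec8f(_mm256_mul_ps(a.v, b.v)); }
};

// Vertical math now reads like scalar code: a * x + y. But a lane-crossing
// operation (a dot product, an RGBA channel permute) still has to expose the
// target's shuffle semantics, which is where such abstractions stop paying off.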
"there may be no one knowing enough about how this works to actually
fix stuff..."
Exactly. That's why I want to rewrite as much of the _mm_ intrinsics code as possible with GCC vector extensions.
Unmerged revisions
- 15349. By Yale Zhang
-
q
Preview Diff
=== added file 'src/display/SimpleImage.h'
--- src/display/SimpleImage.h 1970-01-01 00:00:00 +0000
+++ src/display/SimpleImage.h 2016-12-22 06:18:34 +0000
@@ -0,0 +1,80 @@
#ifndef SIMPLE_IMAGE_H
#define SIMPLE_IMAGE_H

#if _MSC_VER
#ifdef _M_IX86
typedef int ssize_t;
#else
typedef __int64 ssize_t;
#endif
#else
#include <stddef.h>
#endif

// a minimal image representation that allows 2D indexing with [][] and that's completely reusable
template <typename AnyType>
class SimpleImage
{
public:
    SimpleImage()
    {
    }
    SimpleImage(AnyType *b, ssize_t p)
    {
        buffer = b;
        pitch = p;
    }
    AnyType *operator[](ssize_t y)
    {
        return (AnyType *)((uint8_t *)buffer + y * pitch);
    }
    SimpleImage<AnyType> SubImage(ssize_t x, ssize_t y)
    {
        return SimpleImage<AnyType>(&(*this)[y][x], pitch);
    }
    AnyType *buffer;
    ssize_t pitch;
};

template <typename IntType>
IntType RoundDown(IntType a, IntType b)
{
    return (a / b) * b;
}

template <typename IntType>
IntType RoundUp(IntType a, IntType b)
{
    return RoundDown(a + b - 1, b);
}

#ifdef _WIN32
#define aligned_alloc(a, s) _aligned_malloc(s, a)
#define aligned_free(x) _aligned_free(x)
#else
#define aligned_free(x) free(x)
#endif

template <typename AnyType, ssize_t alignment>
class AlignedImage : public SimpleImage<AnyType>
{
public:
    AlignedImage()
    {
        this->buffer = NULL;
    }
    void Resize(int width, int height)
    {
        if (this->buffer != NULL)
            aligned_free(this->buffer);
        this->pitch = RoundUp(ssize_t(width * sizeof(AnyType)), alignment);
        this->buffer = (AnyType *)aligned_alloc(alignment, this->pitch * height);
    }
    ~AlignedImage()
    {
        if (this->buffer != NULL)
            aligned_free(this->buffer);
    }
};

#endif
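For orientation, a minimal usage sketch of the SimpleImage/AlignedImage classes added above (my example, not part of the diff; assumes a 32-byte alignment suitable for AVX):

#include <cstdint>
#include "display/SimpleImage.h"  // the header added by this diff

void example(int width, int height)
{
    AlignedImage<uint8_t, 32> img;   // each row padded to a multiple of 32 bytes
    img.Resize(width, height);
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            img[y][x] = 128;         // 2D indexing via operator[]

    SimpleImage<uint8_t> view = img.SubImage(8, 8);  // offset view, same pitch
    view[0][0] = 255;                // writes img[8][8]
}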
86 | === added file 'src/display/gaussian_blur_templates.h' | |||
87 | --- src/display/gaussian_blur_templates.h 1970-01-01 00:00:00 +0000 | |||
88 | +++ src/display/gaussian_blur_templates.h 2016-12-22 06:18:34 +0000 | |||
89 | @@ -0,0 +1,4006 @@ | |||
90 | 1 | #ifdef __SSE3__ | ||
91 | 2 | // not any faster, at least on Haswell? | ||
92 | 3 | #define _mm_loadu_pd(p) _mm_castsi128_pd(_mm_lddqu_si128((__m128i *)(p))) | ||
93 | 4 | #define _mm_loadu_ps(p) _mm_castsi128_ps(_mm_lddqu_si128((__m128i *)(p))) | ||
94 | 5 | #define _mm_loadu_si128(p) _mm_lddqu_si128(p) | ||
95 | 6 | |||
96 | 7 | #define _mm256_loadu_pd(p) _mm256_castsi256_pd(_mm256_lddqu_si256((__m256i *)(p))) | ||
97 | 8 | #define _mm256_loadu_ps(p) _mm256_castsi256_ps(_mm256_lddqu_si256((__m256i *)(p))) | ||
98 | 9 | #define _mm256_loadu_si128(p) _mm256_lddqu_si256(p) | ||
99 | 10 | #else | ||
100 | 11 | #undef _mm_loadu_pd | ||
101 | 12 | #undef _mm_loadu_ps | ||
102 | 13 | #undef _mm_loadu_si128 | ||
103 | 14 | #undef _mm256_loadu_pd | ||
104 | 15 | #undef _mm256_loadu_ps | ||
105 | 16 | #undef _mm256_loadu_si128 | ||
106 | 17 | #endif | ||
107 | 18 | |||
108 | 19 | template <typename AnyType> | ||
109 | 20 | struct MyTraits | ||
110 | 21 | { | ||
111 | 22 | }; | ||
112 | 23 | |||
113 | 24 | template <> | ||
114 | 25 | struct MyTraits<float> | ||
115 | 26 | { | ||
116 | 27 | #ifdef __AVX__ | ||
117 | 28 | typedef __m256 SIMDtype; | ||
118 | 29 | #else | ||
119 | 30 | typedef __m128 SIMDtype; | ||
120 | 31 | #endif | ||
121 | 32 | }; | ||
122 | 33 | |||
123 | 34 | template <> | ||
124 | 35 | struct MyTraits<int16_t> | ||
125 | 36 | { | ||
126 | 37 | #ifdef __AVX2__ | ||
127 | 38 | typedef __m256i SIMDtype; | ||
128 | 39 | #else | ||
129 | 40 | typedef __m128i SIMDtype; | ||
130 | 41 | #endif | ||
131 | 42 | }; | ||
132 | 43 | |||
133 | 44 | template <> | ||
134 | 45 | struct MyTraits<double> | ||
135 | 46 | { | ||
136 | 47 | #ifdef __AVX__ | ||
137 | 48 | typedef __m256d SIMDtype; | ||
138 | 49 | #else | ||
139 | 50 | typedef __m128d SIMDtype; | ||
140 | 51 | #endif | ||
141 | 52 | }; | ||
142 | 53 | |||
143 | 54 | #if defined(__AVX__) && defined(__GNUC__) | ||
144 | 55 | FORCE_INLINE __m256 _mm256_setr_m128(__m128 lo, __m128 hi) | ||
145 | 56 | { | ||
146 | 57 | return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1); | ||
147 | 58 | } | ||
148 | 59 | |||
149 | 60 | FORCE_INLINE __m256i _mm256_setr_m128i(__m128i lo, __m128i hi) | ||
150 | 61 | { | ||
151 | 62 | return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1); | ||
152 | 63 | } | ||
153 | 64 | |||
154 | 65 | FORCE_INLINE __m256d _mm256_setr_m128d(__m128d lo, __m128d hi) | ||
155 | 66 | { | ||
156 | 67 | return _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1); | ||
157 | 68 | } | ||
158 | 69 | #endif | ||
159 | 70 | |||
160 | 71 | #ifdef __FMA__ | ||
161 | 72 | //#pragma GCC push_options | ||
162 | 73 | //#pragma GCC target("fma") | ||
163 | 74 | FORCE_INLINE __m128 MultiplyAdd(__m128 a, __m128 b, __m128 c) | ||
164 | 75 | { | ||
165 | 76 | return _mm_fmadd_ps(a, b, c); | ||
166 | 77 | } | ||
167 | 78 | |||
168 | 79 | FORCE_INLINE __m256 MultiplyAdd(__m256 a, __m256 b, __m256 c) | ||
169 | 80 | { | ||
170 | 81 | return _mm256_fmadd_ps(a, b, c); | ||
171 | 82 | } | ||
172 | 83 | |||
173 | 84 | FORCE_INLINE __m256d MultiplyAdd(__m256d a, __m256d b, __m256d c) | ||
174 | 85 | { | ||
175 | 86 | return _mm256_fmadd_pd(a, b, c); | ||
176 | 87 | } | ||
177 | 88 | //#pragma GCC pop_options | ||
178 | 89 | #endif | ||
179 | 90 | |||
180 | 91 | #ifndef __GNUC__ | ||
181 | 92 | FORCE_INLINE __m128d operator + (__m128d a, __m128d b) | ||
182 | 93 | { | ||
183 | 94 | return _mm_add_pd(a, b); | ||
184 | 95 | } | ||
185 | 96 | |||
186 | 97 | FORCE_INLINE __m128d operator - (__m128d a, __m128d b) | ||
187 | 98 | { | ||
188 | 99 | return _mm_sub_pd(a, b); | ||
189 | 100 | } | ||
190 | 101 | |||
191 | 102 | FORCE_INLINE __m128d operator * (__m128d a, __m128d b) | ||
192 | 103 | { | ||
193 | 104 | return _mm_mul_pd(a, b); | ||
194 | 105 | } | ||
195 | 106 | |||
196 | 107 | FORCE_INLINE __m256d operator + (__m256d a, __m256d b) | ||
197 | 108 | { | ||
198 | 109 | return _mm256_add_pd(a, b); | ||
199 | 110 | } | ||
200 | 111 | |||
201 | 112 | FORCE_INLINE __m256d operator - (__m256d a, __m256d b) | ||
202 | 113 | { | ||
203 | 114 | return _mm256_sub_pd(a, b); | ||
204 | 115 | } | ||
205 | 116 | |||
206 | 117 | FORCE_INLINE __m256d operator * (__m256d a, __m256d b) | ||
207 | 118 | { | ||
208 | 119 | return _mm256_mul_pd(a, b); | ||
209 | 120 | } | ||
210 | 121 | |||
211 | 122 | FORCE_INLINE __m128 operator + (__m128 a, __m128 b) | ||
212 | 123 | { | ||
213 | 124 | return _mm_add_ps(a, b); | ||
214 | 125 | } | ||
215 | 126 | |||
216 | 127 | FORCE_INLINE __m128 operator - (__m128 a, __m128 b) | ||
217 | 128 | { | ||
218 | 129 | return _mm_sub_ps(a, b); | ||
219 | 130 | } | ||
220 | 131 | |||
221 | 132 | FORCE_INLINE __m128 operator * (__m128 a, __m128 b) | ||
222 | 133 | { | ||
223 | 134 | return _mm_mul_ps(a, b); | ||
224 | 135 | } | ||
225 | 136 | |||
226 | 137 | FORCE_INLINE __m256 operator + (__m256 a, __m256 b) | ||
227 | 138 | { | ||
228 | 139 | return _mm256_add_ps(a, b); | ||
229 | 140 | } | ||
230 | 141 | |||
231 | 142 | FORCE_INLINE __m256 operator - (__m256 a, __m256 b) | ||
232 | 143 | { | ||
233 | 144 | return _mm256_sub_ps(a, b); | ||
234 | 145 | } | ||
235 | 146 | |||
236 | 147 | FORCE_INLINE __m256 operator * (__m256 a, __m256 b) | ||
237 | 148 | { | ||
238 | 149 | return _mm256_mul_ps(a, b); | ||
239 | 150 | } | ||
240 | 151 | #endif | ||
241 | 152 | |||
242 | 153 | #ifdef __AVX__ | ||
243 | 154 | FORCE_INLINE float ExtractElement0(__m256 x) | ||
244 | 155 | { | ||
245 | 156 | return _mm_cvtss_f32(_mm256_castps256_ps128(x)); | ||
246 | 157 | } | ||
247 | 158 | |||
248 | 159 | FORCE_INLINE double ExtractElement0(__m256d x) | ||
249 | 160 | { | ||
250 | 161 | return _mm_cvtsd_f64(_mm256_castpd256_pd128(x)); | ||
251 | 162 | } | ||
252 | 163 | #endif | ||
253 | 164 | |||
254 | 165 | FORCE_INLINE float ExtractElement0(__m128 x) | ||
255 | 166 | { | ||
256 | 167 | return _mm_cvtss_f32(x); | ||
257 | 168 | } | ||
258 | 169 | |||
259 | 170 | FORCE_INLINE double ExtractElement0(__m128d x) | ||
260 | 171 | { | ||
261 | 172 | return _mm_cvtsd_f64(x); | ||
262 | 173 | } | ||
263 | 174 | |||
264 | 175 | template<int SIZE> | ||
265 | 176 | static void calcTriggsSdikaInitialization(double const M[N*N], float uold[N][SIZE], float const uplus[SIZE], float const vplus[SIZE], float const alpha, float vold[N][SIZE]) | ||
266 | 177 | { | ||
267 | 178 | __m128 v4f_alpha = _mm_set1_ps(alpha); | ||
268 | 179 | ssize_t c; | ||
269 | 180 | for (c = 0; c + 4 <= SIZE; c += 4) | ||
270 | 181 | { | ||
271 | 182 | __m128 uminp[N]; | ||
272 | 183 | for(ssize_t i=0; i<N; i++) | ||
273 | 184 | uminp[i] = _mm_loadu_ps(&uold[i][c]) - _mm_loadu_ps(&uplus[c]); | ||
274 | 185 | |||
275 | 186 | __m128 v4f_vplus = _mm_loadu_ps(&vplus[c]); | ||
276 | 187 | |||
277 | 188 | for(ssize_t i=0; i<N; i++) | ||
278 | 189 | { | ||
279 | 190 | __m128 voldf = _mm_setzero_ps(); | ||
280 | 191 | for(ssize_t j=0; j<N; j++) | ||
281 | 192 | { | ||
282 | 193 | voldf = voldf + uminp[j] * _mm_set1_ps(M[i*N+j]); | ||
283 | 194 | } | ||
284 | 195 | // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled) | ||
285 | 196 | // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient | ||
286 | 197 | // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient. | ||
287 | 198 | _mm_storeu_ps(&vold[i][c], voldf * v4f_alpha + v4f_vplus); | ||
288 | 199 | } | ||
289 | 200 | } | ||
290 | 201 | while (c < SIZE) | ||
291 | 202 | { | ||
292 | 203 | double uminp[N]; | ||
293 | 204 | for(ssize_t i=0; i<N; i++) uminp[i] = uold[i][c] - uplus[c]; | ||
294 | 205 | for(ssize_t i=0; i<N; i++) { | ||
295 | 206 | double voldf = 0; | ||
296 | 207 | for(ssize_t j=0; j<N; j++) { | ||
297 | 208 | voldf += uminp[j]*M[i*N+j]; | ||
298 | 209 | } | ||
299 | 210 | // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled) | ||
300 | 211 | // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient | ||
301 | 212 | // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient. | ||
302 | 213 | vold[i][c] = voldf*alpha; | ||
303 | 214 | vold[i][c] += vplus[c]; | ||
304 | 215 | } | ||
305 | 216 | ++c; | ||
306 | 217 | } | ||
307 | 218 | } | ||
308 | 219 | |||
309 | 220 | template<int SIZE> | ||
310 | 221 | static void calcTriggsSdikaInitialization(double const M[N*N], double uold[N][SIZE], double const uplus[SIZE], double const vplus[SIZE], double const alpha, double vold[N][SIZE]) | ||
311 | 222 | { | ||
312 | 223 | __m128d v2f_alpha = _mm_set1_pd(alpha); | ||
313 | 224 | ssize_t c; | ||
314 | 225 | for (c = 0; c <= SIZE - 2; c += 2) | ||
315 | 226 | { | ||
316 | 227 | __m128d uminp[N]; | ||
317 | 228 | for(ssize_t i=0; i<N; i++) | ||
318 | 229 | uminp[i] = _mm_loadu_pd(&uold[i][c]) - _mm_loadu_pd(&uplus[c]); | ||
319 | 230 | |||
320 | 231 | __m128d v2f_vplus = _mm_loadu_pd(&vplus[c]); | ||
321 | 232 | |||
322 | 233 | for(ssize_t i=0; i<N; i++) | ||
323 | 234 | { | ||
324 | 235 | __m128d voldf = _mm_setzero_pd(); | ||
325 | 236 | for(ssize_t j=0; j<N; j++) | ||
326 | 237 | { | ||
327 | 238 | voldf = voldf + uminp[j] * _mm_load1_pd(&M[i*N+j]); | ||
328 | 239 | } | ||
329 | 240 | // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled) | ||
330 | 241 | // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient | ||
331 | 242 | // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient. | ||
332 | 243 | _mm_storeu_pd(&vold[i][c], voldf * v2f_alpha + v2f_vplus); | ||
333 | 244 | } | ||
334 | 245 | } | ||
335 | 246 | while (c < SIZE) | ||
336 | 247 | { | ||
337 | 248 | double uminp[N]; | ||
338 | 249 | for(ssize_t i=0; i<N; i++) uminp[i] = uold[i][c] - uplus[c]; | ||
339 | 250 | for(ssize_t i=0; i<N; i++) { | ||
340 | 251 | double voldf = 0; | ||
341 | 252 | for(ssize_t j=0; j<N; j++) { | ||
342 | 253 | voldf += uminp[j]*M[i*N+j]; | ||
343 | 254 | } | ||
344 | 255 | // Properly takes care of the scaling coefficient alpha and vplus (which is already appropriately scaled) | ||
345 | 256 | // This was arrived at by starting from a version of the blur filter that ignored the scaling coefficient | ||
346 | 257 | // (and scaled the final output by alpha^2) and then gradually reintroducing the scaling coefficient. | ||
347 | 258 | vold[i][c] = voldf*alpha; | ||
348 | 259 | vold[i][c] += vplus[c]; | ||
349 | 260 | } | ||
350 | 261 | ++c; | ||
351 | 262 | } | ||
352 | 263 | } | ||
353 | 264 | |||
354 | 265 | FORCE_INLINE __m128i PartialVectorMask(ssize_t n) | ||
355 | 266 | { | ||
356 | 267 | return _mm_loadu_si128((__m128i *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]); | ||
357 | 268 | } | ||
358 | 269 | |||
359 | 270 | #ifdef __AVX__ | ||
360 | 271 | FORCE_INLINE __m256i PartialVectorMask32(ssize_t n) | ||
361 | 272 | { | ||
362 | 273 | return _mm256_loadu_si256((__m256i *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]); | ||
363 | 274 | } | ||
364 | 275 | #endif | ||
365 | 276 | |||
366 | 277 | #if !defined(_WIN32) && !defined(_MSC_VER) | ||
367 | 278 | // using _mm_maskmove_si64() is preferable to _mm_maskmoveu_si128(), but for some reason on Windows, it causes memory corruption | ||
368 | 279 | // could it be due to mixing x87 and MMX? | ||
369 | 280 | #define CAN_USE_MMX | ||
370 | 281 | #endif | ||
371 | 282 | |||
372 | 283 | #ifdef CAN_USE_MMX | ||
373 | 284 | // return __m64 so that it can be used by _mm_movemask_si64() | ||
374 | 285 | FORCE_INLINE __m64 PartialVectorMask8(ssize_t n) | ||
375 | 286 | { | ||
376 | 287 | return _mm_cvtsi64_m64(*(int64_t *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]); | ||
377 | 288 | } | ||
378 | 289 | #else | ||
379 | 290 | FORCE_INLINE __m128i PartialVectorMask8(ssize_t n) | ||
380 | 291 | { | ||
381 | 292 | return _mm_loadl_epi64((__m128i *)&PARTIAL_VECTOR_MASK[sizeof(PARTIAL_VECTOR_MASK) / 2 - n]); | ||
382 | 293 | } | ||
383 | 294 | #endif | ||
384 | 295 | |||
385 | 296 | #ifdef __AVX__ | ||
386 | 297 | FORCE_INLINE __m256d LoadDoubles(__m256d &out, double *x) | ||
387 | 298 | { | ||
388 | 299 | return out = _mm256_loadu_pd(x); | ||
389 | 300 | } | ||
390 | 301 | |||
391 | 302 | FORCE_INLINE __m256d LoadDoubles(__m256d &out, float *x) | ||
392 | 303 | { | ||
393 | 304 | return out = _mm256_cvtps_pd(_mm_loadu_ps(x)); | ||
394 | 305 | } | ||
395 | 306 | |||
396 | 307 | FORCE_INLINE __m256d LoadDoubles(__m256d &out, uint8_t *x) | ||
397 | 308 | { | ||
398 | 309 | return out = _mm256_cvtepi32_pd(_mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(int32_t *)x))); | ||
399 | 310 | } | ||
400 | 311 | |||
401 | 312 | FORCE_INLINE __m256d LoadDoubles(__m256d &out, uint16_t *x) | ||
402 | 313 | { | ||
403 | 314 | return out = _mm256_cvtepi32_pd(_mm_cvtepu16_epi32(_mm_loadl_epi64((__m128i *)x))); | ||
404 | 315 | } | ||
405 | 316 | |||
406 | 317 | FORCE_INLINE __m256 LoadFloats(__m256 &out, float *x) // seriously? compiler needs to be told to inline this when PIC on? | ||
407 | 318 | { | ||
408 | 319 | return out = _mm256_loadu_ps(x); | ||
409 | 320 | } | ||
410 | 321 | |||
411 | 322 | FORCE_INLINE __m256 LoadFloats(__m256 &out, uint8_t *x) | ||
412 | 323 | { | ||
413 | 324 | __m128i temp = _mm_loadl_epi64((__m128i *)x); | ||
414 | 325 | #ifdef __AVX2__ | ||
415 | 326 | out = _mm256_cvtepi32_ps(_mm256_cvtepu8_epi32(temp)); | ||
416 | 327 | #else | ||
417 | 328 | out = _mm256_cvtepi32_ps(_mm256_setr_m128i(_mm_cvtepu8_epi32(temp), _mm_cvtepu8_epi32(_mm_shuffle_epi32(temp, _MM_SHUFFLE(0, 0, 0, 1))))); | ||
418 | 329 | #endif | ||
419 | 330 | return out; | ||
420 | 331 | } | ||
421 | 332 | |||
422 | 333 | FORCE_INLINE __m256 LoadFloats(__m256 &out, uint16_t *x) | ||
423 | 334 | { | ||
424 | 335 | __m128i temp = _mm_loadu_si128((__m128i *)x); | ||
425 | 336 | __m256i i32; | ||
426 | 337 | #ifdef __AVX2__ | ||
427 | 338 | i32 = _mm256_cvtepu16_epi32(temp); | ||
428 | 339 | #else | ||
429 | 340 | __m128i zero = _mm_setzero_si128(); | ||
430 | 341 | i32 = _mm256_setr_m128i(_mm_unpacklo_epi16(temp, zero), _mm_unpackhi_epi16(temp, zero)); | ||
431 | 342 | #endif | ||
432 | 343 | return out = _mm256_cvtepi32_ps(i32); | ||
433 | 344 | } | ||
434 | 345 | |||
435 | 346 | template <bool partial = false> // no, this parameter isn't redundant - without it, there will be a redundant n == 4 check when partial = 0 | ||
436 | 347 | FORCE_INLINE void StoreDoubles(double *out, __m256d x, ssize_t n = 4) | ||
437 | 348 | { | ||
438 | 349 | if (partial) | ||
439 | 350 | _mm256_maskstore_pd(out, PartialVectorMask32(n * sizeof(double)), x); | ||
440 | 351 | else | ||
441 | 352 | _mm256_storeu_pd(out, x); | ||
442 | 353 | } | ||
443 | 354 | |||
444 | 355 | template <bool partial = false> | ||
445 | 356 | FORCE_INLINE void StoreDoubles(float *out, __m256d x, ssize_t n = 4) | ||
446 | 357 | { | ||
447 | 358 | __m128 f32 = _mm256_cvtpd_ps(x); | ||
448 | 359 | if (partial) | ||
449 | 360 | _mm_maskstore_ps(out, PartialVectorMask(n * sizeof(float)), f32); | ||
450 | 361 | else | ||
451 | 362 | _mm_storeu_ps(out, f32); | ||
452 | 363 | } | ||
453 | 364 | |||
454 | 365 | template <bool partial = false> | ||
455 | 366 | FORCE_INLINE void StoreDoubles(uint16_t *out, __m256d x, ssize_t n = 4) | ||
456 | 367 | { | ||
457 | 368 | __m128i i32 = _mm256_cvtpd_epi32(x), | ||
458 | 369 | u16 = _mm_packus_epi32(i32, i32); | ||
459 | 370 | if (partial) | ||
460 | 371 | { | ||
461 | 372 | #ifdef CAN_USE_MMX | ||
462 | 373 | _mm_maskmove_si64(_mm_movepi64_pi64(u16), PartialVectorMask8(n * sizeof(int16_t)), (char *)out); | ||
463 | 374 | #else | ||
464 | 375 | _mm_maskmoveu_si128(u16, PartialVectorMask8(n * sizeof(int16_t)), (char *)out); | ||
465 | 376 | #endif | ||
466 | 377 | } | ||
467 | 378 | else | ||
468 | 379 | _mm_storel_epi64((__m128i *)out, u16); | ||
469 | 380 | } | ||
470 | 381 | |||
471 | 382 | template <bool partial = false> | ||
472 | 383 | FORCE_INLINE void StoreDoubles(uint8_t *out, __m256d x, ssize_t n = 4) | ||
473 | 384 | { | ||
474 | 385 | __m128i i32 = _mm256_cvtpd_epi32(x), | ||
475 | 386 | u16 = _mm_packus_epi32(i32, i32), | ||
476 | 387 | u8 = _mm_packus_epi16(u16, u16); | ||
477 | 388 | if (partial) | ||
478 | 389 | { | ||
479 | 390 | #ifdef CAN_USE_MMX | ||
480 | 391 | _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out); | ||
481 | 392 | #else | ||
482 | 393 | _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out); | ||
483 | 394 | #endif | ||
484 | 395 | } | ||
485 | 396 | else | ||
486 | 397 | *(int32_t *)out = _mm_cvtsi128_si32(u8); | ||
487 | 398 | } | ||
488 | 399 | |||
489 | 400 | FORCE_INLINE void StoreDoubles(uint8_t *out, __m256d x) | ||
490 | 401 | { | ||
491 | 402 | __m128i vInt = _mm_cvtps_epi32(_mm256_cvtpd_ps(x)); | ||
492 | 403 | *(int32_t *)out = _mm_cvtsi128_si32(_mm_packus_epi16(_mm_packus_epi32(vInt, vInt), vInt)); | ||
493 | 404 | } | ||
494 | 405 | |||
495 | 406 | template <bool partial = false> // no, this parameter isn't redundant - without it, there will be a redundant n == 8 check when partial = 0 | ||
496 | 407 | FORCE_INLINE void StoreFloats(float *out, __m256 x, ssize_t n = 8) | ||
497 | 408 | { | ||
498 | 409 | if (partial) | ||
499 | 410 | _mm256_maskstore_ps(out, PartialVectorMask32(n * sizeof(float)), x); | ||
500 | 411 | else | ||
501 | 412 | _mm256_storeu_ps(out, x); | ||
502 | 413 | } | ||
503 | 414 | |||
504 | 415 | template <bool partial = false> | ||
505 | 416 | FORCE_INLINE void StoreFloats(uint16_t *out, __m256 x, ssize_t n = 8) | ||
506 | 417 | { | ||
507 | 418 | __m256i i32 = _mm256_cvtps_epi32(x); | ||
508 | 419 | __m128i u16 = _mm_packus_epi32(_mm256_castsi256_si128(i32), _mm256_extractf128_si256(i32, 1)); | ||
509 | 420 | if (partial) | ||
510 | 421 | _mm_maskmoveu_si128(u16, PartialVectorMask(n * sizeof(int16_t)), (char *)out); | ||
511 | 422 | else | ||
512 | 423 | _mm_storeu_si128((__m128i *)out, u16); | ||
513 | 424 | } | ||
514 | 425 | |||
515 | 426 | template <bool partial = false> | ||
516 | 427 | FORCE_INLINE void StoreFloats(uint8_t *out, __m256 x, ssize_t n = 8) | ||
517 | 428 | { | ||
518 | 429 | __m256i i32 = _mm256_cvtps_epi32(x); | ||
519 | 430 | __m128i i32Hi = _mm256_extractf128_si256(i32, 1), | ||
520 | 431 | u16 = _mm_packus_epi32(_mm256_castsi256_si128(i32), i32Hi), | ||
521 | 432 | u8 = _mm_packus_epi16(u16, u16); | ||
522 | 433 | if (partial) | ||
523 | 434 | { | ||
524 | 435 | #ifdef CAN_USE_MMX | ||
525 | 436 | _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out); | ||
526 | 437 | #else | ||
527 | 438 | _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out); | ||
528 | 439 | #endif | ||
529 | 440 | } | ||
530 | 441 | else | ||
531 | 442 | _mm_storel_epi64((__m128i *)out, u8); | ||
532 | 443 | } | ||
533 | 444 | #endif | ||
534 | 445 | |||
535 | 446 | #ifdef __AVX__ | ||
536 | 447 | FORCE_INLINE __m256 BroadcastSIMD(__m256 &out, float x) | ||
537 | 448 | { | ||
538 | 449 | return out = _mm256_set1_ps(x); | ||
539 | 450 | } | ||
540 | 451 | |||
541 | 452 | FORCE_INLINE __m256d BroadcastSIMD(__m256d &out, double x) | ||
542 | 453 | { | ||
543 | 454 | return out = _mm256_set1_pd(x); | ||
544 | 455 | } | ||
545 | 456 | |||
546 | 457 | FORCE_INLINE __m256i BroadcastSIMD(__m256i &out, int16_t x) | ||
547 | 458 | { | ||
548 | 459 | return out = _mm256_set1_epi16(x); | ||
549 | 460 | } | ||
550 | 461 | #endif | ||
551 | 462 | |||
552 | 463 | FORCE_INLINE __m128 BroadcastSIMD(__m128 &out, float x) | ||
553 | 464 | { | ||
554 | 465 | return out = _mm_set1_ps(x); | ||
555 | 466 | } | ||
556 | 467 | |||
557 | 468 | FORCE_INLINE __m128d BroadcastSIMD(__m128d &out, double x) | ||
558 | 469 | { | ||
559 | 470 | return out = _mm_set1_pd(x); | ||
560 | 471 | } | ||
561 | 472 | |||
562 | 473 | FORCE_INLINE __m128i BroadcastSIMD(__m128i &out, int16_t x) | ||
563 | 474 | { | ||
564 | 475 | return out = _mm_set1_epi16(x); | ||
565 | 476 | } | ||
566 | 477 | |||
567 | 478 | |||
568 | 479 | FORCE_INLINE __m128 LoadFloats(__m128 &out, float *x) | ||
569 | 480 | { | ||
570 | 481 | return out = _mm_loadu_ps(x); | ||
571 | 482 | } | ||
572 | 483 | |||
573 | 484 | FORCE_INLINE __m128 LoadFloats(__m128 &out, uint8_t *x) | ||
574 | 485 | { | ||
575 | 486 | __m128i u8 = _mm_cvtsi32_si128(*(int32_t *)x), | ||
576 | 487 | i32; | ||
577 | 488 | #ifdef __SSE4_1__ | ||
578 | 489 | i32 = _mm_cvtepu8_epi32(u8); | ||
579 | 490 | #else | ||
580 | 491 | __m128i zero = _mm_setzero_si128(); | ||
581 | 492 | i32 = _mm_unpacklo_epi16(_mm_unpacklo_epi8(u8, zero), zero); | ||
582 | 493 | #endif | ||
583 | 494 | return out = _mm_cvtepi32_ps(i32); | ||
584 | 495 | } | ||
585 | 496 | |||
586 | 497 | FORCE_INLINE __m128 LoadFloats(__m128 &out, uint16_t *x) | ||
587 | 498 | { | ||
588 | 499 | __m128i u16 = _mm_loadl_epi64((__m128i *)x), | ||
589 | 500 | i32; | ||
590 | 501 | #ifdef __SSE4_1__ | ||
591 | 502 | i32 = _mm_cvtepu16_epi32(u16); | ||
592 | 503 | #else | ||
593 | 504 | __m128i zero = _mm_setzero_si128(); | ||
594 | 505 | i32 = _mm_unpacklo_epi16(u16, zero); | ||
595 | 506 | #endif | ||
596 | 507 | return out = _mm_cvtepi32_ps(i32); | ||
597 | 508 | } | ||
598 | 509 | |||
599 | 510 | |||
600 | 511 | template <bool partial = false> // no, this parameter isn't redundant - without it, there will be a redundant n == 4 check when partial = 0 | ||
601 | 512 | FORCE_INLINE void StoreFloats(float *out, __m128 x, ssize_t n = 4) | ||
602 | 513 | { | ||
603 | 514 | if (partial) | ||
604 | 515 | { | ||
605 | 516 | #ifdef __AVX__ | ||
606 | 517 | _mm_maskstore_ps(out, PartialVectorMask(n * sizeof(float)), x); | ||
607 | 518 | #else | ||
608 | 519 | _mm_maskmoveu_si128(_mm_castps_si128(x), PartialVectorMask(n * sizeof(float)), (char *)out); | ||
609 | 520 | #endif | ||
610 | 521 | } | ||
611 | 522 | else | ||
612 | 523 | { | ||
613 | 524 | _mm_storeu_ps(out, x); | ||
614 | 525 | } | ||
615 | 526 | } | ||
616 | 527 | |||
617 | 528 | template <bool partial = false> | ||
618 | 529 | FORCE_INLINE void StoreFloats(uint16_t *out, __m128 x, ssize_t n = 4) | ||
619 | 530 | { | ||
620 | 531 | __m128i i32 = _mm_cvtps_epi32(x), | ||
621 | 532 | #ifdef __SSE4_1__ | ||
622 | 533 | u16 = _mm_packus_epi32(i32, i32); | ||
623 | 534 | #else | ||
624 | 535 | u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()); // can get away with treating as int16 for now | ||
625 | 536 | #endif | ||
626 | 537 | if (partial) | ||
627 | 538 | { | ||
628 | 539 | #ifdef CAN_USE_MMX | ||
629 | 540 | _mm_maskmove_si64(_mm_movepi64_pi64(u16), PartialVectorMask8(n * sizeof(int16_t)), (char *)out); | ||
630 | 541 | #else | ||
631 | 542 | _mm_maskmoveu_si128(u16, PartialVectorMask(n * sizeof(int16_t)), (char *)out); | ||
632 | 543 | #endif | ||
633 | 544 | } | ||
634 | 545 | else | ||
635 | 546 | _mm_storel_epi64((__m128i *)out, u16); | ||
636 | 547 | } | ||
637 | 548 | |||
638 | 549 | template <bool partial = false> | ||
639 | 550 | FORCE_INLINE void StoreFloats(uint8_t *out, __m128 x, ssize_t n = 4) | ||
640 | 551 | { | ||
641 | 552 | __m128i i32 = _mm_cvtps_epi32(x), | ||
642 | 553 | u8 = _mm_packus_epi16(_mm_packs_epi32(i32, i32), i32); // should use packus_epi32, but that's only in SSE4 | ||
643 | 554 | if (partial) | ||
644 | 555 | { | ||
645 | 556 | #ifdef CAN_USE_MMX | ||
646 | 557 | _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out); | ||
647 | 558 | #else | ||
648 | 559 | _mm_maskmoveu_si128(u8, PartialVectorMask(n), (char *)out); | ||
649 | 560 | #endif | ||
650 | 561 | } | ||
651 | 562 | else | ||
652 | 563 | *(int32_t *)out = _mm_cvtsi128_si32(u8); | ||
653 | 564 | } | ||
654 | 565 | |||
655 | 566 | |||
656 | 567 | FORCE_INLINE __m128d LoadDoubles(__m128d &out, double *x) | ||
657 | 568 | { | ||
658 | 569 | return out = _mm_loadu_pd(x); | ||
659 | 570 | } | ||
660 | 571 | |||
661 | 572 | FORCE_INLINE __m128d LoadDoubles(__m128d &out, uint8_t *x) | ||
662 | 573 | { | ||
663 | 574 | __m128i u8 = _mm_cvtsi32_si128(*(uint16_t *)x), | ||
664 | 575 | i32; | ||
665 | 576 | #ifdef __SSE4_1__ | ||
666 | 577 | i32 = _mm_cvtepu8_epi32(u8); | ||
667 | 578 | #else | ||
668 | 579 | __m128i zero = _mm_setzero_si128(); | ||
669 | 580 | i32 = _mm_unpacklo_epi16(_mm_unpacklo_epi8(u8, zero), zero); | ||
670 | 581 | #endif | ||
671 | 582 | return out = _mm_cvtepi32_pd(i32); | ||
672 | 583 | } | ||
673 | 584 | |||
674 | 585 | FORCE_INLINE __m128d LoadDoubles(__m128d &out, uint16_t *x) | ||
675 | 586 | { | ||
676 | 587 | __m128i u16 = _mm_cvtsi32_si128(*(uint32_t *)x), | ||
677 | 588 | i32; | ||
678 | 589 | #ifdef __SSE4_1__ | ||
679 | 590 | i32 = _mm_cvtepu16_epi32(u16); | ||
680 | 591 | #else | ||
681 | 592 | __m128i zero = _mm_setzero_si128(); | ||
682 | 593 | i32 = _mm_unpacklo_epi16(u16, zero); | ||
683 | 594 | #endif | ||
684 | 595 | return out = _mm_cvtepi32_pd(i32); | ||
685 | 596 | } | ||
686 | 597 | |||
687 | 598 | template <bool partial = false> | ||
688 | 599 | FORCE_INLINE void StoreDoubles(double *out, __m128d x, ssize_t n = 2) | ||
689 | 600 | { | ||
690 | 601 | if (partial) | ||
691 | 602 | { | ||
692 | 603 | #ifdef __AVX__ | ||
693 | 604 | _mm_maskstore_pd(out, PartialVectorMask(n * sizeof(double)), x); | ||
694 | 605 | #else | ||
695 | 606 | _mm_maskmoveu_si128(_mm_castpd_si128(x), PartialVectorMask(n * sizeof(double)), (char *)out); | ||
696 | 607 | #endif | ||
697 | 608 | } | ||
698 | 609 | else | ||
699 | 610 | { | ||
700 | 611 | _mm_storeu_pd(out, x); | ||
701 | 612 | } | ||
702 | 613 | } | ||
703 | 614 | |||
704 | 615 | template <bool partial = false> | ||
705 | 616 | FORCE_INLINE void StoreDoubles(float *out, __m128d x, ssize_t n = 2) | ||
706 | 617 | { | ||
707 | 618 | __m128 f32 = _mm_cvtpd_ps(x); | ||
708 | 619 | if (partial) | ||
709 | 620 | { | ||
710 | 621 | #ifdef CAN_USE_MMX | ||
711 | 622 | _mm_maskmove_si64(_mm_movepi64_pi64(_mm_castps_si128(f32)), PartialVectorMask8(n * sizeof(float)), (char *)out); | ||
712 | 623 | #else | ||
713 | 624 | _mm_maskmoveu_si128(_mm_castps_si128(f32), PartialVectorMask8(n * sizeof(float)), (char *)out); | ||
714 | 625 | #endif | ||
715 | 626 | } | ||
716 | 627 | else | ||
717 | 628 | { | ||
718 | 629 | _mm_storel_pi((__m64 *)out, f32); | ||
719 | 630 | } | ||
720 | 631 | } | ||
721 | 632 | |||
722 | 633 | template <bool partial = false> | ||
723 | 634 | FORCE_INLINE void StoreDoubles(uint16_t *out, __m128d x, ssize_t n = 2) | ||
724 | 635 | { | ||
725 | 636 | __m128i i32 = _mm_cvtpd_epi32(x), | ||
726 | 637 | #ifdef __SSE4_1__ | ||
727 | 638 | u16 = _mm_packus_epi32(i32, i32); | ||
728 | 639 | #else | ||
729 | 640 | u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()); // can get away with using i16 for now | ||
730 | 641 | #endif | ||
731 | 642 | if (partial) | ||
732 | 643 | { | ||
733 | 644 | #ifdef CAN_USE_MMX | ||
734 | 645 | _mm_maskmove_si64(_mm_movepi64_pi64(u16), PartialVectorMask8(n * sizeof(int16_t)), (char *)out); | ||
735 | 646 | #else | ||
736 | 647 | _mm_maskmoveu_si128(u16, PartialVectorMask8(n * sizeof(int16_t)), (char *)out); | ||
737 | 648 | #endif | ||
738 | 649 | } | ||
739 | 650 | else | ||
740 | 651 | { | ||
741 | 652 | *(uint32_t *)out = _mm_cvtsi128_si32(u16); | ||
742 | 653 | } | ||
743 | 654 | } | ||
744 | 655 | |||
745 | 656 | template <bool partial = false> | ||
746 | 657 | FORCE_INLINE void StoreDoubles(uint8_t *out, __m128d x, ssize_t n = 2) | ||
747 | 658 | { | ||
748 | 659 | __m128i i32 = _mm_cvtpd_epi32(x), | ||
749 | 660 | #ifdef __SSE4_1__ | ||
750 | 661 | u16 = _mm_packus_epi32(i32, i32), | ||
751 | 662 | #else | ||
752 | 663 | u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()), // can get away with using i16 for now | ||
753 | 664 | #endif | ||
754 | 665 | u8 = _mm_packus_epi16(u16, u16); | ||
755 | 666 | |||
756 | 667 | if (partial) | ||
757 | 668 | { | ||
758 | 669 | #ifdef CAN_USE_MMX | ||
759 | 670 | _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out); | ||
760 | 671 | #else | ||
761 | 672 | _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out); | ||
762 | 673 | #endif | ||
763 | 674 | } | ||
764 | 675 | else | ||
765 | 676 | { | ||
766 | 677 | *(uint16_t *)out = _mm_cvtsi128_si32(u8); | ||
767 | 678 | } | ||
768 | 679 | } | ||
769 | 680 | |||
770 | 681 | #ifdef __AVX__ | ||
771 | 682 | FORCE_INLINE __m256 Load4x2Floats(uint8_t *row0, uint8_t *row1) | ||
772 | 683 | { | ||
773 | 684 | return _mm256_cvtepi32_ps(_mm256_setr_m128i(_mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(int32_t *)row0)), | ||
774 | 685 | _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*(int32_t *)row1)))); | ||
775 | 686 | } | ||
776 | 687 | |||
777 | 688 | FORCE_INLINE __m256 Load4x2Floats(uint16_t *row0, uint16_t *row1) | ||
778 | 689 | { | ||
779 | 690 | return _mm256_cvtepi32_ps(_mm256_setr_m128i(_mm_cvtepu16_epi32(_mm_loadl_epi64((__m128i *)row0)), | ||
780 | 691 | _mm_cvtepu16_epi32(_mm_loadl_epi64((__m128i *)row1)))); | ||
781 | 692 | } | ||
782 | 693 | |||
783 | 694 | FORCE_INLINE __m256 Load4x2Floats(float *row0, float *row1) | ||
784 | 695 | { | ||
785 | 696 | return _mm256_setr_m128(_mm_loadu_ps(row0), _mm_loadu_ps(row1)); | ||
786 | 697 | } | ||
787 | 698 | #endif | ||
788 | 699 | |||
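Reviewer note: the Load4x2Floats helpers gather 4 consecutive samples from each of two rows and widen them to float, one row per 128-bit half of the __m256. A sketch of the resulting lane layout (inferred from the intrinsics used; the patch does not state it):

    // Sketch: lane layout produced by Load4x2Floats(row0, row1) for uint8_t input.
    // low half  = { (float)row0[0], (float)row0[1], (float)row0[2], (float)row0[3] }
    // high half = { (float)row1[0], (float)row1[1], (float)row1[2], (float)row1[3] }
    // Packing two rows per register is what lets the AVX paths filter 2 rows at once.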
789 | 700 | FORCE_INLINE __m128i LoadAndScaleToInt16(__m128i &out, uint8_t *x) | ||
790 | 701 | { | ||
791 | 702 | // convert from [0-255] to [0-16383] | ||
792 | 703 | // leave 1 spare bit so that 2 values can be added without overflow for symmetric filters | ||
793 | 704 | __m128i u8 = _mm_loadl_epi64((__m128i *)x), | ||
794 | 705 | i16; | ||
795 | 706 | #ifdef __SSE4_1__ | ||
796 | 707 | i16 = _mm_cvtepu8_epi16(u8); | ||
797 | 708 | #else | ||
798 | 709 | i16 = _mm_unpacklo_epi8(u8, _mm_setzero_si128()); | ||
799 | 710 | #endif | ||
800 | 711 | return out = _mm_slli_epi16(i16, 6); | ||
801 | 712 | } | ||
802 | 713 | |||
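Reviewer note: the headroom claim in the comment above checks out; a one-line worked example (a sketch, assuming the symmetric FIR path adds mirrored taps pairwise before multiplying):

    // 255 << 6 = 16320 <= 16383, i.e. 14 bits; adding two such values gives
    // 16320 + 16320 = 32640 < 32767 = INT16_MAX, so one pairwise add cannot overflow.
    static_assert((255 << 6) * 2 <= 32767, "one spare bit leaves room for one pairwise add");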
803 | 714 | FORCE_INLINE __m128i LoadAndScaleToInt16(__m128i &out, int16_t *x) | ||
804 | 715 | { | ||
805 | 716 | return out = _mm_loadu_si128((__m128i *)x); | ||
806 | 717 | } | ||
807 | 718 | |||
808 | 719 | #ifdef __AVX2__ | ||
809 | 720 | |||
810 | 721 | FORCE_INLINE __m256i LoadAndScaleToInt16(__m256i &out, uint8_t *x) | ||
811 | 722 | { | ||
812 | 723 | // convert from [0-255] to [0-16383] | ||
813 | 724 | // leave 1 spare bit so that 2 values can be added without overflow for symmetric filters | ||
814 | 725 | return out = _mm256_slli_epi16(_mm256_cvtepu8_epi16(_mm_loadu_si128((__m128i *)x)), 6); | ||
815 | 726 | } | ||
816 | 727 | |||
817 | 728 | FORCE_INLINE __m256i LoadAndScaleToInt16(__m256i &out, int16_t *x) | ||
818 | 729 | { | ||
819 | 730 | return out = _mm256_loadu_si256((__m256i *)x); | ||
820 | 731 | } | ||
821 | 732 | |||
822 | 733 | #endif | ||
823 | 734 | |||
824 | 735 | template <bool partial = false> | ||
825 | 736 | FORCE_INLINE void ScaleAndStoreInt16(uint8_t *out, __m128i x, ssize_t n = 8) | ||
826 | 737 | { | ||
827 | 738 | __m128i i16 = _mm_srai_epi16(_mm_adds_epi16(x, _mm_set1_epi16(32)), 6), | ||
828 | 739 | u8 = _mm_packus_epi16(i16, i16); | ||
829 | 740 | if (partial) | ||
830 | 741 | { | ||
831 | 742 | #ifdef CAN_USE_MMX | ||
832 | 743 | _mm_maskmove_si64(_mm_movepi64_pi64(u8), PartialVectorMask8(n), (char *)out); | ||
833 | 744 | #else | ||
834 | 745 | _mm_maskmoveu_si128(u8, PartialVectorMask8(n), (char *)out); | ||
835 | 746 | #endif | ||
836 | 747 | } | ||
837 | 748 | else | ||
838 | 749 | _mm_storel_epi64((__m128i *)out, u8); | ||
839 | 750 | } | ||
840 | 751 | |||
841 | 752 | template <bool partial = false> | ||
842 | 753 | FORCE_INLINE void ScaleAndStoreInt16(int16_t *out, __m128i i16, ssize_t n = 8) | ||
843 | 754 | { | ||
844 | 755 | if (partial) | ||
845 | 756 | _mm_maskmoveu_si128(i16, PartialVectorMask(n * sizeof(int16_t)), (char *)out); | ||
846 | 757 | else | ||
847 | 758 | _mm_storeu_si128((__m128i *)out, i16); | ||
848 | 759 | } | ||
849 | 760 | |||
850 | 761 | #ifdef __AVX2__ | ||
851 | 762 | |||
852 | 763 | template <bool partial = false> | ||
853 | 764 | FORCE_INLINE void ScaleAndStoreInt16(uint8_t *out, __m256i x, ssize_t n = 16) | ||
854 | 765 | { | ||
855 | 766 | __m256i i16 = _mm256_srai_epi16(_mm256_adds_epi16(x, _mm256_set1_epi16(32)), 6); | ||
856 | 767 | __m128i u8 = _mm256_castsi256_si128(_mm256_packus_epi16(i16, _mm256_permute2f128_si256(i16, i16, 1))); | ||
857 | 768 | if (partial) | ||
858 | 769 | _mm_maskmoveu_si128(u8, PartialVectorMask(n), (char *)out); | ||
859 | 770 | else | ||
860 | 771 | _mm_storeu_si128((__m128i *)out, u8); | ||
861 | 772 | } | ||
862 | 773 | |||
863 | 774 | template <bool partial = false> | ||
864 | 775 | FORCE_INLINE void ScaleAndStoreInt16(int16_t *out, __m256i i16, ssize_t n = 16) | ||
865 | 776 | { | ||
866 | 777 | if (partial) | ||
867 | 778 | { | ||
868 | 779 | _mm_maskmoveu_si128(_mm256_castsi256_si128(i16), PartialVectorMask(min(ssize_t(8), n) * sizeof(int16_t)), (char *)out); | ||
869 | 780 | _mm_maskmoveu_si128(_mm256_extractf128_si256(i16, 1), PartialVectorMask(max(ssize_t(0), n - 8) * sizeof(int16_t)), (char *)&out[8]); | ||
870 | 781 | } | ||
871 | 782 | else | ||
872 | 783 | _mm256_storeu_si256((__m256i *)out, i16); | ||
873 | 784 | } | ||
874 | 785 | #endif | ||
875 | 786 | |||
876 | 787 | // selectors are doubles to avoid int-float domain transition | ||
877 | 788 | FORCE_INLINE __m128d Select(__m128d a, __m128d b, __m128d selectors) | ||
878 | 789 | { | ||
879 | 790 | #ifdef __SSE4_1__ | ||
880 | 791 | return _mm_blendv_pd(a, b, selectors); | ||
881 | 792 | #else | ||
882 | 793 | return _mm_or_pd(_mm_andnot_pd(selectors, a), _mm_and_pd(selectors, b)); | ||
883 | 794 | #endif | ||
884 | 795 | } | ||
885 | 796 | |||
886 | 797 | // selectors are floats to avoid int-float domain transition | ||
887 | 798 | FORCE_INLINE __m128 Select(__m128 a, __m128 b, __m128 selectors) | ||
888 | 799 | { | ||
889 | 800 | #ifdef __SSE4_1__ | ||
890 | 801 | return _mm_blendv_ps(a, b, selectors); | ||
891 | 802 | #else | ||
892 | 803 | return _mm_or_ps(_mm_andnot_ps(selectors, a), _mm_and_ps(selectors, b)); | ||
893 | 804 | #endif | ||
894 | 805 | } | ||
895 | 806 | |||
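Reviewer note: the pre-SSE4.1 fallback in both Select overloads is the classic bitwise blend. A scalar sketch (illustrative only; per-lane selector bits must be all-ones or all-zeros for this to act as a select):

    #include <cstdint>

    // Sketch: blendv(a, b, sel) == (a & ~sel) | (b & sel), applied bitwise per lane.
    static inline uint64_t select_sketch(uint64_t a, uint64_t b, uint64_t sel)
    {
        return (a & ~sel) | (b & sel);
    }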
896 | 807 | // even these simple ops need to be redeclared for each SIMD architecture due to VEX and non-VEX encodings of SSE instructions | ||
897 | 808 | #ifdef __AVX__ | ||
898 | 809 | FORCE_INLINE __m128 Cast256To128(__m256 v) | ||
899 | 810 | { | ||
900 | 811 | return _mm256_castps256_ps128(v); | ||
901 | 812 | } | ||
902 | 813 | |||
903 | 814 | FORCE_INLINE __m128d Cast256To128(__m256d v) | ||
904 | 815 | { | ||
905 | 816 | return _mm256_castpd256_pd128(v); | ||
906 | 817 | } | ||
907 | 818 | FORCE_INLINE __m128i Cast256To128(__m256i v) | ||
908 | 819 | { | ||
909 | 820 | return _mm256_castsi256_si128(v); | ||
910 | 821 | } | ||
911 | 822 | #endif | ||
912 | 823 | |||
913 | 824 | FORCE_INLINE __m128 Cast256To128(__m128 v) | ||
914 | 825 | { | ||
915 | 826 | return v; | ||
916 | 827 | } | ||
917 | 828 | |||
918 | 829 | FORCE_INLINE __m128d Cast256To128(__m128d v) | ||
919 | 830 | { | ||
920 | 831 | return v; | ||
921 | 832 | } | ||
922 | 833 | |||
923 | 834 | FORCE_INLINE __m128i Cast256To128(__m128i v) | ||
924 | 835 | { | ||
925 | 836 | return v; | ||
926 | 837 | } | ||
927 | 838 | |||
928 | 839 | |||
929 | 840 | // does 1D IIR convolution on multiple rows (height) of data | ||
930 | 841 | // IntermediateType must be float or double | ||
931 | 842 | template <bool transposeOut, bool isForwardPass, bool isBorder, int channels, typename OutType, typename InType, typename IntermediateType> | ||
932 | 843 | FORCE_INLINE void Convolve1DHorizontalRef(SimpleImage <OutType> out, | ||
933 | 844 | SimpleImage <InType> in, | ||
934 | 845 | IntermediateType *borderValues, // [y][color] | ||
935 | 846 | ssize_t xStart, ssize_t xEnd, ssize_t width, ssize_t height, | ||
936 | 847 | typename MyTraits<IntermediateType>::SIMDtype *vCoefficients, double M[N * N]) | ||
937 | 848 | { | ||
938 | 849 | ssize_t xStep = isForwardPass ? 1 : -1; | ||
939 | 850 | |||
940 | 851 | ssize_t y = 0; | ||
941 | 852 | do | ||
942 | 853 | { | ||
943 | 854 | ssize_t c = 0; | ||
944 | 855 | do | ||
945 | 856 | { | ||
946 | 857 | IntermediateType prevOut[N]; | ||
947 | 858 | ssize_t x = xStart; | ||
948 | 859 | if (isBorder && !isForwardPass) | ||
949 | 860 | { | ||
950 | 861 | // xStart must be width - 1 | ||
951 | 862 | IntermediateType u[N + 1][1]; // u[0] = last forward filtered value, u[1] = 2nd last forward filtered value, ... | ||
952 | 863 | for (ssize_t i = 0; i < N + 1; ++i) | ||
953 | 864 | { | ||
954 | 865 | u[i][0] = in[y][(xStart + i * xStep) * channels + c]; | ||
955 | 866 | } | ||
956 | 867 | IntermediateType backwardsInitialState[N][1]; | ||
957 | 868 | calcTriggsSdikaInitialization<1>(M, u, &borderValues[y * channels + c], &borderValues[y * channels + c], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
958 | 869 | for (ssize_t i = 0; i < N; ++i) | ||
959 | 870 | prevOut[i] = backwardsInitialState[i][0]; | ||
960 | 871 | |||
961 | 872 | if (transposeOut) | ||
962 | 873 | out[x][y * channels + c] = clip_round_cast<OutType, IntermediateType>(prevOut[0]); | ||
963 | 874 | else | ||
964 | 875 | out[y][x * channels + c] = clip_round_cast<OutType, IntermediateType>(prevOut[0]); | ||
965 | 876 | x += xStep; | ||
966 | 877 | if (x == xEnd) | ||
967 | 878 | goto nextIteration; // do early check here so that we can still use do-while for forward pass | ||
968 | 879 | } | ||
969 | 880 | else if (isBorder && isForwardPass) | ||
970 | 881 | { | ||
971 | 882 | for (ssize_t i = 0; i < N; ++i) | ||
972 | 883 | prevOut[i] = in[y][0 * channels + c]; | ||
973 | 884 | } | ||
974 | 885 | else | ||
975 | 886 | { | ||
976 | 887 | for (ssize_t i = 0; i < N; ++i) | ||
977 | 888 | { | ||
978 | 889 | prevOut[i] = transposeOut ? out[xStart - (i + 1) * xStep][y * channels + c] | ||
979 | 890 | : out[y][(xStart - (i + 1) * xStep) * channels + c]; | ||
980 | 891 | } | ||
981 | 892 | } | ||
982 | 893 | |||
983 | 894 | do | ||
984 | 895 | { | ||
985 | 896 | IntermediateType sum = prevOut[0] * ExtractElement0(vCoefficients[1]) | ||
986 | 897 | + prevOut[1] * ExtractElement0(vCoefficients[2]) | ||
987 | 898 | + prevOut[2] * ExtractElement0(vCoefficients[3]) | ||
988 | 899 | + in[y][x * channels + c] * ExtractElement0(vCoefficients[0]); // add last for best accuracy since this term tends to be the smallest | ||
989 | 900 | if (transposeOut) | ||
990 | 901 | out[x][y * channels + c] = clip_round_cast<OutType, IntermediateType>(sum); | ||
991 | 902 | else | ||
992 | 903 | out[y][x * channels + c] = clip_round_cast<OutType, IntermediateType>(sum); | ||
993 | 904 | prevOut[2] = prevOut[1]; | ||
994 | 905 | prevOut[1] = prevOut[0]; | ||
995 | 906 | prevOut[0] = sum; | ||
996 | 907 | x += xStep; | ||
997 | 908 | } while (x != xEnd); | ||
998 | 909 | ++c; | ||
999 | 910 | } while (c < channels); | ||
1000 | 911 | nextIteration: | ||
1001 | 912 | ++y; | ||
1002 | 913 | } while (y < height); | ||
1003 | 914 | } | ||
1004 | 915 | |||
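Reviewer note: the reference routine above makes the recurrence easy to audit before reading the SIMD variants. In scalar form, with b0..b3 standing in for vCoefficients[0..3] (the names are mine, not the patch's):

    // Sketch of one step of the 3rd-order recursive (IIR) Gaussian, forward direction:
    //   y[x] = b1*y[x-1] + b2*y[x-2] + b3*y[x-3] + b0*in[x]
    double iir_step_sketch(double in_x, const double prevOut[3], const double b[4])
    {
        // b0*in[x] is added last because it tends to be the smallest term (see comment above)
        return b[1] * prevOut[0] + b[2] * prevOut[1] + b[3] * prevOut[2] + b[0] * in_x;
    }
    // The backward pass runs the same recurrence with x decreasing, seeded by
    // calcTriggsSdikaInitialization from the last forward outputs near the border.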
1005 | 916 | template <int channels, bool transposeOut, ssize_t xStep, int i0, int i1, int i2, typename OutType, typename InType> | ||
1006 | 917 | FORCE_INLINE void DoOneIIR(SimpleImage<OutType> out, SimpleImage<InType> in, __m256d &vSum, __m256d &vIn, ssize_t x, ssize_t y, __m256d vCoefficients[N + 1], __m256d prevOut[N]) | ||
1007 | 918 | { | ||
1008 | 919 | vSum = vIn * vCoefficients[0]; | ||
1009 | 920 | LoadDoubles(vIn, &in[y][(x + xStep) * channels]); // load data for next iteration early to hide latency (software pipelining) | ||
1010 | 921 | |||
1011 | 922 | // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3] | ||
1012 | 923 | #ifdef __FMA__ | ||
1013 | 924 | // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster | ||
1014 | 925 | vSum = MultiplyAdd(prevOut[i2], vCoefficients[3], vSum) + MultiplyAdd(prevOut[i1], vCoefficients[2], prevOut[i0] * vCoefficients[1]); | ||
1015 | 926 | #else | ||
1016 | 927 | vSum = prevOut[i0] * vCoefficients[1] | ||
1017 | 928 | + prevOut[i1] * vCoefficients[2] | ||
1018 | 929 | + prevOut[i2] * vCoefficients[3] | ||
1019 | 930 | + vIn * vCoefficients[0]; | ||
1020 | 931 | #endif | ||
1021 | 932 | if (transposeOut) | ||
1022 | 933 | StoreDoubles(&out[x][y * channels], vSum); | ||
1023 | 934 | else | ||
1024 | 935 | StoreDoubles(&out[y][x * channels], vSum); | ||
1025 | 936 | } | ||
1026 | 937 | |||
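Reviewer note: on the FMA grouping used in DoOneIIR and in the main loops below, the "shorter critical path" claim can be made concrete (a sketch, writing fma(a,b,c) for MultiplyAdd(a,b,c) = a*b + c):

    // naive chain:   fma(p2,c3, fma(p1,c2, fma(p0,c1, in*c0)))  -> 4 dependent ops deep
    // grouping used: fma(p2,c3, in*c0) + fma(p1,c2, p0*c1)      -> 3 dependent ops deep
    // One more instruction than the maximal-FMA form, but the loop-carried latency drops.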
1027 | 938 | // input is always untransposed | ||
1028 | 939 | // for reverse pass, input is output from forward pass | ||
1029 | 940 | // for transposed output, in-place operation isn't possible | ||
1030 | 941 | // hack: GCC fails to compile when FORCE_INLINE is on, most likely because OpenMP generates code for the default target (SSE2 only) rather than the one defined in #pragma, creating 2 incompatible functions that can't be inlined | ||
1031 | 942 | template <bool transposeOut, bool isForwardPass, bool isBorder, int channels, typename OutType, typename InType, typename SIMD_Type> | ||
1032 | 943 | static /*FORCE_INLINE*/ void Convolve1DHorizontal(SimpleImage<OutType> out, | ||
1033 | 944 | SimpleImage<InType> in, | ||
1034 | 945 | double *borderValues, | ||
1035 | 946 | ssize_t xStart, ssize_t xEnd, ssize_t width, ssize_t height, | ||
1036 | 947 | SIMD_Type *vCoefficients, double M[N * N]) | ||
1037 | 948 | { | ||
1038 | 949 | #if 0 | ||
1039 | 950 | |||
1040 | 951 | Convolve1DHorizontalRef<transposeOut, isForwardPass, isBorder, channels>(out, | ||
1041 | 952 | in, | ||
1042 | 953 | borderValues, | ||
1043 | 954 | xStart, xEnd, width, height, | ||
1044 | 955 | vCoefficients, M); | ||
1045 | 956 | return; | ||
1046 | 957 | |||
1047 | 958 | #endif | ||
1048 | 959 | const ssize_t xStep = isForwardPass ? 1 : -1; | ||
1049 | 960 | if (channels == 4) | ||
1050 | 961 | { | ||
1051 | 962 | #ifdef __AVX__ | ||
1052 | 963 | ssize_t y = 0; | ||
1053 | 964 | do | ||
1054 | 965 | { | ||
1055 | 966 | __m256d prevOut[N]; | ||
1056 | 967 | |||
1057 | 968 | ssize_t x = xStart; | ||
1058 | 969 | if (isBorder && !isForwardPass) | ||
1059 | 970 | { | ||
1060 | 971 | // condition: xStart must be width - 1 | ||
1061 | 972 | double u[N + 1][channels]; //[x][channels] | ||
1062 | 973 | for (ssize_t i = 0; i < N + 1; ++i) | ||
1063 | 974 | { | ||
1064 | 975 | __m256d temp; | ||
1065 | 976 | _mm256_storeu_pd(u[i], LoadDoubles(temp, &in[y][(xStart + i * xStep) * channels])); | ||
1066 | 977 | } | ||
1067 | 978 | double backwardsInitialState[N][channels]; | ||
1068 | 979 | calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
1069 | 980 | for (ssize_t i = 0; i < N; ++i) | ||
1070 | 981 | LoadDoubles(prevOut[i], backwardsInitialState[i]); | ||
1071 | 982 | |||
1072 | 983 | if (transposeOut) | ||
1073 | 984 | StoreDoubles(&out[x][y * channels], prevOut[0]); | ||
1074 | 985 | else | ||
1075 | 986 | StoreDoubles(&out[y][x * channels], prevOut[0]); | ||
1076 | 987 | |||
1077 | 988 | x += xStep; | ||
1078 | 989 | if (x == xEnd) | ||
1079 | 990 | goto nextIteration; | ||
1080 | 991 | } | ||
1081 | 992 | else if (isBorder && isForwardPass) | ||
1082 | 993 | { | ||
1083 | 994 | __m256d firstPixel; | ||
1084 | 995 | LoadDoubles(firstPixel, &in[y][0 * channels]); | ||
1085 | 996 | for (ssize_t i = 0; i < N; ++i) | ||
1086 | 997 | prevOut[i] = firstPixel; | ||
1087 | 998 | } | ||
1088 | 999 | else | ||
1089 | 1000 | { | ||
1090 | 1001 | for (ssize_t i = 0; i < N; ++i) | ||
1091 | 1002 | { | ||
1092 | 1003 | if (transposeOut) | ||
1093 | 1004 | LoadDoubles(prevOut[i], &out[xStart - (i + 1) * xStep][y * channels]); | ||
1094 | 1005 | else | ||
1095 | 1006 | LoadDoubles(prevOut[i], &out[y][(xStart - (i + 1) * xStep) * channels]); | ||
1096 | 1007 | } | ||
1097 | 1008 | } | ||
1098 | 1009 | |||
1099 | 1010 | #if 0 // no measurable speedup | ||
1100 | 1011 | // same as loop below, but unrolled 3 times to increase instruction-level parallelism, hide latency, and reduce the overhead of shifting the sliding window (prevOut) | ||
1101 | 1012 | __m256d vIn; | ||
1102 | 1013 | LoadDoubles(vIn, &in[y][xStart * channels]); | ||
1103 | 1014 | for ( ; isForwardPass ? (x < xEnd - 3) : (x > xEnd + 3); ) | ||
1104 | 1015 | { | ||
1105 | 1016 | __m256d vSum; | ||
1106 | 1017 | DoOneIIR<channels, transposeOut, xStep, 0, 1, 2>(out, in, vSum, vIn, x, y, vCoefficients, prevOut); | ||
1107 | 1018 | prevOut[2] = vSum; | ||
1108 | 1019 | x += xStep; | ||
1109 | 1020 | |||
1110 | 1021 | DoOneIIR<channels, transposeOut, xStep, 2, 0, 1>(out, in, vSum, vIn, x, y, vCoefficients, prevOut); | ||
1111 | 1022 | prevOut[1] = vSum; | ||
1112 | 1023 | x += xStep; | ||
1113 | 1024 | |||
1114 | 1025 | DoOneIIR<channels, transposeOut, xStep, 1, 2, 0>(out, in, vSum, vIn, x, y, vCoefficients, prevOut); | ||
1115 | 1026 | prevOut[0] = vSum; | ||
1116 | 1027 | x += xStep; | ||
1117 | 1028 | } | ||
1118 | 1029 | #endif | ||
1119 | 1030 | while (isForwardPass ? (x < xEnd) : (x > xEnd)) | ||
1120 | 1031 | { | ||
1121 | 1032 | __m256d vIn, vSum; | ||
1122 | 1033 | LoadDoubles(vIn, &in[y][x * channels]); | ||
1123 | 1034 | |||
1124 | 1035 | // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3] | ||
1125 | 1036 | #ifdef __FMA__ | ||
1126 | 1037 | // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster | ||
1127 | 1038 | vSum = MultiplyAdd(vIn, vCoefficients[0], prevOut[2] * vCoefficients[3]) + MultiplyAdd(prevOut[1], vCoefficients[2], prevOut[0] * vCoefficients[1]); | ||
1128 | 1039 | #else | ||
1129 | 1040 | vSum = prevOut[0] * vCoefficients[1] | ||
1130 | 1041 | + prevOut[1] * vCoefficients[2] | ||
1131 | 1042 | + prevOut[2] * vCoefficients[3] | ||
1132 | 1043 | + vIn * vCoefficients[0]; | ||
1133 | 1044 | #endif | ||
1134 | 1045 | if (transposeOut) | ||
1135 | 1046 | StoreDoubles(&out[x][y * channels], vSum); | ||
1136 | 1047 | else | ||
1137 | 1048 | StoreDoubles(&out[y][x * channels], vSum); | ||
1138 | 1049 | |||
1139 | 1050 | prevOut[2] = prevOut[1]; | ||
1140 | 1051 | prevOut[1] = prevOut[0]; | ||
1141 | 1052 | prevOut[0] = vSum; | ||
1142 | 1053 | x += xStep; | ||
1143 | 1054 | } | ||
1144 | 1055 | nextIteration: | ||
1145 | 1056 | ++y; | ||
1146 | 1057 | } while (y < height); | ||
1147 | 1058 | #else | ||
1148 | 1059 | // todo: yuck, find some way to refactor (emulate __m256d with a pair of __m128d, perhaps?) | ||
1149 | 1060 | ssize_t y = 0; | ||
1150 | 1061 | do | ||
1151 | 1062 | { | ||
1152 | 1063 | __m128d prevOut[N][2]; | ||
1153 | 1064 | |||
1154 | 1065 | ssize_t x = xStart; | ||
1155 | 1066 | if (isBorder && !isForwardPass) | ||
1156 | 1067 | { | ||
1157 | 1068 | // condition: xStart must be width - 1 | ||
1158 | 1069 | double u[N + 1][channels]; //[x][channels] | ||
1159 | 1070 | for (ssize_t i = 0; i < N + 1; ++i) | ||
1160 | 1071 | { | ||
1161 | 1072 | __m128d temp; | ||
1162 | 1073 | _mm_storeu_pd(u[i], LoadDoubles(temp, &in[y][(xStart + i * xStep) * channels])); | ||
1163 | 1074 | _mm_storeu_pd(&u[i][2], LoadDoubles(temp, &in[y][(xStart + i * xStep) * channels + 2])); | ||
1164 | 1075 | } | ||
1165 | 1076 | double backwardsInitialState[N][channels]; | ||
1166 | 1077 | calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
1167 | 1078 | for (ssize_t i = 0; i < N; ++i) | ||
1168 | 1079 | { | ||
1169 | 1080 | LoadDoubles(prevOut[i][0], backwardsInitialState[i]); | ||
1170 | 1081 | LoadDoubles(prevOut[i][1], &backwardsInitialState[i][2]); | ||
1171 | 1082 | } | ||
1172 | 1083 | |||
1173 | 1084 | if (transposeOut) | ||
1174 | 1085 | { | ||
1175 | 1086 | StoreDoubles(&out[x][y * channels], prevOut[0][0]); | ||
1176 | 1087 | StoreDoubles(&out[x][y * channels + 2], prevOut[0][1]); | ||
1177 | 1088 | } | ||
1178 | 1089 | else | ||
1179 | 1090 | { | ||
1180 | 1091 | StoreDoubles(&out[y][x * channels], prevOut[0][0]); | ||
1181 | 1092 | StoreDoubles(&out[y][x * channels + 2], prevOut[0][1]); | ||
1182 | 1093 | } | ||
1183 | 1094 | |||
1184 | 1095 | x += xStep; | ||
1185 | 1096 | if (x == xEnd) | ||
1186 | 1097 | goto nextIteration; | ||
1187 | 1098 | } | ||
1188 | 1099 | else if (isBorder && isForwardPass) | ||
1189 | 1100 | { | ||
1190 | 1101 | __m128d firstPixel[2]; | ||
1191 | 1102 | LoadDoubles(firstPixel[0], &in[y][0 * channels]); | ||
1192 | 1103 | LoadDoubles(firstPixel[1], &in[y][0 * channels + 2]); | ||
1193 | 1104 | for (ssize_t i = 0; i < N; ++i) | ||
1194 | 1105 | { | ||
1195 | 1106 | prevOut[i][0] = firstPixel[0]; | ||
1196 | 1107 | prevOut[i][1] = firstPixel[1]; | ||
1197 | 1108 | } | ||
1198 | 1109 | } | ||
1199 | 1110 | else | ||
1200 | 1111 | { | ||
1201 | 1112 | for (ssize_t i = 0; i < N; ++i) | ||
1202 | 1113 | { | ||
1203 | 1114 | if (transposeOut) | ||
1204 | 1115 | { | ||
1205 | 1116 | LoadDoubles(prevOut[i][0], &out[xStart - (i + 1) * xStep][y * channels]); | ||
1206 | 1117 | LoadDoubles(prevOut[i][1], &out[xStart - (i + 1) * xStep][y * channels + 2]); | ||
1207 | 1118 | } | ||
1208 | 1119 | else | ||
1209 | 1120 | { | ||
1210 | 1121 | LoadDoubles(prevOut[i][0], &out[y][(xStart - (i + 1) * xStep) * channels]); | ||
1211 | 1122 | LoadDoubles(prevOut[i][1], &out[y][(xStart - (i + 1) * xStep) * channels + 2]); | ||
1212 | 1123 | } | ||
1213 | 1124 | } | ||
1214 | 1125 | } | ||
1215 | 1126 | |||
1216 | 1127 | while (isForwardPass ? (x < xEnd) : (x > xEnd)) | ||
1217 | 1128 | { | ||
1218 | 1129 | __m128d vIn[2], vSum[2]; | ||
1219 | 1130 | LoadDoubles(vIn[0], &in[y][x * channels]); | ||
1220 | 1131 | LoadDoubles(vIn[1], &in[y][x * channels + 2]); | ||
1221 | 1132 | |||
1222 | 1133 | // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3] | ||
1223 | 1134 | vSum[0] = prevOut[0][0] * vCoefficients[1][0] | ||
1224 | 1135 | + prevOut[1][0] * vCoefficients[2][0] | ||
1225 | 1136 | + prevOut[2][0] * vCoefficients[3][0] | ||
1226 | 1137 | + vIn[0] * vCoefficients[0][0]; | ||
1227 | 1138 | |||
1228 | 1139 | vSum[1] = prevOut[0][1] * vCoefficients[1][1] | ||
1229 | 1140 | + prevOut[1][1] * vCoefficients[2][1] | ||
1230 | 1141 | + prevOut[2][1] * vCoefficients[3][1] | ||
1231 | 1142 | + vIn[1] * vCoefficients[0][1]; | ||
1232 | 1143 | if (transposeOut) | ||
1233 | 1144 | { | ||
1234 | 1145 | StoreDoubles(&out[x][y * channels], vSum[0]); | ||
1235 | 1146 | StoreDoubles(&out[x][y * channels + 2], vSum[1]); | ||
1236 | 1147 | } | ||
1237 | 1148 | else | ||
1238 | 1149 | { | ||
1239 | 1150 | StoreDoubles(&out[y][x * channels], vSum[0]); | ||
1240 | 1151 | StoreDoubles(&out[y][x * channels + 2], vSum[1]); | ||
1241 | 1152 | } | ||
1242 | 1153 | prevOut[2][0] = prevOut[1][0]; | ||
1243 | 1154 | prevOut[2][1] = prevOut[1][1]; | ||
1244 | 1155 | prevOut[1][0] = prevOut[0][0]; | ||
1245 | 1156 | prevOut[1][1] = prevOut[0][1]; | ||
1246 | 1157 | prevOut[0][0] = vSum[0]; | ||
1247 | 1158 | prevOut[0][1] = vSum[1]; | ||
1248 | 1159 | x += xStep; | ||
1249 | 1160 | } | ||
1250 | 1161 | nextIteration: | ||
1251 | 1162 | ++y; | ||
1252 | 1163 | } while (y < height); | ||
1253 | 1164 | #endif | ||
1254 | 1165 | } | ||
1255 | 1166 | else | ||
1256 | 1167 | { | ||
1257 | 1168 | ssize_t y = 0; | ||
1258 | 1169 | do | ||
1259 | 1170 | { | ||
1260 | 1171 | if (isForwardPass) | ||
1261 | 1172 | { | ||
1262 | 1173 | ssize_t x = xStart; | ||
1263 | 1174 | __m128d feedback[2], | ||
1264 | 1175 | k0[2]; | ||
1265 | 1176 | #ifdef __AVX__ | ||
1266 | 1177 | k0[0] = Cast256To128(vCoefficients[0]); | ||
1267 | 1178 | k0[1] = _mm256_extractf128_pd(vCoefficients[0], 1); | ||
1268 | 1179 | #else | ||
1269 | 1180 | k0[0] = vCoefficients[0]; | ||
1270 | 1181 | k0[1] = vCoefficients[1]; | ||
1271 | 1182 | #endif | ||
1272 | 1183 | |||
1273 | 1184 | if (isBorder && isForwardPass) | ||
1274 | 1185 | { | ||
1275 | 1186 | // xStart must be 0 | ||
1276 | 1187 | feedback[0] = feedback[1] = _mm_set1_pd(in[y][0]); | ||
1277 | 1188 | } | ||
1278 | 1189 | else | ||
1279 | 1190 | { | ||
1280 | 1191 | LoadDoubles(feedback[0], &out[y][xStart - 3 * xStep]); | ||
1281 | 1192 | LoadDoubles(feedback[1], &out[y][xStart - 1 * xStep]); | ||
1282 | 1193 | |||
1283 | 1194 | feedback[1] = _mm_shuffle_pd(feedback[0], feedback[1], _MM_SHUFFLE2(0, 1)); | ||
1284 | 1195 | feedback[0] = _mm_shuffle_pd(feedback[0], feedback[0], _MM_SHUFFLE2(0, 0)); | ||
1285 | 1196 | } | ||
1286 | 1197 | // feedback[0] = [input slot, out(x-3)], feedback[1] = [out(x-2), out(x-1)] | ||
1287 | 1198 | for (; x != xEnd; x += xStep) | ||
1288 | 1199 | { | ||
1289 | 1200 | __m128d _in = _mm_set1_pd(in[y][x]), | ||
1290 | 1201 | newOutput; | ||
1291 | 1202 | #ifdef __SSE4_1__ | ||
1292 | 1203 | feedback[0] = _mm_blend_pd(feedback[0], _in, 0x1); | ||
1293 | 1204 | newOutput = _mm_add_pd(_mm_dp_pd(feedback[0], k0[0], 0x31), | ||
1294 | 1205 | _mm_dp_pd(feedback[1], k0[1], 0x31)); | ||
1295 | 1206 | feedback[0] = _mm_blend_pd(feedback[0], newOutput, 0x1); // insert back input | ||
1296 | 1207 | #else | ||
1297 | 1208 | __m128d FIRST_ELEMENT_MASK = _mm_castsi128_pd(_mm_set_epi64x(0, ~uint64_t(0))); | ||
1298 | 1209 | feedback[0] = Select(feedback[0], _in, FIRST_ELEMENT_MASK); | ||
1299 | 1210 | |||
1300 | 1211 | __m128d partialDP = _mm_add_pd | ||
1301 | 1212 | ( | ||
1302 | 1213 | _mm_mul_pd(feedback[0], k0[0]), | ||
1303 | 1214 | _mm_mul_pd(feedback[1], k0[1]) | ||
1304 | 1215 | ); | ||
1305 | 1216 | newOutput = _mm_add_pd | ||
1306 | 1217 | ( | ||
1307 | 1218 | partialDP, | ||
1308 | 1219 | _mm_shuffle_pd(partialDP, partialDP, _MM_SHUFFLE2(0, 1)) | ||
1309 | 1220 | ); | ||
1310 | 1221 | feedback[0] = Select(feedback[0], newOutput, FIRST_ELEMENT_MASK); // insert back input | ||
1311 | 1222 | #endif | ||
1312 | 1223 | out[y][x] = _mm_cvtsd_f64(newOutput); | ||
1313 | 1224 | feedback[0] = _mm_shuffle_pd(feedback[0], feedback[1], _MM_SHUFFLE2(0, 0)); | ||
1314 | 1225 | feedback[1] = _mm_shuffle_pd(feedback[1], feedback[0], _MM_SHUFFLE2(0, 1)); | ||
1315 | 1226 | } | ||
1316 | 1227 | } | ||
1317 | 1228 | else | ||
1318 | 1229 | { | ||
1319 | 1230 | __m128d feedback[2], k4[2]; | ||
1320 | 1231 | #ifdef __AVX__ | ||
1321 | 1232 | k4[0] = Cast256To128(vCoefficients[4]); | ||
1322 | 1233 | k4[1] = _mm256_extractf128_pd(vCoefficients[4], 1); | ||
1323 | 1234 | #else | ||
1324 | 1235 | k4[0] = vCoefficients[8]; | ||
1325 | 1236 | k4[1] = vCoefficients[9]; | ||
1326 | 1237 | #endif | ||
1327 | 1238 | ssize_t x = xStart; | ||
1328 | 1239 | if (isBorder && !isForwardPass) | ||
1329 | 1240 | { | ||
1330 | 1241 | // xStart must be width - 1 | ||
1331 | 1242 | double u[N + 1][1]; //[x][y][channels] | ||
1332 | 1243 | for (ssize_t i = 0; i < N + 1; ++i) | ||
1333 | 1244 | { | ||
1334 | 1245 | u[i][0] = in[y][xStart + i * xStep]; | ||
1335 | 1246 | } | ||
1336 | 1247 | #define ROUND_UP(a, b) (((a) + (b) - 1) / (b) * (b)) // parenthesize args so expression arguments expand safely | ||
1337 | 1248 | double backwardsInitialState[ROUND_UP(N, 2)][1]; // pad so vector loads don't go past end | ||
1338 | 1249 | calcTriggsSdikaInitialization<1>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
1339 | 1250 | |||
1340 | 1251 | feedback[0] = _mm_loadu_pd(&backwardsInitialState[0][0]); // loadu: the stack array is only guaranteed 8-byte aligned | ||
1341 | 1252 | feedback[1] = _mm_loadu_pd(&backwardsInitialState[2][0]); | ||
1342 | 1253 | |||
1343 | 1254 | out[y][x] = backwardsInitialState[0][0]; | ||
1344 | 1255 | x += xStep; | ||
1345 | 1256 | if (x == xEnd) | ||
1346 | 1257 | goto nextIteration2; // 'continue' inside this do-while would skip the ++y below and never terminate | ||
1347 | 1258 | } | ||
1348 | 1259 | else | ||
1349 | 1260 | { | ||
1350 | 1261 | LoadDoubles(feedback[0], &out[y][xStart - xStep]); | ||
1351 | 1262 | LoadDoubles(feedback[1], &out[y][xStart - 3 * xStep]); | ||
1352 | 1263 | } | ||
1353 | 1264 | |||
1354 | 1265 | for (; x != xEnd; x += xStep) | ||
1355 | 1266 | { | ||
1356 | 1267 | __m128d _in = _mm_set1_pd(in[y][x]), | ||
1357 | 1268 | newOutput; | ||
1358 | 1269 | #ifdef __SSE4_1__ | ||
1359 | 1270 | feedback[1] = _mm_blend_pd(feedback[1], _in, 0x2); | ||
1360 | 1271 | newOutput = _mm_add_pd(_mm_dp_pd(feedback[0], k4[0], 0x32), | ||
1361 | 1272 | _mm_dp_pd(feedback[1], k4[1], 0x32)); | ||
1362 | 1273 | feedback[1] = _mm_blend_pd(feedback[1], newOutput, 0x2); // insert back input | ||
1363 | 1274 | #else | ||
1364 | 1275 | __m128d LAST_ELEMENT_MASK = _mm_castsi128_pd(_mm_set_epi64x(~uint64_t(0), 0)); | ||
1365 | 1276 | feedback[1] = Select(feedback[1], _in, LAST_ELEMENT_MASK); | ||
1366 | 1277 | |||
1367 | 1278 | __m128d partialDP = _mm_add_pd | ||
1368 | 1279 | ( | ||
1369 | 1280 | _mm_mul_pd(feedback[0], k4[0]), | ||
1370 | 1281 | _mm_mul_pd(feedback[1], k4[1]) | ||
1371 | 1282 | ); | ||
1372 | 1283 | newOutput = _mm_add_pd | ||
1373 | 1284 | ( | ||
1374 | 1285 | partialDP, | ||
1375 | 1286 | _mm_shuffle_pd(partialDP, partialDP, _MM_SHUFFLE2(0, 0)) | ||
1376 | 1287 | ); | ||
1377 | 1288 | feedback[1] = Select(feedback[1], newOutput, LAST_ELEMENT_MASK); | ||
1378 | 1289 | #endif | ||
1379 | 1290 | |||
1380 | 1291 | __m128d temp = _mm_shuffle_pd(feedback[1], feedback[1], _MM_SHUFFLE2(0, 1)); | ||
1381 | 1292 | out[y][x] = _mm_cvtsd_f64(temp); | ||
1382 | 1293 | |||
1383 | 1294 | feedback[1] = _mm_shuffle_pd(feedback[0], feedback[1], _MM_SHUFFLE2(1, 1)); | ||
1384 | 1295 | feedback[0] = _mm_shuffle_pd(feedback[1], feedback[0], _MM_SHUFFLE2(0, 1)); | ||
1385 | 1296 | } | ||
1386 | 1297 | } | ||
1387 | 1298 | nextIteration2: ++y; | ||
1388 | 1299 | } while (y < height); | ||
1389 | 1300 | } | ||
1390 | 1301 | } | ||
1391 | 1302 | |||
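Reviewer note: the single-channel double path above is the least obvious piece: it packs the current input and the last three outputs into two __m128d registers and evaluates the whole recurrence as two 2-wide dot products. A hedged sketch of the forward-loop choreography (lane layout inferred from the shuffles; it is not documented in the patch):

    // feedback[0] = [ input/output slot, out(x-3) ]   feedback[1] = [ out(x-2), out(x-1) ]
    // per iteration:
    //   1. blend in[x] into lane 0 of feedback[0]
    //   2. y = dp(feedback[0], k0[0]) + dp(feedback[1], k0[1])  // k pairs presumably hold (b0,b3) and (b2,b1)
    //   3. blend y back into lane 0 of feedback[0]
    //   4. two shuffles slide the 3-output window along for x+1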
1392 | 1303 | |||
1393 | 1304 | template <int channels, bool transposeOut, ssize_t xStep, int i0, int i1, int i2, typename OutType, typename InType> | ||
1394 | 1305 | FORCE_INLINE void DoOneIIR(SimpleImage<OutType> out, SimpleImage<InType> in, __m256 &vSum, __m256 &vIn, ssize_t x, ssize_t y, __m256 vCoefficients[N + 1], __m256 prevOut[N]) | ||
1395 | 1306 | { | ||
1396 | 1307 | vSum = vIn * vCoefficients[0]; | ||
1397 | 1308 | |||
1398 | 1309 | // load data for next iteration early to hide latency (software pipelining) | ||
1399 | 1310 | vIn = Load4x2Floats(&in[y][(x + xStep) * channels], | ||
1400 | 1311 | &in[y + 1][(x + xStep) * channels]); | ||
1401 | 1312 | |||
1402 | 1313 | // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3] | ||
1403 | 1314 | #ifdef __FMA__ | ||
1404 | 1315 | // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster | ||
1405 | 1316 | vSum = MultiplyAdd(prevOut[i2], vCoefficients[3], vSum) + MultiplyAdd(prevOut[i1], vCoefficients[2], prevOut[i0] * vCoefficients[1]); | ||
1406 | 1317 | #else | ||
1407 | 1318 | vSum = prevOut[i0] * vCoefficients[1] | ||
1408 | 1319 | + prevOut[i1] * vCoefficients[2] | ||
1409 | 1320 | + prevOut[i2] * vCoefficients[3] | ||
1410 | 1321 | + vIn * vCoefficients[0]; | ||
1411 | 1322 | #endif | ||
1412 | 1323 | if (transposeOut) | ||
1413 | 1324 | { | ||
1414 | 1325 | StoreFloats(&out[x][y * channels], _mm256_castps256_ps128(vSum)); | ||
1415 | 1326 | StoreFloats(&out[x][(y + 1) * channels], _mm256_extractf128_ps(vSum, 1)); | ||
1416 | 1327 | } | ||
1417 | 1328 | else | ||
1418 | 1329 | { | ||
1419 | 1330 | StoreFloats(&out[y][x * channels], _mm256_castps256_ps128(vSum)); | ||
1420 | 1331 | StoreFloats(&out[y + 1][x * channels], _mm256_extractf128_ps(vSum, 1)); | ||
1421 | 1332 | } | ||
1422 | 1333 | } | ||
1423 | 1334 | |||
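Reviewer note: both DoOneIIR variants issue the next iteration's load before the current iteration's arithmetic. Schematically (a sketch of the intent, not new code in the patch):

    // software pipelining: the load for x+1 starts while the serial
    // multiply-add chain for x is still in flight, hiding memory latency.
    //   vIn_next = load(in, x + xStep);   // independent of this iteration's math
    //   vSum     = recurrence(vIn, prevOut, coefficients);
    //   store(out, vSum);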
1424 | 1335 | // hack: GCC fails to compile when FORCE_INLINE is on, most likely because OpenMP generates code for the default target (SSE2 only) rather than the one defined in #pragma, creating 2 incompatible functions that can't be inlined | ||
1425 | 1336 | template <bool transposeOut, bool isForwardPass, bool isBorder, int channels, typename OutType, typename InType, typename SIMD_Type> | ||
1426 | 1337 | static /*FORCE_INLINE*/void Convolve1DHorizontal(SimpleImage<OutType> out, | ||
1427 | 1338 | SimpleImage<InType> in, | ||
1428 | 1339 | float *borderValues, | ||
1429 | 1340 | ssize_t xStart, ssize_t xEnd, ssize_t width, ssize_t height, | ||
1430 | 1341 | SIMD_Type *vCoefficients, double M[N * N]) | ||
1431 | 1342 | { | ||
1432 | 1343 | #if 0 | ||
1433 | 1344 | MyTraits<float>::SIMDtype coefficients2[4]; | ||
1434 | 1345 | |||
1435 | 1346 | if (channels == 1) | ||
1436 | 1347 | { | ||
1437 | 1348 | coefficients2[0] = _mm256_set1_ps(((float *)vCoefficients)[0]); | ||
1438 | 1349 | coefficients2[1] = _mm256_set1_ps(((float *)vCoefficients)[3]); | ||
1439 | 1350 | coefficients2[2] = _mm256_set1_ps(((float *)vCoefficients)[2]); | ||
1440 | 1351 | coefficients2[3] = _mm256_set1_ps(((float *)vCoefficients)[1]); | ||
1441 | 1352 | vCoefficients = coefficients2; | ||
1442 | 1353 | } | ||
1443 | 1354 | Convolve1DHorizontalRef<transposeOut, isForwardPass, isBorder, channels>(out, | ||
1444 | 1355 | in, | ||
1445 | 1356 | borderValues, | ||
1446 | 1357 | xStart, xEnd, width, height, | ||
1447 | 1358 | vCoefficients, M); | ||
1448 | 1359 | return; | ||
1449 | 1360 | #endif | ||
1450 | 1361 | const ssize_t xStep = isForwardPass ? 1 : -1; | ||
1451 | 1362 | |||
1452 | 1363 | if (channels == 4) | ||
1453 | 1364 | { | ||
1454 | 1365 | ssize_t y = 0; | ||
1455 | 1366 | #ifdef __AVX__ | ||
1456 | 1367 | for (; y <= height - 2; y += 2) // AVX code processes 2 rows at a time | ||
1457 | 1368 | { | ||
1458 | 1369 | __m256 prevOut[N]; | ||
1459 | 1370 | |||
1460 | 1371 | ssize_t x = xStart; | ||
1461 | 1372 | |||
1462 | 1373 | if (isBorder && !isForwardPass) | ||
1463 | 1374 | { | ||
1464 | 1375 | float u[N + 1][2 * channels]; //[x][y][channels] | ||
1465 | 1376 | for (ssize_t i = 0; i < N + 1; ++i) | ||
1466 | 1377 | { | ||
1467 | 1378 | __m128 temp; | ||
1468 | 1379 | _mm_storeu_ps(&u[i][0], LoadFloats(temp, &in[y][(xStart + i * xStep) * channels])); | ||
1469 | 1380 | _mm_storeu_ps(&u[i][channels], LoadFloats(temp, &in[y + 1][(xStart + i * xStep) * channels])); | ||
1470 | 1381 | } | ||
1471 | 1382 | float backwardsInitialState[N][2 * channels]; | ||
1472 | 1383 | calcTriggsSdikaInitialization<2 * channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
1473 | 1384 | for (ssize_t i = 0; i < N; ++i) | ||
1474 | 1385 | LoadFloats(prevOut[i], backwardsInitialState[i]); | ||
1475 | 1386 | |||
1476 | 1387 | if (transposeOut) | ||
1477 | 1388 | { | ||
1478 | 1389 | StoreFloats(&out[x][y * channels], _mm256_castps256_ps128(prevOut[0])); | ||
1479 | 1390 | StoreFloats(&out[x][(y + 1) * channels], _mm256_extractf128_ps(prevOut[0], 1)); | ||
1480 | 1391 | } | ||
1481 | 1392 | else | ||
1482 | 1393 | { | ||
1483 | 1394 | StoreFloats(&out[y][x * channels], _mm256_castps256_ps128(prevOut[0])); | ||
1484 | 1395 | StoreFloats(&out[y + 1][x * channels], _mm256_extractf128_ps(prevOut[0], 1)); | ||
1485 | 1396 | } | ||
1486 | 1397 | x += xStep; | ||
1487 | 1398 | if (x == xEnd) | ||
1488 | 1399 | continue; | ||
1489 | 1400 | } | ||
1490 | 1401 | else if (isBorder && isForwardPass) | ||
1491 | 1402 | { | ||
1492 | 1403 | // xStart must be 0 | ||
1493 | 1404 | __m256 firstPixel = Load4x2Floats(&in[y][0 * channels], &in[y + 1][0 * channels]); | ||
1494 | 1405 | for (ssize_t i = 0; i < N; ++i) | ||
1495 | 1406 | prevOut[i] = firstPixel; | ||
1496 | 1407 | } | ||
1497 | 1408 | else | ||
1498 | 1409 | { | ||
1499 | 1410 | for (ssize_t i = 0; i < N; ++i) | ||
1500 | 1411 | { | ||
1501 | 1412 | prevOut[i] = transposeOut ? Load4x2Floats(&out[xStart - (i + 1) * xStep][y * channels], | ||
1502 | 1413 | &out[xStart - (i + 1) * xStep][(y + 1) * channels]) | ||
1503 | 1414 | : Load4x2Floats(&out[y][(xStart - (i + 1) * xStep) * channels], | ||
1504 | 1415 | &out[y + 1][(xStart - (i + 1) * xStep) * channels]); | ||
1505 | 1416 | } | ||
1506 | 1417 | } | ||
1507 | 1418 | |||
1508 | 1419 | #if 0 // 2x slower than no unrolling - too many register spills? | ||
1509 | 1420 | // same as loop below, but unrolled 3 times to increase instruction-level parallelism, hide latency, and reduce the overhead of shifting the sliding window (prevOut) | ||
1510 | 1421 | __m256 vIn = Load4x2Floats(&in[y][xStart * channels], | ||
1511 | 1422 | &in[y + 1][xStart * channels]); | ||
1512 | 1423 | |||
1513 | 1424 | for (; isForwardPass ? (x < xEnd - 3) : (x > xEnd + 3); ) | ||
1514 | 1425 | { | ||
1515 | 1426 | __m256 vSum; | ||
1516 | 1427 | DoOneIIR<channels, transposeOut, xStep, 0, 1, 2>(out, in, vSum, vIn, x, y, vCoefficients, prevOut); | ||
1517 | 1428 | prevOut[2] = vSum; | ||
1518 | 1429 | x += xStep; | ||
1519 | 1430 | |||
1520 | 1431 | DoOneIIR<channels, transposeOut, xStep, 2, 0, 1>(out, in, vSum, vIn, x, y, vCoefficients, prevOut); | ||
1521 | 1432 | prevOut[1] = vSum; | ||
1522 | 1433 | x += xStep; | ||
1523 | 1434 | |||
1524 | 1435 | DoOneIIR<channels, transposeOut, xStep, 1, 2, 0>(out, in, vSum, vIn, x, y, vCoefficients, prevOut); | ||
1525 | 1436 | prevOut[0] = vSum; | ||
1526 | 1437 | x += xStep; | ||
1527 | 1438 | } | ||
1528 | 1439 | #endif | ||
1529 | 1440 | for (; x != xEnd; x += xStep) | ||
1530 | 1441 | { | ||
1531 | 1442 | __m256 vIn = Load4x2Floats(&in[y][x * channels], | ||
1532 | 1443 | &in[y + 1][x * channels]), | ||
1533 | 1444 | vSum; | ||
1534 | 1445 | |||
1535 | 1446 | // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3] | ||
1536 | 1447 | #ifdef __FMA__ | ||
1537 | 1448 | // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster | ||
1538 | 1449 | vSum = MultiplyAdd(vIn, vCoefficients[0], prevOut[2] * vCoefficients[3]) + MultiplyAdd(prevOut[1], vCoefficients[2], prevOut[0] * vCoefficients[1]); | ||
1539 | 1450 | #else | ||
1540 | 1451 | vSum = prevOut[0] * vCoefficients[1] | ||
1541 | 1452 | + prevOut[1] * vCoefficients[2] | ||
1542 | 1453 | + prevOut[2] * vCoefficients[3] | ||
1543 | 1454 | + vIn * vCoefficients[0]; | ||
1544 | 1455 | #endif | ||
1545 | 1456 | |||
1546 | 1457 | if (transposeOut) | ||
1547 | 1458 | { | ||
1548 | 1459 | StoreFloats(&out[x][y * channels], _mm256_castps256_ps128(vSum)); | ||
1549 | 1460 | StoreFloats(&out[x][(y + 1) * channels], _mm256_extractf128_ps(vSum, 1)); | ||
1550 | 1461 | } | ||
1551 | 1462 | else | ||
1552 | 1463 | { | ||
1553 | 1464 | StoreFloats(&out[y][x * channels], _mm256_castps256_ps128(vSum)); | ||
1554 | 1465 | StoreFloats(&out[y + 1][x * channels], _mm256_extractf128_ps(vSum, 1)); | ||
1555 | 1466 | } | ||
1556 | 1467 | prevOut[2] = prevOut[1]; | ||
1557 | 1468 | prevOut[1] = prevOut[0]; | ||
1558 | 1469 | prevOut[0] = vSum; | ||
1559 | 1470 | } | ||
1560 | 1471 | } | ||
1561 | 1472 | #endif | ||
1562 | 1473 | for (; y < height; ++y) | ||
1563 | 1474 | { | ||
1564 | 1475 | __m128 prevOut[N]; | ||
1565 | 1476 | ssize_t x = xStart; | ||
1566 | 1477 | |||
1567 | 1478 | if (isBorder && !isForwardPass) | ||
1568 | 1479 | { | ||
1569 | 1480 | float u[N + 1][channels]; //[x][channels] | ||
1570 | 1481 | for (ssize_t i = 0; i < N + 1; ++i) | ||
1571 | 1482 | { | ||
1572 | 1483 | __m128 temp; | ||
1573 | 1484 | _mm_storeu_ps(u[i], LoadFloats(temp, &in[y][(xStart + i * xStep) * channels])); | ||
1574 | 1485 | } | ||
1575 | 1486 | float backwardsInitialState[N][channels]; | ||
1576 | 1487 | calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
1577 | 1488 | for (ssize_t i = 0; i < N; ++i) | ||
1578 | 1489 | LoadFloats(prevOut[i], backwardsInitialState[i]); | ||
1579 | 1490 | |||
1580 | 1491 | if (transposeOut) | ||
1581 | 1492 | StoreFloats(&out[x][y * channels], prevOut[0]); | ||
1582 | 1493 | else | ||
1583 | 1494 | StoreFloats(&out[y][x * channels], prevOut[0]); | ||
1584 | 1495 | x += xStep; | ||
1585 | 1496 | if (x == xEnd) | ||
1586 | 1497 | continue; | ||
1587 | 1498 | } | ||
1588 | 1499 | else if (isBorder && isForwardPass) | ||
1589 | 1500 | { | ||
1590 | 1501 | // xStart must be 0 | ||
1591 | 1502 | __m128 firstPixel; | ||
1592 | 1503 | LoadFloats(firstPixel, &in[y][0 * channels]); | ||
1593 | 1504 | for (ssize_t i = 0; i < N; ++i) | ||
1594 | 1505 | prevOut[i] = firstPixel; | ||
1595 | 1506 | } | ||
1596 | 1507 | else | ||
1597 | 1508 | { | ||
1598 | 1509 | for (ssize_t i = 0; i < N; ++i) | ||
1599 | 1510 | { | ||
1600 | 1511 | if (transposeOut) | ||
1601 | 1512 | LoadFloats(prevOut[i], &out[xStart - (i + 1) * xStep][y * channels]); | ||
1602 | 1513 | else | ||
1603 | 1514 | LoadFloats(prevOut[i], &out[y][(xStart - (i + 1) * xStep) * channels]); | ||
1604 | 1515 | } | ||
1605 | 1516 | } | ||
1606 | 1517 | |||
1607 | 1518 | do | ||
1608 | 1519 | { | ||
1609 | 1520 | __m128 vIn, vSum; | ||
1610 | 1521 | LoadFloats(vIn, &in[y][x * channels]); | ||
1611 | 1522 | // since coefficient[0] * in can be very small, it should be added to a similar magnitude term to minimize rounding error. For this gaussian filter, the 2nd smallest term is usually coefficient[3] * out[-3] | ||
1612 | 1523 | #ifdef __FMA__ | ||
1613 | 1524 | // this expression uses fewer MADs than the max. possible, but has a shorter critical path and is actually faster | ||
1614 | 1525 | vSum = MultiplyAdd(vIn, Cast256To128(vCoefficients[0]), prevOut[2] * Cast256To128(vCoefficients[3])) + MultiplyAdd(prevOut[1], Cast256To128(vCoefficients[2]), prevOut[0] * Cast256To128(vCoefficients[1])); | ||
1615 | 1526 | #else | ||
1616 | 1527 | vSum = prevOut[0] * Cast256To128(vCoefficients[1]) | ||
1617 | 1528 | + prevOut[1] * Cast256To128(vCoefficients[2]) | ||
1618 | 1529 | + prevOut[2] * Cast256To128(vCoefficients[3]) | ||
1619 | 1530 | + vIn * Cast256To128(vCoefficients[0]); | ||
1620 | 1531 | #endif | ||
1621 | 1532 | if (transposeOut) | ||
1622 | 1533 | { | ||
1623 | 1534 | StoreFloats(&out[x][y * channels], vSum); | ||
1624 | 1535 | } | ||
1625 | 1536 | else | ||
1626 | 1537 | { | ||
1627 | 1538 | StoreFloats(&out[y][x * channels], vSum); | ||
1628 | 1539 | } | ||
1629 | 1540 | prevOut[2] = prevOut[1]; | ||
1630 | 1541 | prevOut[1] = prevOut[0]; | ||
1631 | 1542 | prevOut[0] = vSum; | ||
1632 | 1543 | x += xStep; | ||
1633 | 1544 | } while (x != xEnd); | ||
1634 | 1545 | } | ||
1635 | 1546 | } | ||
1636 | 1547 | else | ||
1637 | 1548 | { | ||
1638 | 1549 | //static_assert(!transposeOut, "transpose not supported"); | ||
1639 | 1550 | ssize_t y = 0; | ||
1640 | 1551 | |||
1641 | 1552 | const ssize_t Y_BLOCK_SIZE = 8; | ||
1642 | 1553 | #ifdef __AVX__ | ||
1643 | 1554 | for (; y <= height - Y_BLOCK_SIZE; y += Y_BLOCK_SIZE) | ||
1644 | 1555 | { | ||
1645 | 1556 | if (isForwardPass) | ||
1646 | 1557 | { | ||
1647 | 1558 | ssize_t x = xStart; | ||
1648 | 1559 | __m256 feedback[4], | ||
1649 | 1560 | k0 = vCoefficients[0], | ||
1650 | 1561 | k1 = vCoefficients[1], | ||
1651 | 1562 | k2 = vCoefficients[2], | ||
1652 | 1563 | k3 = vCoefficients[3]; | ||
1653 | 1564 | |||
1654 | 1565 | if (isBorder && isForwardPass) | ||
1655 | 1566 | { | ||
1656 | 1567 | // xStart must be 0 | ||
1657 | 1568 | for (ssize_t i = 0; i < 4; ++i) | ||
1658 | 1569 | { | ||
1659 | 1570 | feedback[i] = _mm256_setr_m128(_mm_set1_ps(in[y + i * 2][0]), | ||
1660 | 1571 | _mm_set1_ps(in[y + i * 2 + 1][0])); | ||
1661 | 1572 | } | ||
1662 | 1573 | } | ||
1663 | 1574 | else | ||
1664 | 1575 | { | ||
1665 | 1576 | for (ssize_t i = 0; i < 4; ++i) | ||
1666 | 1577 | { | ||
1667 | 1578 | feedback[i] = Load4x2Floats(&out[y + i * 2][xStart - 3 * xStep], | ||
1668 | 1579 | &out[y + i * 2 + 1][xStart - 3 * xStep]); | ||
1669 | 1580 | feedback[i] = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(2, 1, 0, 0)); | ||
1670 | 1581 | } | ||
1671 | 1582 | } | ||
1672 | 1583 | // each feedback[i] = [input slot, out(x-3), out(x-2), out(x-1)] in each 128-bit half (one row per half) | ||
1673 | 1584 | for (; x <= xEnd - 4; x += 4) | ||
1674 | 1585 | { | ||
1675 | 1586 | __m256 _in[4]; | ||
1676 | 1587 | for (ssize_t i = 0; i < 4; ++i) | ||
1677 | 1588 | _in[i] = Load4x2Floats(&in[y + i * 2][x], &in[y + i * 2 + 1][x]); | ||
1678 | 1589 | |||
1679 | 1590 | for (int i = 0; i < 4; ++i) | ||
1680 | 1591 | { | ||
1681 | 1592 | feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x11); | ||
1682 | 1593 | feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k0, 0xf1), 0x11); // insert back input | ||
1683 | 1594 | } | ||
1684 | 1595 | |||
1685 | 1596 | for (int i = 0; i < 4; ++i) | ||
1686 | 1597 | { | ||
1687 | 1598 | feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x22); | ||
1688 | 1599 | feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k1, 0xf2), 0x22); // insert back input | ||
1689 | 1600 | } | ||
1690 | 1601 | |||
1691 | 1602 | for (int i = 0; i < 4; ++i) | ||
1692 | 1603 | { | ||
1693 | 1604 | feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x44); | ||
1694 | 1605 | feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k2, 0xf4), 0x44); // insert back input | ||
1695 | 1606 | } | ||
1696 | 1607 | |||
1697 | 1608 | for (ssize_t i = 0; i < 4; ++i) | ||
1698 | 1609 | { | ||
1699 | 1610 | feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x88); | ||
1700 | 1611 | feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k3, 0xf8), 0x88); // insert back input | ||
1701 | 1612 | |||
1702 | 1613 | _mm_storeu_ps((float *)&out[y + i * 2][x], _mm256_castps256_ps128(feedback[i])); | ||
1703 | 1614 | _mm_storeu_ps((float *)&out[y + i * 2 + 1][x], _mm256_extractf128_ps(feedback[i], 1)); | ||
1704 | 1615 | } | ||
1705 | 1616 | } | ||
1706 | 1617 | for (; x != xEnd; x += xStep) | ||
1707 | 1618 | { | ||
1708 | 1619 | // todo: make these loads scalar to avoid reading past the end of the row | ||
1709 | 1620 | __m256 _in[4]; | ||
1710 | 1621 | for (ssize_t i = 0; i < 4; ++i) | ||
1711 | 1622 | { | ||
1712 | 1623 | _in[i] = Load4x2Floats(&in[y + i * 2][x], | ||
1713 | 1624 | &in[y + i * 2 + 1][x]); | ||
1714 | 1625 | } | ||
1715 | 1626 | |||
1716 | 1627 | for (int i = 0; i < 4; ++i) | ||
1717 | 1628 | { | ||
1718 | 1629 | feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x11); | ||
1719 | 1630 | feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k0, 0xf1), 0x11); // insert back input | ||
1720 | 1631 | } | ||
1721 | 1632 | |||
1722 | 1633 | for (ssize_t i = 0; i < 4; ++i) | ||
1723 | 1634 | { | ||
1724 | 1635 | out[y + i * 2][x] = _mm_cvtss_f32(_mm256_castps256_ps128(feedback[i])); | ||
1725 | 1636 | out[y + i * 2 + 1][x] = _mm_cvtss_f32(_mm256_extractf128_ps(feedback[i], 1)); | ||
1726 | 1637 | } | ||
1727 | 1638 | |||
1728 | 1639 | for (int i = 0; i < 4; ++i) | ||
1729 | 1640 | feedback[i] = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(0, 3, 2, 0)); | ||
1730 | 1641 | } | ||
1731 | 1642 | } | ||
1732 | 1643 | else | ||
1733 | 1644 | { | ||
1734 | 1645 | __m256 feedback[4], | ||
1735 | 1646 | k4 = vCoefficients[4], | ||
1736 | 1647 | k5 = vCoefficients[5], | ||
1737 | 1648 | k6 = vCoefficients[6], | ||
1738 | 1649 | k7 = vCoefficients[7]; | ||
1739 | 1650 | ssize_t x = xStart; | ||
1740 | 1651 | if (isBorder && !isForwardPass) | ||
1741 | 1652 | { | ||
1742 | 1653 | // xStart must be width - 1 | ||
1743 | 1654 | float u[N + 1][8 * channels]; //[x][y][channels] | ||
1744 | 1655 | for (ssize_t i = 0; i < N + 1; ++i) | ||
1745 | 1656 | { | ||
1746 | 1657 | for (ssize_t _y = 0; _y < 8; ++_y) | ||
1747 | 1658 | u[i][_y] = in[y + _y][xStart + i * xStep]; | ||
1748 | 1659 | } | ||
1749 | 1660 | float backwardsInitialState[N][8 * channels]; | ||
1750 | 1661 | calcTriggsSdikaInitialization<8 * channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
1751 | 1662 | |||
1752 | 1663 | for (ssize_t i = 0; i < 4; ++i) | ||
1753 | 1664 | { | ||
1754 | 1665 | float temp[2][N + 1]; // padded to 4 so the vector load stays in bounds; the 4th lane is the input slot | ||
1755 | 1666 | for (ssize_t j = 0; j < N; ++j) | ||
1756 | 1667 | { | ||
1757 | 1668 | temp[0][j] = backwardsInitialState[j][i * 2]; | ||
1758 | 1669 | temp[1][j] = backwardsInitialState[j][i * 2 + 1]; | ||
1759 | 1670 | } | ||
1760 | 1671 | feedback[i] = Load4x2Floats(temp[0], temp[1]); | ||
1761 | 1672 | } | ||
1762 | 1673 | |||
1763 | 1674 | for (ssize_t _y = 0; _y < Y_BLOCK_SIZE; ++_y) | ||
1764 | 1675 | out[y + _y][x] = backwardsInitialState[0][_y]; | ||
1765 | 1676 | |||
1766 | 1677 | x += xStep; | ||
1767 | 1678 | if (x == xEnd) | ||
1768 | 1679 | continue; | ||
1769 | 1680 | } | ||
1770 | 1681 | else | ||
1771 | 1682 | { | ||
1772 | 1683 | for (ssize_t i = 0; i < 4; ++i) | ||
1773 | 1684 | { | ||
1774 | 1685 | feedback[i] = Load4x2Floats(&out[y + i * 2][xStart - xStep], | ||
1775 | 1686 | &out[y + i * 2 + 1][xStart - xStep]); | ||
1776 | 1687 | } | ||
1777 | 1688 | } | ||
1778 | 1689 | for (; x - 4 >= xEnd; x -= 4) | ||
1779 | 1690 | { | ||
1780 | 1691 | __m256 _in[4]; | ||
1781 | 1692 | for (ssize_t i = 0; i < 4; ++i) | ||
1782 | 1693 | { | ||
1783 | 1694 | _in[i] = Load4x2Floats(&in[y + i * 2][x - 3], | ||
1784 | 1695 | &in[y + i * 2 + 1][x - 3]); | ||
1785 | 1696 | } | ||
1786 | 1697 | |||
1787 | 1698 | for (int i = 0; i < 4; ++i) | ||
1788 | 1699 | { | ||
1789 | 1700 | feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x88); | ||
1790 | 1701 | feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k4, 0xf8), 0x88); // insert back input | ||
1791 | 1702 | } | ||
1792 | 1703 | |||
1793 | 1704 | for (int i = 0; i < 4; ++i) | ||
1794 | 1705 | { | ||
1795 | 1706 | feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x44); | ||
1796 | 1707 | feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k5, 0xf4), 0x44); // insert back input | ||
1797 | 1708 | } | ||
1798 | 1709 | |||
1799 | 1710 | for (int i = 0; i < 4; ++i) | ||
1800 | 1711 | { | ||
1801 | 1712 | feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x22); | ||
1802 | 1713 | feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k6, 0xf2), 0x22); // insert back input | ||
1803 | 1714 | } | ||
1804 | 1715 | |||
1805 | 1716 | for (ssize_t i = 0; i < 4; ++i) | ||
1806 | 1717 | { | ||
1807 | 1718 | feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x11); | ||
1808 | 1719 | feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k7, 0xf1), 0x11); // insert back input | ||
1809 | 1720 | |||
1810 | 1721 | StoreFloats(&out[y + i * 2][x - 3], _mm256_castps256_ps128(feedback[i])); | ||
1811 | 1722 | StoreFloats(&out[y + i * 2 + 1][x - 3], _mm256_extractf128_ps(feedback[i], 1)); | ||
1812 | 1723 | } | ||
1813 | 1724 | } | ||
1814 | 1725 | |||
1815 | 1726 | for ( ; x != xEnd; x += xStep) | ||
1816 | 1727 | { | ||
1817 | 1728 | // todo: make these loads scalar to avoid reading past the end of the row | ||
1818 | 1729 | __m256 _in[4]; | ||
1819 | 1730 | for (ssize_t i = 0; i < 4; ++i) | ||
1820 | 1731 | { | ||
1821 | 1732 | _in[i] = Load4x2Floats(&in[y + i * 2][x - 3], | ||
1822 | 1733 | &in[y + i * 2 + 1][x - 3]); | ||
1823 | 1734 | } | ||
1824 | 1735 | |||
1825 | 1736 | for (int i = 0; i < 4; ++i) | ||
1826 | 1737 | { | ||
1827 | 1738 | feedback[i] = _mm256_blend_ps(feedback[i], _in[i], 0x88); | ||
1828 | 1739 | feedback[i] = _mm256_blend_ps(feedback[i], _mm256_dp_ps(feedback[i], k4, 0xf8), 0x88); // insert back input | ||
1829 | 1740 | } | ||
1830 | 1741 | |||
1831 | 1742 | for (ssize_t i = 0; i < 4; ++i) | ||
1832 | 1743 | { | ||
1833 | 1744 | __m256 temp = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(0, 0, 0, 3)); | ||
1834 | 1745 | out[y + i * 2][x] = _mm_cvtss_f32(_mm256_castps256_ps128(temp)); | ||
1835 | 1746 | out[y + i * 2 + 1][x] = _mm_cvtss_f32(_mm256_extractf128_ps(temp, 1)); | ||
1836 | 1747 | } | ||
1837 | 1748 | |||
1838 | 1749 | |||
1839 | 1750 | for (int i = 0; i < 4; ++i) | ||
1840 | 1751 | feedback[i] = _mm256_shuffle_ps(feedback[i], feedback[i], _MM_SHUFFLE(2, 1, 0, 3)); | ||
1841 | 1752 | } | ||
1842 | 1753 | } | ||
1843 | 1754 | } | ||
1844 | 1755 | #endif | ||
1845 | 1756 | for (; y < height; ++y) | ||
1846 | 1757 | { | ||
1847 | 1758 | if (isForwardPass) | ||
1848 | 1759 | { | ||
1849 | 1760 | ssize_t x = xStart; | ||
1850 | 1761 | __m128 feedback0, | ||
1851 | 1762 | k0 = Cast256To128(vCoefficients[0]); | ||
1852 | 1763 | |||
1853 | 1764 | if (isBorder && isForwardPass) | ||
1854 | 1765 | { | ||
1855 | 1766 | // xStart must be 0 | ||
1856 | 1767 | feedback0 = _mm_set1_ps(in[y][0]); | ||
1857 | 1768 | } | ||
1858 | 1769 | else | ||
1859 | 1770 | { | ||
1860 | 1771 | LoadFloats(feedback0, &out[y][xStart - 3 * xStep]); | ||
1861 | 1772 | feedback0 = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(2, 1, 0, 0)); | ||
1862 | 1773 | } | ||
1863 | 1774 | // feedback0 = [input slot, out(x-3), out(x-2), out(x-1)] | ||
1864 | 1775 | for (; x != xEnd; x += xStep) | ||
1865 | 1776 | { | ||
1866 | 1777 | __m128 _in0 = _mm_set1_ps(in[y][x]); | ||
1867 | 1778 | |||
1868 | 1779 | #ifdef __SSE4_1__ | ||
1869 | 1780 | feedback0 = _mm_blend_ps(feedback0, _in0, 0x1); | ||
1870 | 1781 | feedback0 = _mm_blend_ps(feedback0, _mm_dp_ps(feedback0, k0, 0xf1), 0x1); // insert back input | ||
1871 | 1782 | #else | ||
1872 | 1783 | const __m128 FIRST_ELEMENT_MASK = _mm_castsi128_ps(_mm_set_epi32(0, 0, 0, ~0)); | ||
1873 | 1784 | feedback0 = Select(feedback0, _in0, FIRST_ELEMENT_MASK); | ||
1874 | 1785 | |||
1875 | 1786 | __m128 partialDP = _mm_mul_ps(feedback0, k0); | ||
1876 | 1787 | partialDP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(0, 0, 3, 2))); | ||
1877 | 1788 | __m128 DP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(0, 0, 0, 1))); | ||
1878 | 1789 | |||
1879 | 1790 | feedback0 = Select(feedback0, DP, FIRST_ELEMENT_MASK); // insert back input | ||
1880 | 1791 | #endif | ||
1881 | 1792 | out[y][x] = _mm_cvtss_f32(feedback0); | ||
1882 | 1793 | feedback0 = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(0, 3, 2, 0)); | ||
1883 | 1794 | } | ||
1884 | 1795 | } | ||
1885 | 1796 | else | ||
1886 | 1797 | { | ||
1887 | 1798 | __m128 feedback0, | ||
1888 | 1799 | k4 = Cast256To128(vCoefficients[4]); | ||
1889 | 1800 | |||
1890 | 1801 | ssize_t x = xStart; | ||
1891 | 1802 | if (isBorder && !isForwardPass) | ||
1892 | 1803 | { | ||
1893 | 1804 | // xStart must be width - 1 | ||
1894 | 1805 | float u[N + 1][channels]; //[x][y][channels] | ||
1895 | 1806 | for (ssize_t i = 0; i < N + 1; ++i) | ||
1896 | 1807 | { | ||
1897 | 1808 | u[i][0] = in[y][xStart + i * xStep]; | ||
1898 | 1809 | } | ||
1899 | 1810 | float backwardsInitialState[N][channels]; | ||
1900 | 1811 | calcTriggsSdikaInitialization<channels>(M, u, &borderValues[y * channels], &borderValues[y * channels], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
1901 | 1812 | |||
1902 | 1813 | float temp[N + 1]; // padded so the 4-wide LoadFloats below stays in bounds | ||
1903 | 1814 | for (ssize_t i = 0; i < N; ++i) | ||
1904 | 1815 | { | ||
1905 | 1816 | temp[i] = backwardsInitialState[i][0]; | ||
1906 | 1817 | } | ||
1907 | 1818 | LoadFloats(feedback0, temp); | ||
1908 | 1819 | |||
1909 | 1820 | out[y][x] = backwardsInitialState[0][0]; | ||
1910 | 1821 | x += xStep; | ||
1911 | 1822 | if (x == xEnd) | ||
1912 | 1823 | continue; | ||
1913 | 1824 | } | ||
1914 | 1825 | else | ||
1915 | 1826 | { | ||
1916 | 1827 | LoadFloats(feedback0, &out[y][xStart - xStep]); | ||
1917 | 1828 | } | ||
1918 | 1829 | |||
1919 | 1830 | for (; x != xEnd; x += xStep) | ||
1920 | 1831 | { | ||
1921 | 1832 | __m128 _in0 = _mm_set1_ps(in[y][x]); | ||
1922 | 1833 | |||
1923 | 1834 | #ifdef __SSE4_1__ | ||
1924 | 1835 | feedback0 = _mm_blend_ps(feedback0, _in0, 0x8); | ||
1925 | 1836 | feedback0 = _mm_blend_ps(feedback0, _mm_dp_ps(feedback0, k4, 0xf8), 0x8); // insert back input | ||
1926 | 1837 | #else | ||
1927 | 1838 | const __m128 LAST_ELEMENT_MASK = _mm_castsi128_ps(_mm_set_epi32(~0, 0, 0, 0)); | ||
1928 | 1839 | feedback0 = Select(feedback0, _in0, LAST_ELEMENT_MASK); | ||
1929 | 1840 | |||
1930 | 1841 | __m128 partialDP = _mm_mul_ps(feedback0, k4); | ||
1931 | 1842 | partialDP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(1, 0, 0, 0))); | ||
1932 | 1843 | __m128 DP = _mm_add_ps(partialDP, _mm_shuffle_ps(partialDP, partialDP, _MM_SHUFFLE(2, 0, 0, 0))); | ||
1933 | 1844 | |||
1934 | 1845 | feedback0 = Select(feedback0, DP, LAST_ELEMENT_MASK); // insert back input | ||
1935 | 1846 | #endif | ||
1936 | 1847 | __m128 temp = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(0, 0, 0, 3)); | ||
1937 | 1848 | out[y][x] = _mm_cvtss_f32(temp); | ||
1938 | 1849 | feedback0 = _mm_shuffle_ps(feedback0, feedback0, _MM_SHUFFLE(2, 1, 0, 3)); | ||
1939 | 1850 | } | ||
1940 | 1851 | } | ||
1941 | 1852 | } | ||
1942 | 1853 | } | ||
1943 | 1854 | } | ||
1944 | 1855 | |||
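Reviewer note: in the single-channel float path above, each __m256 feedback[i] carries two independent rows with a 4-tap sliding window per 128-bit half, and the main loop retires 4 x positions per trip. A sketch of one trip (the coefficient sets k0..k3 are assumed to be rotations of (b0,b3,b2,b1) so that b0 always lines up with the freshly blended input):

    // for step s = 0..3 within one trip of the unrolled loop:
    //   blend in[x+s] into lane s of the window    (_mm256_blend_ps, mask 1<<s per half)
    //   y = dot(window, k_s)                       (_mm256_dp_ps, result routed to lane s)
    //   blend y back into lane s
    // after 4 steps every lane holds a fresh output -> one 4-float store per row.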
1945 | 1856 | |||
1946 | 1857 | // does 1D IIR convolution on multiple rows (height) of data | ||
1947 | 1858 | // IntermediateType must be float or double | ||
1948 | 1859 | template <bool isForwardPass, bool isBorder, typename OutType, typename InType, typename IntermediateType> | ||
1949 | 1860 | FORCE_INLINE void Convolve1DVerticalRef(SimpleImage<OutType> out, | ||
1950 | 1861 | SimpleImage<InType> in, | ||
1951 | 1862 | IntermediateType *borderValues, // [y][color] | ||
1952 | 1863 | ssize_t yStart, ssize_t yEnd, ssize_t width, ssize_t height, | ||
1953 | 1864 | typename MyTraits<IntermediateType>::SIMDtype *vCoefficients, double M[N * N]) | ||
1954 | 1865 | { | ||
1955 | 1866 | ssize_t yStep = isForwardPass ? 1 : -1; | ||
1956 | 1867 | |||
1957 | 1868 | ssize_t x = 0; | ||
1958 | 1869 | do | ||
1959 | 1870 | { | ||
1960 | 1871 | IntermediateType prevOut[N]; | ||
1961 | 1872 | ssize_t y = yStart; | ||
1962 | 1873 | if (isBorder && !isForwardPass) | ||
1963 | 1874 | { | ||
1964 | 1875 | IntermediateType u[N + 1][1]; // u[0] = last forward filtered value, u[1] = 2nd last forward filtered value, ... | ||
1965 | 1876 | for (ssize_t i = 0; i < N + 1; ++i) | ||
1966 | 1877 | { | ||
1967 | 1878 | u[i][0] = in[yStart + i * yStep][x]; | ||
1968 | 1879 | } | ||
1969 | 1880 | IntermediateType backwardsInitialState[N][1]; | ||
1970 | 1881 | calcTriggsSdikaInitialization<1>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
1971 | 1882 | for (ssize_t i = 0; i < N; ++i) | ||
1972 | 1883 | prevOut[i] = backwardsInitialState[i][0]; | ||
1973 | 1884 | |||
1974 | 1885 | out[y][x] = clip_round_cast<OutType, IntermediateType>(prevOut[0]); | ||
1975 | 1886 | y += yStep; | ||
1976 | 1887 | if (y == yEnd) | ||
1977 | 1888 | goto nextIteration; | ||
1978 | 1889 | } | ||
1979 | 1890 | else if (isBorder && isForwardPass) | ||
1980 | 1891 | { | ||
1981 | 1892 | for (ssize_t i = 0; i < N; ++i) | ||
1982 | 1893 | prevOut[i] = in[0][x]; | ||
1983 | 1894 | } | ||
1984 | 1895 | else | ||
1985 | 1896 | { | ||
1986 | 1897 | for (ssize_t i = 0; i < N; ++i) | ||
1987 | 1898 | prevOut[i] = out[yStart - (i + 1) * yStep][x]; | ||
1988 | 1899 | } | ||
1989 | 1900 | |||
1990 | 1901 | do | ||
1991 | 1902 | { | ||
1992 | 1903 | IntermediateType sum = prevOut[0] * ExtractElement0(vCoefficients[1]) | ||
1993 | 1904 | + prevOut[1] * ExtractElement0(vCoefficients[2]) | ||
1994 | 1905 | + prevOut[2] * ExtractElement0(vCoefficients[3]) | ||
1995 | 1906 | + in[y][x] * ExtractElement0(vCoefficients[0]); // add last for best accuracy since this term tends to be the smallest | ||
1996 | 1907 | |||
1997 | 1908 | |||
1998 | 1909 | out[y][x] = clip_round_cast<OutType, IntermediateType>(sum); | ||
1999 | 1910 | prevOut[2] = prevOut[1]; | ||
2000 | 1911 | prevOut[1] = prevOut[0]; | ||
2001 | 1912 | prevOut[0] = sum; | ||
2002 | 1913 | y += yStep; | ||
2003 | 1914 | } while (y != yEnd); | ||
2004 | 1915 | nextIteration: | ||
2005 | 1916 | ++x; | ||
2006 | 1917 | } while (x < width); | ||
2007 | 1918 | } | ||
2008 | 1919 | |||
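For contrast with the reference code above, note why the vertical direction vectorizes so cleanly: every image column is an independent recurrence, so the SIMD kernels below simply run SIMD_WIDTH adjacent columns in lock-step. A minimal sketch, assuming GCC vector extensions and the order-3 recurrence used throughout this file (the v4f type and function name are illustrative, not part of this patch):

#include <sys/types.h> // for ssize_t, as used throughout this file

typedef float v4f __attribute__((vector_size(16)));

static inline v4f splat4(float x) { v4f v = {x, x, x, x}; return v; }

// forward pass over 4 adjacent columns at once; in/out point to rows of 4 floats
static void ForwardPass4Columns(v4f *out, const v4f *in, ssize_t height, const double b[4])
{
    // border handling: replicate the first row into the initial filter state
    v4f prev0 = in[0], prev1 = in[0], prev2 = in[0];
    for (ssize_t y = 0; y < height; ++y)
    {
        v4f sum = splat4((float)b[0]) * in[y]
                + splat4((float)b[1]) * prev0
                + splat4((float)b[2]) * prev1
                + splat4((float)b[3]) * prev2;
        out[y] = sum;
        prev2 = prev1; prev1 = prev0; prev0 = sum;
    }
}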
2009 | 1920 | |||
2010 | 1921 | |||
2011 | 1922 | // input is always untransposed | ||
2012 | 1923 | // for reverse pass, input is output from forward pass | ||
2013 | 1924 | // for transposed output, in-place operation isn't possible | ||
2014 | 1925 | // hack: GCC fails to compile with FORCE_INLINE enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target set by #pragma, creating 2 incompatible functions that can't be inlined | ||
2015 | 1926 | template <bool isForwardPass, bool isBorder, typename OutType, typename InType, typename SIMD_Type> | ||
2016 | 1927 | static /*FORCE_INLINE*/ void Convolve1DVertical(SimpleImage<OutType> out, | ||
2017 | 1928 | SimpleImage<InType> in, | ||
2018 | 1929 | float *borderValues, | ||
2019 | 1930 | ssize_t yStart, ssize_t yEnd, ssize_t width, ssize_t height, | ||
2020 | 1931 | SIMD_Type *vCoefficients, double M[N * N]) | ||
2021 | 1932 | { | ||
2022 | 1933 | #if 0 | ||
2023 | 1934 | Convolve1DVerticalRef<isForwardPass, isBorder>(out, | ||
2024 | 1935 | in, | ||
2025 | 1936 | borderValues, | ||
2026 | 1937 | yStart, yEnd, width, height, | ||
2027 | 1938 | vCoefficients, M); | ||
2028 | 1939 | return; | ||
2029 | 1940 | #endif | ||
2030 | 1941 | const ssize_t yStep = isForwardPass ? 1 : -1; | ||
2031 | 1942 | |||
2032 | 1943 | const int SIMD_WIDTH = 8; | ||
2033 | 1944 | ssize_t x = 0; | ||
2034 | 1945 | #ifdef __AVX__ | ||
2035 | 1946 | for ( ; x <= width - SIMD_WIDTH; x += SIMD_WIDTH) | ||
2036 | 1947 | { | ||
2037 | 1948 | __m256 prevOut[N]; | ||
2038 | 1949 | ssize_t y = yStart; | ||
2039 | 1950 | if (isBorder && !isForwardPass) | ||
2040 | 1951 | { | ||
2041 | 1952 | float u[N + 1][SIMD_WIDTH]; //[x][channels] | ||
2042 | 1953 | for (ssize_t i = 0; i < N + 1; ++i) | ||
2043 | 1954 | { | ||
2044 | 1955 | __m256 temp; | ||
2045 | 1956 | _mm256_storeu_ps(u[i], LoadFloats(temp, &in[yStart + i * yStep][x])); | ||
2046 | 1957 | } | ||
2047 | 1958 | float backwardsInitialState[N][SIMD_WIDTH]; | ||
2048 | 1959 | calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
2049 | 1960 | for (ssize_t i = 0; i < N; ++i) | ||
2050 | 1961 | LoadFloats(prevOut[i], backwardsInitialState[i]); | ||
2051 | 1962 | |||
2052 | 1963 | StoreFloats(&out[y][x], prevOut[0]); | ||
2053 | 1964 | |||
2054 | 1965 | y += yStep; | ||
2055 | 1966 | if (y == yEnd) | ||
2056 | 1967 | continue; | ||
2057 | 1968 | } | ||
2058 | 1969 | else if (isBorder && isForwardPass) | ||
2059 | 1970 | { | ||
2060 | 1971 | // yStart must be 0 | ||
2061 | 1972 | __m256 firstPixel; | ||
2062 | 1973 | LoadFloats(firstPixel, &in[0][x]); | ||
2063 | 1974 | for (ssize_t i = 0; i < N; ++i) | ||
2064 | 1975 | prevOut[i] = firstPixel; | ||
2065 | 1976 | } | ||
2066 | 1977 | else | ||
2067 | 1978 | { | ||
2068 | 1979 | for (ssize_t i = 0; i < N; ++i) | ||
2069 | 1980 | { | ||
2070 | 1981 | LoadFloats(prevOut[i], &out[yStart - (i + 1) * yStep][x]); | ||
2071 | 1982 | } | ||
2072 | 1983 | } | ||
2073 | 1984 | |||
2074 | 1985 | do | ||
2075 | 1986 | { | ||
2076 | 1987 | __m256 vIn; | ||
2077 | 1988 | LoadFloats(vIn, &in[y][x]); | ||
2078 | 1989 | __m256 vSum = vIn * vCoefficients[0]; | ||
2079 | 1990 | |||
2080 | 1991 | vSum = prevOut[0] * vCoefficients[1] | ||
2081 | 1992 | + prevOut[1] * vCoefficients[2] | ||
2082 | 1993 | + prevOut[2] * vCoefficients[3] | ||
2083 | 1994 | + vSum; | ||
2084 | 1995 | |||
2085 | 1996 | StoreFloats(&out[y][x], vSum); | ||
2086 | 1997 | |||
2087 | 1998 | prevOut[2] = prevOut[1]; | ||
2088 | 1999 | prevOut[1] = prevOut[0]; | ||
2089 | 2000 | prevOut[0] = vSum; | ||
2090 | 2001 | y += yStep; | ||
2091 | 2002 | } while (isForwardPass ? (y < yEnd) : (y > yEnd)); | ||
2092 | 2003 | } | ||
2093 | 2004 | #endif | ||
2094 | 2005 | { | ||
2095 | 2006 | const ssize_t SIMD_WIDTH = 4; | ||
2096 | 2007 | for (; x < width; x += SIMD_WIDTH) | ||
2097 | 2008 | { | ||
2098 | 2009 | __m128 prevOut[N]; | ||
2099 | 2010 | ssize_t y = yStart; | ||
2100 | 2011 | if (isBorder && !isForwardPass) | ||
2101 | 2012 | { | ||
2102 | 2013 | float u[N + 1][SIMD_WIDTH]; //[x][channels] | ||
2103 | 2014 | for (ssize_t i = 0; i < N + 1; ++i) | ||
2104 | 2015 | { | ||
2105 | 2016 | __m128 temp; | ||
2106 | 2017 | _mm_storeu_ps(u[i], LoadFloats(temp, &in[yStart + i * yStep][x])); | ||
2107 | 2018 | } | ||
2108 | 2019 | float backwardsInitialState[N][SIMD_WIDTH]; | ||
2109 | 2020 | calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
2110 | 2021 | for (ssize_t i = 0; i < N; ++i) | ||
2111 | 2022 | LoadFloats(prevOut[i], backwardsInitialState[i]); | ||
2112 | 2023 | |||
2113 | 2024 | StoreFloats<true>(&out[y][x], prevOut[0], min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores | ||
2114 | 2025 | |||
2115 | 2026 | y += yStep; | ||
2116 | 2027 | if (y == yEnd) | ||
2117 | 2028 | continue; | ||
2118 | 2029 | } | ||
2119 | 2030 | else if (isBorder && isForwardPass) | ||
2120 | 2031 | { | ||
2121 | 2032 | // yStart must be 0 | ||
2122 | 2033 | __m128 firstPixel; | ||
2123 | 2034 | LoadFloats(firstPixel, &in[0][x]); | ||
2124 | 2035 | for (ssize_t i = 0; i < N; ++i) | ||
2125 | 2036 | prevOut[i] = firstPixel; | ||
2126 | 2037 | } | ||
2127 | 2038 | else | ||
2128 | 2039 | { | ||
2129 | 2040 | for (ssize_t i = 0; i < N; ++i) | ||
2130 | 2041 | { | ||
2131 | 2042 | LoadFloats(prevOut[i], &out[yStart - (i + 1) * yStep][x]); | ||
2132 | 2043 | } | ||
2133 | 2044 | } | ||
2134 | 2045 | |||
2135 | 2046 | do | ||
2136 | 2047 | { | ||
2137 | 2048 | __m128 vIn; | ||
2138 | 2049 | LoadFloats(vIn, &in[y][x]); | ||
2139 | 2050 | __m128 vSum = vIn * Cast256To128(vCoefficients[0]); | ||
2140 | 2051 | |||
2141 | 2052 | vSum = prevOut[0] * Cast256To128(vCoefficients[1]) | ||
2142 | 2053 | + prevOut[1] * Cast256To128(vCoefficients[2]) | ||
2143 | 2054 | + prevOut[2] * Cast256To128(vCoefficients[3]) | ||
2144 | 2055 | + vSum; | ||
2145 | 2056 | |||
2146 | 2057 | StoreFloats<true>(&out[y][x], vSum, min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores | ||
2147 | 2058 | |||
2148 | 2059 | prevOut[2] = prevOut[1]; | ||
2149 | 2060 | prevOut[1] = prevOut[0]; | ||
2150 | 2061 | prevOut[0] = vSum; | ||
2151 | 2062 | y += yStep; | ||
2152 | 2063 | } while (isForwardPass ? (y < yEnd) : (y > yEnd)); | ||
2153 | 2064 | } | ||
2154 | 2065 | } | ||
2155 | 2066 | } | ||
2156 | 2067 | |||
2157 | 2068 | // input is always untransposed | ||
2158 | 2069 | // for reverse pass, input is output from forward pass | ||
2159 | 2070 | // for transposed output, in-place operation isn't possible | ||
2160 | 2071 | // hack: GCC fails to compile with FORCE_INLINE enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target set by #pragma, creating 2 incompatible functions that can't be inlined | ||
2161 | 2072 | template <bool isForwardPass, bool isBorder, typename OutType, typename InType, typename SIMD_Type> | ||
2162 | 2073 | static /*FORCE_INLINE*/ void Convolve1DVertical(SimpleImage<OutType> out, | ||
2163 | 2074 | SimpleImage<InType> in, | ||
2164 | 2075 | double *borderValues, | ||
2165 | 2076 | ssize_t yStart, ssize_t yEnd, ssize_t width, ssize_t height, | ||
2166 | 2077 | SIMD_Type *vCoefficients, double M[N * N]) | ||
2167 | 2078 | { | ||
2168 | 2079 | #if 0 | ||
2169 | 2080 | Convolve1DVerticalRef<isForwardPass, isBorder>(out, | ||
2170 | 2081 | in, | ||
2171 | 2082 | borderValues, | ||
2172 | 2083 | yStart, yEnd, width, height, | ||
2173 | 2084 | vCoefficients, M); | ||
2174 | 2085 | return; | ||
2175 | 2086 | #endif | ||
2176 | 2087 | |||
2177 | 2088 | const ssize_t yStep = isForwardPass ? 1 : -1, | ||
2178 | 2089 | SIMD_WIDTH = 4; | ||
2179 | 2090 | ssize_t x = 0; | ||
2180 | 2091 | #ifdef __AVX__ | ||
2181 | 2092 | for ( ; x <= width - SIMD_WIDTH; x += SIMD_WIDTH) | ||
2182 | 2093 | { | ||
2183 | 2094 | __m256d prevOut[N]; | ||
2184 | 2095 | ssize_t y = yStart; | ||
2185 | 2096 | |||
2186 | 2097 | if (isBorder && !isForwardPass) | ||
2187 | 2098 | { | ||
2188 | 2099 | // condition: yStart must be height - 1 | ||
2189 | 2100 | double u[N + 1][SIMD_WIDTH]; //[x][channels] | ||
2190 | 2101 | for (ssize_t i = 0; i < N + 1; ++i) | ||
2191 | 2102 | { | ||
2192 | 2103 | __m256d temp; | ||
2193 | 2104 | _mm256_storeu_pd(u[i], LoadDoubles(temp, &in[yStart + i * yStep][x])); | ||
2194 | 2105 | } | ||
2195 | 2106 | double backwardsInitialState[N][SIMD_WIDTH]; | ||
2196 | 2107 | calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
2197 | 2108 | for (ssize_t i = 0; i < N; ++i) | ||
2198 | 2109 | LoadDoubles(prevOut[i], backwardsInitialState[i]); | ||
2199 | 2110 | |||
2200 | 2111 | StoreDoubles(&out[y][x], prevOut[0]); | ||
2201 | 2112 | |||
2202 | 2113 | y += yStep; | ||
2203 | 2114 | if (y == yEnd) | ||
2204 | 2115 | continue; | ||
2205 | 2116 | } | ||
2206 | 2117 | else if (isBorder && isForwardPass) | ||
2207 | 2118 | { | ||
2208 | 2119 | // condition: yStart must be 0 | ||
2209 | 2120 | __m256d firstPixel; | ||
2210 | 2121 | LoadDoubles(firstPixel, &in[0][x]); | ||
2211 | 2122 | for (ssize_t i = 0; i < N; ++i) | ||
2212 | 2123 | prevOut[i] = firstPixel; | ||
2213 | 2124 | } | ||
2214 | 2125 | else | ||
2215 | 2126 | { | ||
2216 | 2127 | for (ssize_t i = 0; i < N; ++i) | ||
2217 | 2128 | LoadDoubles(prevOut[i], &out[yStart - (i + 1) * yStep][x]); | ||
2218 | 2129 | } | ||
2219 | 2130 | |||
2220 | 2131 | do | ||
2221 | 2132 | { | ||
2222 | 2133 | __m256d vIn; | ||
2223 | 2134 | LoadDoubles(vIn, &in[y][x]); | ||
2224 | 2135 | __m256d vSum = vIn * vCoefficients[0]; | ||
2225 | 2136 | |||
2226 | 2137 | vSum = prevOut[0] * vCoefficients[1] | ||
2227 | 2138 | + prevOut[1] * vCoefficients[2] | ||
2228 | 2139 | + prevOut[2] * vCoefficients[3] | ||
2229 | 2140 | + vSum; | ||
2230 | 2141 | |||
2231 | 2142 | StoreDoubles(&out[y][x], vSum); | ||
2232 | 2143 | |||
2233 | 2144 | prevOut[2] = prevOut[1]; | ||
2234 | 2145 | prevOut[1] = prevOut[0]; | ||
2235 | 2146 | prevOut[0] = vSum; | ||
2236 | 2147 | y += yStep; | ||
2237 | 2148 | } while (y != yEnd); | ||
2238 | 2149 | } | ||
2239 | 2150 | #endif | ||
2240 | 2151 | { | ||
2241 | 2152 | const ssize_t SIMD_WIDTH = 2; | ||
2242 | 2153 | for (; x < width; x += SIMD_WIDTH) | ||
2243 | 2154 | { | ||
2244 | 2155 | __m128d prevOut[N]; | ||
2245 | 2156 | ssize_t y = yStart; | ||
2246 | 2157 | |||
2247 | 2158 | if (isBorder && !isForwardPass) | ||
2248 | 2159 | { | ||
2249 | 2160 | // condition: yStart must be height - 1 | ||
2250 | 2161 | double u[N + 1][SIMD_WIDTH]; //[x][channels] | ||
2251 | 2162 | for (ssize_t i = 0; i < N + 1; ++i) | ||
2252 | 2163 | { | ||
2253 | 2164 | __m128d temp; | ||
2254 | 2165 | _mm_storeu_pd(u[i], LoadDoubles(temp, &in[yStart + i * yStep][x])); | ||
2255 | 2166 | } | ||
2256 | 2167 | double backwardsInitialState[N][SIMD_WIDTH]; | ||
2257 | 2168 | calcTriggsSdikaInitialization<SIMD_WIDTH>(M, u, &borderValues[x], &borderValues[x], ExtractElement0(vCoefficients[0]), backwardsInitialState); | ||
2258 | 2169 | for (ssize_t i = 0; i < N; ++i) | ||
2259 | 2170 | LoadDoubles(prevOut[i], backwardsInitialState[i]); | ||
2260 | 2171 | |||
2261 | 2172 | StoreDoubles<true>(&out[y][x], prevOut[0], min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores | ||
2262 | 2173 | |||
2263 | 2174 | y += yStep; | ||
2264 | 2175 | if (y == yEnd) | ||
2265 | 2176 | continue; | ||
2266 | 2177 | } | ||
2267 | 2178 | else if (isBorder && isForwardPass) | ||
2268 | 2179 | { | ||
2269 | 2180 | // condition: yStart must be 0 | ||
2270 | 2181 | __m128d firstPixel; | ||
2271 | 2182 | LoadDoubles(firstPixel, &in[0][x]); | ||
2272 | 2183 | for (ssize_t i = 0; i < N; ++i) | ||
2273 | 2184 | prevOut[i] = firstPixel; | ||
2274 | 2185 | } | ||
2275 | 2186 | else | ||
2276 | 2187 | { | ||
2277 | 2188 | for (ssize_t i = 0; i < N; ++i) | ||
2278 | 2189 | LoadDoubles(prevOut[i], &out[yStart - (i + 1) * yStep][x]); | ||
2279 | 2190 | } | ||
2280 | 2191 | |||
2281 | 2192 | do | ||
2282 | 2193 | { | ||
2283 | 2194 | __m128d vIn; | ||
2284 | 2195 | LoadDoubles(vIn, &in[y][x]); | ||
2285 | 2196 | __m128d vSum = vIn * Cast256To128(vCoefficients[0]); | ||
2286 | 2197 | |||
2287 | 2198 | vSum = prevOut[0] * Cast256To128(vCoefficients[1]) | ||
2288 | 2199 | + prevOut[1] * Cast256To128(vCoefficients[2]) | ||
2289 | 2200 | + prevOut[2] * Cast256To128(vCoefficients[3]) | ||
2290 | 2201 | + vSum; | ||
2291 | 2202 | |||
2292 | 2203 | StoreDoubles<true>(&out[y][x], vSum, min(SIMD_WIDTH, width - x)); // todo: specialize loop to avoid partial stores | ||
2293 | 2204 | |||
2294 | 2205 | prevOut[2] = prevOut[1]; | ||
2295 | 2206 | prevOut[1] = prevOut[0]; | ||
2296 | 2207 | prevOut[0] = vSum; | ||
2297 | 2208 | y += yStep; | ||
2298 | 2209 | } while (y != yEnd); | ||
2299 | 2210 | } | ||
2300 | 2211 | } | ||
2301 | 2212 | } | ||
2302 | 2213 | |||
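The StoreFloats<true>/StoreDoubles<true> calls above are the partial stores noted in the todo comments: only the valid lanes of the last vector are written, so a row is never overwritten past its end. A minimal sketch of one way to do this with AVX masked stores (the helper name is illustrative, not part of this patch):

#include <immintrin.h>
#include <sys/types.h>

// store only the first `count` lanes of an 8-float vector (0 <= count <= 8);
// lanes whose mask sign bit is clear are left untouched in memory
static inline void StorePartialFloats_sketch(float *dst, __m256 v, ssize_t count)
{
    __m256 lane = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);
    __m256 enabled = _mm256_cmp_ps(lane, _mm256_set1_ps((float)count), _CMP_LT_OQ);
    _mm256_maskstore_ps(dst, _mm256_castps_si256(enabled), v);
}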
2303 | 2214 | // hack: GCC fails to compile with FORCE_INLINE enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target set by #pragma, creating 2 incompatible functions that can't be inlined | ||
2304 | 2215 | template <bool transposeOut, int channels> | ||
2305 | 2216 | static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint8_t> out, SimpleImage<float> in, ssize_t width, ssize_t height) | ||
2306 | 2217 | { | ||
2307 | 2218 | ssize_t y = 0; | ||
2308 | 2219 | do | ||
2309 | 2220 | { | ||
2310 | 2221 | ssize_t x = 0; | ||
2311 | 2222 | if (channels == 4) | ||
2312 | 2223 | { | ||
2313 | 2224 | #ifdef __AVX2__ | ||
2314 | 2225 | for (; x <= width - 2; x += 2) | ||
2315 | 2226 | { | ||
2316 | 2227 | __m256i vInt = _mm256_cvtps_epi32(_mm256_loadu_ps(&in[y][x * channels])); | ||
2317 | 2228 | __m256i vInt2 = _mm256_permute2x128_si256(vInt, vInt, 1); | ||
2318 | 2229 | |||
2319 | 2230 | __m128i u16 = _mm256_castsi256_si128(_mm256_packus_epi32(vInt, vInt2)); | ||
2320 | 2231 | __m128i vRGBA = _mm_packus_epi16(u16, u16); | ||
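// note: _mm256_packus_epi32 packs within each 128-bit lane, so pairing vInt
// with its lane-swapped copy (vInt2) leaves all 8 u16 values, in order, in
// the low 128 bits extracted above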
2321 | 2232 | |||
2322 | 2233 | if (transposeOut) | ||
2323 | 2234 | { | ||
2324 | 2235 | *(uint32_t *)&out[x][y * channels] = _mm_extract_epi32(vRGBA, 0); | ||
2325 | 2236 | *(uint32_t *)&out[x + 1][y * channels] = _mm_extract_epi32(vRGBA, 1); | ||
2326 | 2237 | } | ||
2327 | 2238 | else | ||
2328 | 2239 | { | ||
2329 | 2240 | _mm_storel_epi64((__m128i *)&out[y][x * channels], vRGBA); | ||
2330 | 2241 | } | ||
2331 | 2242 | } | ||
2332 | 2243 | #endif | ||
2333 | 2244 | while (x < width) | ||
2334 | 2245 | { | ||
2335 | 2246 | __m128 data = _mm_loadu_ps(&in[y][x * channels]); | ||
2336 | 2247 | if (transposeOut) | ||
2337 | 2248 | StoreFloats(&out[x][y * channels], data); | ||
2338 | 2249 | else | ||
2339 | 2250 | StoreFloats(&out[y][x * channels], data); | ||
2340 | 2251 | ++x; | ||
2341 | 2252 | } | ||
2342 | 2253 | } | ||
2343 | 2254 | else if (channels == 1) | ||
2344 | 2255 | { | ||
2345 | 2256 | #ifdef __AVX__ | ||
2346 | 2257 | for (; x <= width - 8; x += 8) | ||
2347 | 2258 | { | ||
2348 | 2259 | StoreFloats(&out[y][x], _mm256_loadu_ps(&in[y][x])); | ||
2349 | 2260 | } | ||
2350 | 2261 | if (x < width) | ||
2351 | 2262 | StoreFloats<true>(&out[y][x], _mm256_loadu_ps(&in[y][x]), width - x); | ||
2352 | 2263 | #else | ||
2353 | 2264 | for (; x <= width - 4; x += 4) | ||
2354 | 2265 | { | ||
2355 | 2266 | StoreFloats(&out[y][x], _mm_loadu_ps(&in[y][x])); | ||
2356 | 2267 | } | ||
2357 | 2268 | if (x < width) | ||
2358 | 2269 | StoreFloats<true>(&out[y][x], _mm_loadu_ps(&in[y][x]), width - x); | ||
2359 | 2270 | #endif | ||
2360 | 2271 | } | ||
2361 | 2272 | ++y; | ||
2362 | 2273 | } while (y < height); | ||
2363 | 2274 | } | ||
2364 | 2275 | |||
2365 | 2276 | // hack: GCC fails to compile with FORCE_INLINE enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target set by #pragma, creating 2 incompatible functions that can't be inlined | ||
2366 | 2277 | template <bool transposeOut, int channels> | ||
2367 | 2278 | static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint16_t> out, SimpleImage<float> in, ssize_t width, ssize_t height) | ||
2368 | 2279 | { | ||
2369 | 2280 | ssize_t y = 0; | ||
2370 | 2281 | do | ||
2371 | 2282 | { | ||
2372 | 2283 | ssize_t x = 0; | ||
2373 | 2284 | if (channels == 4) | ||
2374 | 2285 | { | ||
2375 | 2286 | #ifdef __AVX2__ | ||
2376 | 2287 | for (; x <= width - 2; x += 2) | ||
2377 | 2288 | { | ||
2378 | 2289 | __m256i vInt = _mm256_cvtps_epi32(_mm256_loadu_ps(&in[y][x * channels])); | ||
2379 | 2290 | __m256i vInt2 = _mm256_permute2x128_si256(vInt, vInt, 1); | ||
2380 | 2291 | |||
2381 | 2292 | __m128i vRGBA = _mm256_castsi256_si128(_mm256_packus_epi32(vInt, vInt2)); | ||
2382 | 2293 | |||
2383 | 2294 | if (transposeOut) | ||
2384 | 2295 | { | ||
2385 | 2296 | _mm_storel_epi64((__m128i *)&out[x][y * channels], vRGBA); | ||
2386 | 2297 | _mm_storel_epi64((__m128i *)&out[x + 1][y * channels], _mm_shuffle_epi32(vRGBA, _MM_SHUFFLE(0, 0, 3, 2))); | ||
2387 | 2298 | } | ||
2388 | 2299 | else | ||
2389 | 2300 | { | ||
2390 | 2301 | _mm_storeu_si128((__m128i *)&out[y][x * channels], vRGBA); | ||
2391 | 2302 | } | ||
2392 | 2303 | } | ||
2393 | 2304 | #endif | ||
2394 | 2305 | while (x < width) | ||
2395 | 2306 | { | ||
2396 | 2307 | __m128 data = _mm_loadu_ps(&in[y][x * channels]); | ||
2397 | 2308 | if (transposeOut) | ||
2398 | 2309 | StoreFloats(&out[x][y * channels], data); | ||
2399 | 2310 | else | ||
2400 | 2311 | StoreFloats(&out[y][x * channels], data); | ||
2401 | 2312 | ++x; | ||
2402 | 2313 | } | ||
2403 | 2314 | } | ||
2404 | 2315 | else if (channels == 1) | ||
2405 | 2316 | { | ||
2406 | 2317 | #ifdef __AVX__ | ||
2407 | 2318 | for (; x <= width - 8; x += 8) | ||
2408 | 2319 | { | ||
2409 | 2320 | StoreFloats(&out[y][x], _mm256_loadu_ps(&in[y][x])); | ||
2410 | 2321 | } | ||
2411 | 2322 | if (x < width) | ||
2412 | 2323 | StoreFloats<true>(&out[y][x], _mm256_loadu_ps(&in[y][x]), width - x); | ||
2413 | 2324 | #else | ||
2414 | 2325 | for (; x <= width - 4; x += 4) | ||
2415 | 2326 | { | ||
2416 | 2327 | StoreFloats(&out[y][x], _mm_loadu_ps(&in[y][x])); | ||
2417 | 2328 | } | ||
2418 | 2329 | if (x < width) | ||
2419 | 2330 | StoreFloats<true>(&out[y][x], _mm_loadu_ps(&in[y][x]), width - x); | ||
2420 | 2331 | #endif | ||
2421 | 2332 | } | ||
2422 | 2333 | ++y; | ||
2423 | 2334 | } while (y < height); | ||
2424 | 2335 | } | ||
2425 | 2336 | |||
2426 | 2337 | // hack: GCC fails to compile with FORCE_INLINE enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target set by #pragma, creating 2 incompatible functions that can't be inlined | ||
2427 | 2338 | template <bool transposeOut, int channels> | ||
2428 | 2339 | static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint16_t> out, SimpleImage<double> in, ssize_t width, ssize_t height) | ||
2429 | 2340 | { | ||
2430 | 2341 | ssize_t y = 0; | ||
2431 | 2342 | do | ||
2432 | 2343 | { | ||
2433 | 2344 | ssize_t x = 0; | ||
2434 | 2345 | if (channels == 4) | ||
2435 | 2346 | { | ||
2436 | 2347 | for ( ; x < width; x++) | ||
2437 | 2348 | { | ||
2438 | 2349 | #ifdef __AVX__ | ||
2439 | 2350 | __m128i i32 = _mm256_cvtpd_epi32(_mm256_loadu_pd(&in[y][x * channels])); | ||
2440 | 2351 | #else | ||
2441 | 2352 | __m128d in0 = _mm_load_pd(&in[y][x * channels]), | ||
2442 | 2353 | in1 = _mm_load_pd(&in[y][x * channels + 2]); | ||
2443 | 2354 | __m128i i32 = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(_mm_cvtpd_epi32(in0)), _mm_castsi128_ps(_mm_cvtpd_epi32(in1)), _MM_SHUFFLE(1, 0, 1, 0))); | ||
2444 | 2355 | #endif | ||
2445 | 2356 | #ifdef __SSE4_1__ | ||
2446 | 2357 | __m128i vRGBA = _mm_packus_epi32(i32, i32); | ||
2447 | 2358 | #else | ||
2448 | 2359 | __m128i vRGBA = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()); // hack: can get away with i16 for now | ||
2449 | 2360 | #endif | ||
2450 | 2361 | |||
2451 | 2362 | if (transposeOut) | ||
2452 | 2363 | { | ||
2453 | 2364 | _mm_storel_epi64((__m128i *)&out[x][y * channels], vRGBA); | ||
2454 | 2365 | } | ||
2455 | 2366 | else | ||
2456 | 2367 | { | ||
2457 | 2368 | _mm_storel_epi64((__m128i *)&out[y][x * channels], vRGBA); | ||
2458 | 2369 | } | ||
2459 | 2370 | } | ||
2460 | 2371 | } | ||
2461 | 2372 | else if (channels == 1) | ||
2462 | 2373 | { | ||
2463 | 2374 | #ifdef __AVX__ | ||
2464 | 2375 | for (; x <= width - 4; x += 4) | ||
2465 | 2376 | { | ||
2466 | 2377 | StoreDoubles(&out[y][x], _mm256_loadu_pd(&in[y][x])); | ||
2467 | 2378 | } | ||
2468 | 2379 | if (x < width) | ||
2469 | 2380 | { | ||
2470 | 2381 | StoreDoubles<true>(&out[y][x], _mm256_loadu_pd(&in[y][x]), width - x); | ||
2471 | 2382 | } | ||
2472 | 2383 | #else | ||
2473 | 2384 | for (; x <= width - 2; x += 2) | ||
2474 | 2385 | { | ||
2475 | 2386 | StoreDoubles(&out[y][x], _mm_loadu_pd(&in[y][x])); | ||
2476 | 2387 | } | ||
2477 | 2388 | if (x < width) | ||
2478 | 2389 | { | ||
2479 | 2390 | StoreDoubles<true>(&out[y][x], _mm_loadu_pd(&in[y][x]), width - x); | ||
2480 | 2391 | } | ||
2481 | 2392 | #endif | ||
2482 | 2393 | } | ||
2483 | 2394 | ++y; | ||
2484 | 2395 | } while (y < height); | ||
2485 | 2396 | } | ||
2486 | 2397 | |||
2487 | 2398 | template <bool transposeOut, int channels> | ||
2488 | 2399 | FORCE_INLINE void Copy2D(SimpleImage<float> out, SimpleImage<double> in, ssize_t width, ssize_t height) | ||
2489 | 2400 | { | ||
2490 | 2401 | ssize_t y = 0; | ||
2491 | 2402 | do | ||
2492 | 2403 | { | ||
2493 | 2404 | if (channels == 4) | ||
2494 | 2405 | { | ||
2495 | 2406 | ssize_t x = 0; | ||
2496 | 2407 | do | ||
2497 | 2408 | { | ||
2498 | 2409 | #ifdef __AVX__ | ||
2499 | 2410 | __m128 v4f_data = _mm256_cvtpd_ps(_mm256_loadu_pd(&in[y][x * channels])); | ||
2500 | 2411 | #else | ||
2501 | 2412 | __m128 v4f_data = _mm_shuffle_ps(_mm_cvtpd_ps(_mm_loadu_pd(&in[y][x * channels])), | ||
2502 | 2413 | _mm_cvtpd_ps(_mm_loadu_pd(&in[y][x * channels + 2])), _MM_SHUFFLE(1, 0, 1, 0)); | ||
2503 | 2414 | #endif | ||
2504 | 2415 | if (transposeOut) | ||
2505 | 2416 | _mm_store_ps(&out[x][y * channels], v4f_data); | ||
2506 | 2417 | else | ||
2507 | 2418 | _mm_store_ps(&out[y][x * channels], v4f_data); | ||
2508 | 2419 | ++x; | ||
2509 | 2420 | } while (x < width); | ||
2510 | 2421 | } | ||
2511 | 2422 | else | ||
2512 | 2423 | { | ||
2513 | 2424 | // 1 channel | ||
2514 | 2425 | ssize_t x; | ||
2515 | 2426 | #ifdef __AVX__ | ||
2516 | 2427 | for (x = 0; x <= width - 4; x += 4) | ||
2517 | 2428 | StoreDoubles(&out[y][x], _mm256_loadu_pd(&in[y][x])); | ||
2518 | 2429 | if (x < width) | ||
2519 | 2430 | StoreDoubles<true>(&out[y][x], _mm256_loadu_pd(&in[y][x]), width - x); | ||
2520 | 2431 | #else | ||
2521 | 2432 | for (x = 0; x <= width - 2; x += 2) | ||
2522 | 2433 | StoreDoubles(&out[y][x], _mm_loadu_pd(&in[y][x])); | ||
2523 | 2434 | if (x < width) | ||
2524 | 2435 | StoreDoubles<true>(&out[y][x], _mm_loadu_pd(&in[y][x]), width - x); | ||
2525 | 2436 | #endif | ||
2526 | 2437 | } | ||
2527 | 2438 | ++y; | ||
2528 | 2439 | } while (y < height); | ||
2529 | 2440 | } | ||
2530 | 2441 | |||
2531 | 2442 | // hack: GCC fails to compile with FORCE_INLINE enabled, most likely because OpenMP generates code for the default target (SSE2 only) instead of the target set by #pragma, creating 2 incompatible functions that can't be inlined | ||
2532 | 2443 | template <bool transposeOut, int channels> | ||
2533 | 2444 | static /*FORCE_INLINE*/ void Copy2D(SimpleImage<uint8_t> out, SimpleImage<double> in, ssize_t width, ssize_t height) | ||
2534 | 2445 | { | ||
2535 | 2446 | ssize_t y = 0; | ||
2536 | 2447 | do | ||
2537 | 2448 | { | ||
2538 | 2449 | if (channels == 4) | ||
2539 | 2450 | { | ||
2540 | 2451 | ssize_t x = 0; | ||
2541 | 2452 | do | ||
2542 | 2453 | { | ||
2543 | 2454 | #ifdef __AVX__ | ||
2544 | 2455 | __m256d _in = _mm256_load_pd(&in[y][x * channels]); | ||
2545 | 2456 | __m128i i32 = _mm256_cvtpd_epi32(_in), | ||
2546 | 2457 | #else | ||
2547 | 2458 | __m128d in0 = _mm_load_pd(&in[y][x * channels]), | ||
2548 | 2459 | in1 = _mm_load_pd(&in[y][x * channels + 2]); | ||
2549 | 2460 | __m128i i32 = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(_mm_cvtpd_epi32(in0)), _mm_castsi128_ps(_mm_cvtpd_epi32(in1)), _MM_SHUFFLE(1, 0, 1, 0))), | ||
2550 | 2461 | #endif | ||
2551 | 2462 | #ifdef __SSE4_1__ | ||
2552 | 2463 | u16 = _mm_packus_epi32(i32, i32), | ||
2553 | 2464 | #else | ||
2554 | 2465 | u16 = _mm_max_epi16(_mm_packs_epi32(i32, i32), _mm_setzero_si128()), | ||
2555 | 2466 | #endif | ||
2556 | 2467 | u8 = _mm_packus_epi16(u16, u16); | ||
2557 | 2468 | if (transposeOut) | ||
2558 | 2469 | *(int32_t *)&out[x][y * channels] = _mm_cvtsi128_si32(u8); | ||
2559 | 2470 | else | ||
2560 | 2471 | *(int32_t *)&out[y][x * channels] = _mm_cvtsi128_si32(u8); | ||
2561 | 2472 | ++x; | ||
2562 | 2473 | } while (x < width); | ||
2563 | 2474 | } | ||
2564 | 2475 | else | ||
2565 | 2476 | { | ||
2566 | 2477 | // 1 channel | ||
2567 | 2478 | ssize_t x; | ||
2568 | 2479 | #ifdef __AVX__ | ||
2569 | 2480 | for (x = 0; x <= width - 4; x += 4) | ||
2570 | 2481 | StoreDoubles(&out[y][x], _mm256_load_pd(&in[y][x * channels])); | ||
2571 | 2482 | if (x < width) | ||
2572 | 2483 | StoreDoubles<true>(&out[y][x], _mm256_load_pd(&in[y][x * channels]), width - x); | ||
2573 | 2484 | #else | ||
2574 | 2485 | for (x = 0; x <= width - 2; x += 2) | ||
2575 | 2486 | StoreDoubles(&out[y][x], _mm_load_pd(&in[y][x * channels])); | ||
2576 | 2487 | if (x < width) | ||
2577 | 2488 | StoreDoubles<true>(&out[y][x], _mm_load_pd(&in[y][x * channels]), width - x); | ||
2578 | 2489 | #endif | ||
2579 | 2490 | } | ||
2580 | 2491 | ++y; | ||
2581 | 2492 | } while (y < height); | ||
2582 | 2493 | } | ||
2583 | 2494 | |||
2584 | 2495 | #if 0 // comment this function out to ensure everything is vectorized | ||
2585 | 2496 | template <bool transposeOut, int channels, typename OutType, typename InType> | ||
2586 | 2497 | FORCE_INLINE void Copy2D(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height) | ||
2587 | 2498 | { | ||
2588 | 2499 | ssize_t y = 0; | ||
2589 | 2500 | do | ||
2590 | 2501 | { | ||
2591 | 2502 | ssize_t x = 0; | ||
2592 | 2503 | do | ||
2593 | 2504 | { | ||
2594 | 2505 | ssize_t c = 0; | ||
2595 | 2506 | do | ||
2596 | 2507 | { | ||
2597 | 2508 | if (transposeOut) | ||
2598 | 2509 | out[x][y * channels + c] = clip_round_cast<OutType, InType>(in[y][x * channels + c]); | ||
2599 | 2510 | else | ||
2600 | 2511 | out[y][x * channels + c] = clip_round_cast<OutType, InType>(in[y][x * channels + c]); | ||
2601 | 2512 | ++c; | ||
2602 | 2513 | } while (c < channels); | ||
2603 | 2514 | ++x; | ||
2604 | 2515 | } while (x < width); | ||
2605 | 2516 | ++y; | ||
2606 | 2517 | } while (y < height); | ||
2607 | 2518 | } | ||
2608 | 2519 | #endif | ||
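The disabled generic Copy2D above leans on clip_round_cast to convert the float/double intermediate back to fixed point. A scalar sketch of the assumed semantics (the real helper is defined elsewhere in this file; the _sketch name is illustrative):

#include <algorithm>
#include <cmath>
#include <limits>

// round to nearest, then clamp into the output type's range,
// e.g. float -> uint8_t saturates to [0, 255]
template <typename OutType, typename InType>
static inline OutType clip_round_cast_sketch(InType v)
{
    InType rounded = std::round(v);
    InType lo = (InType)std::numeric_limits<OutType>::min();
    InType hi = (InType)std::numeric_limits<OutType>::max();
    return (OutType)std::min(hi, std::max(lo, rounded));
}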
2609 | 2520 | |||
2610 | 2521 | template <typename AnyType> | ||
2611 | 2522 | void StoreFloatsTransposed(SimpleImage<AnyType> out, __m128 x) | ||
2612 | 2523 | { | ||
2613 | 2524 | out[0][0] = _mm_cvtss_f32(x); | ||
2614 | 2525 | out[1][0] = _mm_cvtss_f32(_mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 0, 0, 1))); | ||
2615 | 2526 | out[2][0] = _mm_cvtss_f32(_mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 0, 0, 2))); | ||
2616 | 2527 | out[3][0] = _mm_cvtss_f32(_mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 0, 0, 3))); | ||
2617 | 2528 | } | ||
2618 | 2529 | |||
2619 | 2530 | // input & output are color interleaved | ||
2620 | 2531 | template <bool transposeOut, int channels, typename IntermediateType, typename OutType, typename InType> | ||
2621 | 2532 | void ConvolveHorizontal(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput = false) | ||
2622 | 2533 | { | ||
2623 | 2534 | const bool convertOutput = typeid(OutType) != typeid(IntermediateType); | ||
2624 | 2535 | |||
2625 | 2536 | typedef typename MyTraits<IntermediateType>::SIMDtype SIMDtype; | ||
2626 | 2537 | |||
2627 | 2538 | double bf[N]; | ||
2628 | 2539 | double M[N*N]; // matrix used for initialization procedure (has to be double) | ||
2629 | 2540 | double b[N + 1]; | ||
2630 | 2541 | |||
2631 | 2542 | calcFilter(sigmaX, bf); | ||
2632 | 2543 | |||
2633 | 2544 | for (size_t i = 0; i<N; i++) | ||
2634 | 2545 | bf[i] = -bf[i]; | ||
2635 | 2546 | |||
2636 | 2547 | b[0] = 1; // b[0] == alpha (scaling coefficient) | ||
2637 | 2548 | |||
2638 | 2549 | for (size_t i = 0; i<N; i++) | ||
2639 | 2550 | { | ||
2640 | 2551 | b[i + 1] = bf[i]; | ||
2641 | 2552 | b[0] -= b[i + 1]; | ||
2642 | 2553 | } | ||
2643 | 2554 | |||
2644 | 2555 | calcTriggsSdikaM(bf, M); // Compute initialization matrix | ||
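// For reference, with these coefficients the forward pass evaluates the order-3
// recurrence out[n] = b[0]*in[n] + b[1]*out[n-1] + b[2]*out[n-2] + b[3]*out[n-3]
// from left to right, and the backward pass runs the same recurrence from right
// to left over the forward result, with M supplying the Triggs-Sdika border state.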
2645 | 2556 | |||
2646 | 2557 | SIMDtype *vCoefficients; | ||
2647 | 2558 | |||
2648 | 2559 | if (channels == 4) | ||
2649 | 2560 | { | ||
2650 | 2561 | vCoefficients = (SIMDtype *)ALIGNED_ALLOCA(sizeof(SIMDtype) * (N + 1), sizeof(SIMDtype)); | ||
2651 | 2562 | for (ssize_t i = 0; i <= N; ++i) | ||
2652 | 2563 | BroadcastSIMD(vCoefficients[i], (IntermediateType)b[i]); | ||
2653 | 2564 | } | ||
2654 | 2565 | else | ||
2655 | 2566 | { | ||
2656 | 2567 | if (typeid(IntermediateType) == typeid(double)) | ||
2657 | 2568 | { | ||
2658 | 2569 | #ifdef __AVX2__ | ||
2659 | 2570 | __m256d *_vCoefficients = (__m256d *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 2 * (N + 1), sizeof(SIMDtype)); | ||
2660 | 2571 | vCoefficients = (SIMDtype *)_vCoefficients; | ||
2661 | 2572 | |||
2662 | 2573 | __m256d temp = _mm256_loadu_pd(b); | ||
2663 | 2574 | _vCoefficients[0] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(1, 2, 3, 0)); | ||
2664 | 2575 | _vCoefficients[1] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(2, 3, 0, 1)); | ||
2665 | 2576 | _vCoefficients[2] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(3, 0, 1, 2)); | ||
2666 | 2577 | _vCoefficients[3] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(0, 1, 2, 3)); | ||
2667 | 2578 | |||
2668 | 2579 | // permutations for backward pass | ||
2669 | 2580 | _vCoefficients[4] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(0, 3, 2, 1)); | ||
2670 | 2581 | _vCoefficients[5] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(1, 0, 3, 2)); | ||
2671 | 2582 | _vCoefficients[6] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(2, 1, 0, 3)); | ||
2672 | 2583 | _vCoefficients[7] = _mm256_permute4x64_pd(temp, _MM_SHUFFLE(3, 2, 1, 0)); | ||
2673 | 2584 | #else | ||
2674 | 2585 | double *coefficients = (double *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 4 * (N + 1), sizeof(SIMDtype)); | ||
2675 | 2586 | vCoefficients = (SIMDtype *)coefficients; | ||
2676 | 2587 | |||
2677 | 2588 | coefficients[0] = b[0]; | ||
2678 | 2589 | coefficients[1] = b[3]; | ||
2679 | 2590 | coefficients[2] = b[2]; | ||
2680 | 2591 | coefficients[3] = b[1]; | ||
2681 | 2592 | |||
2682 | 2593 | coefficients[4] = b[1]; | ||
2683 | 2594 | coefficients[5] = b[0]; | ||
2684 | 2595 | coefficients[6] = b[3]; | ||
2685 | 2596 | coefficients[7] = b[2]; | ||
2686 | 2597 | |||
2687 | 2598 | coefficients[8] = b[2]; | ||
2688 | 2599 | coefficients[9] = b[1]; | ||
2689 | 2600 | coefficients[10] = b[0]; | ||
2690 | 2601 | coefficients[11] = b[3]; | ||
2691 | 2602 | |||
2692 | 2603 | coefficients[12] = b[3]; | ||
2693 | 2604 | coefficients[13] = b[2]; | ||
2694 | 2605 | coefficients[14] = b[1]; | ||
2695 | 2606 | coefficients[15] = b[0]; | ||
2696 | 2607 | |||
2697 | 2608 | // permutations for backward pass | ||
2698 | 2609 | coefficients[16] = b[1]; | ||
2699 | 2610 | coefficients[17] = b[2]; | ||
2700 | 2611 | coefficients[18] = b[3]; | ||
2701 | 2612 | coefficients[19] = b[0]; | ||
2702 | 2613 | |||
2703 | 2614 | coefficients[20] = b[0]; | ||
2704 | 2615 | coefficients[21] = b[1]; | ||
2705 | 2616 | coefficients[22] = b[2]; | ||
2706 | 2617 | coefficients[23] = b[3]; | ||
2707 | 2618 | |||
2708 | 2619 | coefficients[24] = b[3]; | ||
2709 | 2620 | coefficients[25] = b[0]; | ||
2710 | 2621 | coefficients[26] = b[1]; | ||
2711 | 2622 | coefficients[27] = b[2]; | ||
2712 | 2623 | |||
2713 | 2624 | coefficients[28] = b[2]; | ||
2714 | 2625 | coefficients[29] = b[3]; | ||
2715 | 2626 | coefficients[30] = b[0]; | ||
2716 | 2627 | coefficients[31] = b[1]; | ||
2717 | 2628 | #endif | ||
2718 | 2629 | } | ||
2719 | 2630 | else | ||
2720 | 2631 | { | ||
2721 | 2632 | #ifdef __AVX__ | ||
2722 | 2633 | __m256 *_vCoefficients = (__m256 *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 2 * (N + 1), sizeof(SIMDtype)); | ||
2723 | 2634 | vCoefficients = (SIMDtype *)_vCoefficients; | ||
2724 | 2635 | |||
2725 | 2636 | __m256 temp = _mm256_castps128_ps256(_mm256_cvtpd_ps(_mm256_loadu_pd(b))); | ||
2726 | 2637 | temp = _mm256_permute2f128_ps(temp, temp, 0); | ||
2727 | 2638 | _vCoefficients[0] = _mm256_permute_ps(temp, _MM_SHUFFLE(1, 2, 3, 0)); | ||
2728 | 2639 | _vCoefficients[1] = _mm256_permute_ps(temp, _MM_SHUFFLE(2, 3, 0, 1)); | ||
2729 | 2640 | _vCoefficients[2] = _mm256_permute_ps(temp, _MM_SHUFFLE(3, 0, 1, 2)); | ||
2730 | 2641 | _vCoefficients[3] = _mm256_permute_ps(temp, _MM_SHUFFLE(0, 1, 2, 3)); | ||
2731 | 2642 | |||
2732 | 2643 | // permutations for backward pass | ||
2733 | 2644 | _vCoefficients[4] = _mm256_permute_ps(temp, _MM_SHUFFLE(0, 3, 2, 1)); | ||
2734 | 2645 | _vCoefficients[5] = _mm256_permute_ps(temp, _MM_SHUFFLE(1, 0, 3, 2)); | ||
2735 | 2646 | _vCoefficients[6] = _mm256_permute_ps(temp, _MM_SHUFFLE(2, 1, 0, 3)); | ||
2736 | 2647 | _vCoefficients[7] = _mm256_permute_ps(temp, _MM_SHUFFLE(3, 2, 1, 0)); | ||
2737 | 2648 | #else | ||
2738 | 2649 | __m128 *_vCoefficients = (__m128 *)ALIGNED_ALLOCA(sizeof(SIMDtype) * 2 * (N + 1), sizeof(SIMDtype)); | ||
2739 | 2650 | vCoefficients = (SIMDtype *)_vCoefficients; | ||
2740 | 2651 | |||
2741 | 2652 | __m128 temp = _mm_shuffle_ps | ||
2742 | 2653 | ( | ||
2743 | 2654 | _mm_cvtpd_ps(_mm_loadu_pd(b)), | ||
2744 | 2655 | _mm_cvtpd_ps(_mm_loadu_pd(&b[2])), | ||
2745 | 2656 | _MM_SHUFFLE(1, 0, 1, 0) | ||
2746 | 2657 | ); | ||
2747 | 2658 | _vCoefficients[0] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(1, 2, 3, 0)); | ||
2748 | 2659 | _vCoefficients[1] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(2, 3, 0, 1)); | ||
2749 | 2660 | _vCoefficients[2] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(3, 0, 1, 2)); | ||
2750 | 2661 | _vCoefficients[3] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(0, 1, 2, 3)); | ||
2751 | 2662 | |||
2752 | 2663 | // permutations for backward pass | ||
2753 | 2664 | _vCoefficients[4] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(0, 3, 2, 1)); | ||
2754 | 2665 | _vCoefficients[5] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(1, 0, 3, 2)); | ||
2755 | 2666 | _vCoefficients[6] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(2, 1, 0, 3)); | ||
2756 | 2667 | _vCoefficients[7] = _mm_shuffle_ps(temp, temp, _MM_SHUFFLE(3, 2, 1, 0)); | ||
2757 | 2668 | #endif | ||
2758 | 2669 | } | ||
2759 | 2670 | } | ||
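// note: the 8 permuted copies above implement a rotating scheme for the
// single-channel path: the last three outputs stay in one SIMD register, and
// each of 4 consecutive pixels applies the same coefficients in a rotated lane
// order (copies 4-7 are the reversed rotations used by the backward pass).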
2760 | 2671 | const ssize_t Y_BLOCK_SIZE = 8; | ||
2761 | 2672 | |||
2762 | 2673 | // X_BLOCK_SIZE * channels * sizeof(InType) had better be SIMD aligned | ||
2763 | 2674 | const ssize_t X_BLOCK_SIZE = transposeOut ? 8 | ||
2764 | 2675 | : INT32_MAX / 2; | ||
2765 | 2676 | |||
2766 | 2677 | #pragma omp parallel | ||
2767 | 2678 | { | ||
2768 | 2679 | AlignedImage<IntermediateType, sizeof(SIMDtype)> forwardFilteredTemp; // TODO: can directly output to output buffer if !transposeOut & output type is float | ||
2769 | 2680 | SimpleImage<IntermediateType> forwardFiltered; | ||
2770 | 2681 | if (!convertOutput && !transposeOut) | ||
2771 | 2682 | { | ||
2772 | 2683 | forwardFiltered = SimpleImage<IntermediateType>((IntermediateType *)out.buffer, out.pitch); | ||
2773 | 2684 | } | ||
2774 | 2685 | /* | ||
2775 | 2686 | else if (canOverwriteInput && typeid(InType) == typeid(IntermediateType)) | ||
2776 | 2687 | { | ||
2777 | 2688 | forwardFiltered = SimpleImage<IntermediateType>((IntermediateType *)in.buffer, in.pitch); | ||
2778 | 2689 | }*/ | ||
2779 | 2690 | else | ||
2780 | 2691 | { | ||
2781 | 2692 | forwardFilteredTemp.Resize(width * channels, Y_BLOCK_SIZE); | ||
2782 | 2693 | forwardFiltered = forwardFilteredTemp; | ||
2783 | 2694 | } | ||
2784 | 2695 | |||
2785 | 2696 | IntermediateType *borderValues = (IntermediateType *)alloca(channels * Y_BLOCK_SIZE * sizeof(IntermediateType)); | ||
2786 | 2697 | |||
2787 | 2698 | #pragma omp for | ||
2788 | 2699 | for (ssize_t y0 = 0; y0 < height; y0 += Y_BLOCK_SIZE) | ||
2789 | 2700 | { | ||
2790 | 2701 | ssize_t x = 0; | ||
2791 | 2702 | ssize_t yBlockSize = min(height - y0, Y_BLOCK_SIZE); | ||
2792 | 2703 | |||
2793 | 2704 | ssize_t i = 0; | ||
2794 | 2705 | do | ||
2795 | 2706 | { | ||
2796 | 2707 | ssize_t color = 0; | ||
2797 | 2708 | do | ||
2798 | 2709 | { | ||
2799 | 2710 | borderValues[i * channels + color] = in[y0 + i][(width - 1) * channels + color]; | ||
2800 | 2711 | ++color; | ||
2801 | 2712 | } while (color < channels); | ||
2802 | 2713 | ++i; | ||
2803 | 2714 | } while (i < yBlockSize); | ||
2804 | 2715 | |||
2805 | 2716 | ssize_t xBlockSize = min(max(X_BLOCK_SIZE, ssize_t(N)), width); // try to process at least X_BLOCK_SIZE pixels, or else later data won't be SIMD aligned | ||
2806 | 2717 | // convolve pixels[0:FILTER_SIZE - 1] | ||
2807 | 2718 | Convolve1DHorizontal<false, true, true, channels>(forwardFiltered, | ||
2808 | 2719 | in.SubImage(0, y0), | ||
2809 | 2720 | (IntermediateType *)NULL, | ||
2810 | 2721 | x, x + xBlockSize, width, yBlockSize, | ||
2811 | 2722 | vCoefficients, | ||
2812 | 2723 | M); | ||
2813 | 2724 | |||
2814 | 2725 | x += xBlockSize; | ||
2815 | 2726 | while (x < width) | ||
2816 | 2727 | { | ||
2817 | 2728 | xBlockSize = min(width - x, X_BLOCK_SIZE); | ||
2818 | 2729 | |||
2819 | 2730 | Convolve1DHorizontal<false, true, false, channels>(forwardFiltered, | ||
2820 | 2731 | in.SubImage(0, y0), | ||
2821 | 2732 | (IntermediateType *)NULL, | ||
2822 | 2733 | x, x + xBlockSize, width, yBlockSize, | ||
2823 | 2734 | vCoefficients, | ||
2824 | 2735 | M); | ||
2825 | 2736 | x += xBlockSize; | ||
2826 | 2737 | } | ||
2827 | 2738 | |||
2828 | 2739 | //--------------- backward pass-------------------------- | ||
2829 | 2740 | SimpleImage<IntermediateType> floatOut; | ||
2830 | 2741 | // if output type is fixed point, we still compute an intermediate result as float for better precision | ||
2831 | 2742 | if (convertOutput) | ||
2832 | 2743 | { | ||
2833 | 2744 | floatOut = forwardFiltered; | ||
2834 | 2745 | } | ||
2835 | 2746 | else | ||
2836 | 2747 | { | ||
2837 | 2748 | floatOut = SimpleImage<IntermediateType>((IntermediateType *)&out[transposeOut ? 0 : y0][(transposeOut ? y0 : 0) * channels], out.pitch); | ||
2838 | 2749 | } | ||
2839 | 2750 | x = width - 1; | ||
2840 | 2751 | |||
2841 | 2752 | ssize_t lastAligned = RoundDown(width, X_BLOCK_SIZE); | ||
2842 | 2753 | // todo: check whether this is really vector aligned | ||
2843 | 2754 | xBlockSize = min(max(width - lastAligned, ssize_t(N)), width); // try to process more than N pixels so that later data is SIMD aligned | ||
2844 | 2755 | |||
2845 | 2756 | // in-place operation (use forwardFiltered as both input & output) is possible due to internal register buffering | ||
2846 | 2757 | if (transposeOut && !convertOutput) | ||
2847 | 2758 | Convolve1DHorizontal<true, false, true, channels>(floatOut, | ||
2848 | 2759 | forwardFiltered, | ||
2849 | 2760 | borderValues, | ||
2850 | 2761 | x, x - xBlockSize, width, yBlockSize, | ||
2851 | 2762 | vCoefficients, | ||
2852 | 2763 | M); | ||
2853 | 2764 | else | ||
2854 | 2765 | Convolve1DHorizontal<false, false, true, channels>(floatOut, | ||
2855 | 2766 | forwardFiltered, | ||
2856 | 2767 | borderValues, | ||
2857 | 2768 | x, x - xBlockSize, width, yBlockSize, | ||
2858 | 2769 | vCoefficients, | ||
2859 | 2770 | M); | ||
2860 | 2771 | |||
2861 | 2772 | if (convertOutput) | ||
2862 | 2773 | { | ||
2863 | 2774 | ssize_t outCornerX = x + 1 - xBlockSize; | ||
2864 | 2775 | Copy2D<transposeOut, channels>(out.SubImage((transposeOut ? y0 : outCornerX) * channels, transposeOut ? outCornerX : y0), floatOut.SubImage(outCornerX * channels, 0), xBlockSize, yBlockSize); | ||
2865 | 2776 | } | ||
2866 | 2777 | x -= xBlockSize; | ||
2867 | 2778 | while (x >= 0) | ||
2868 | 2779 | { | ||
2869 | 2780 | xBlockSize = min(X_BLOCK_SIZE, x + 1); | ||
2870 | 2781 | |||
2871 | 2782 | if (transposeOut && !convertOutput) | ||
2872 | 2783 | Convolve1DHorizontal<true, false, false, channels>(floatOut, | ||
2873 | 2784 | forwardFiltered, | ||
2874 | 2785 | borderValues, | ||
2875 | 2786 | x, x - xBlockSize, width, yBlockSize, | ||
2876 | 2787 | vCoefficients, | ||
2877 | 2788 | M); | ||
2878 | 2789 | else | ||
2879 | 2790 | Convolve1DHorizontal<false, false, false, channels>(floatOut, | ||
2880 | 2791 | forwardFiltered, | ||
2881 | 2792 | borderValues, | ||
2882 | 2793 | x, x - xBlockSize, width, yBlockSize, | ||
2883 | 2794 | vCoefficients, | ||
2884 | 2795 | M); | ||
2885 | 2796 | |||
2886 | 2797 | if (convertOutput) | ||
2887 | 2798 | { | ||
2888 | 2799 | ssize_t outCornerX = x + 1 - xBlockSize; | ||
2889 | 2800 | Copy2D<transposeOut, channels>(out.SubImage((transposeOut ? y0 : outCornerX) * channels, transposeOut ? outCornerX : y0), floatOut.SubImage(outCornerX * channels, 0), xBlockSize, yBlockSize); | ||
2890 | 2801 | } | ||
2891 | 2802 | x -= xBlockSize; | ||
2892 | 2803 | } | ||
2893 | 2804 | } | ||
2894 | 2805 | } | ||
2895 | 2806 | } | ||
2896 | 2807 | |||
2897 | 2808 | template <bool transposeOut, int channels, typename OutType, typename InType> | ||
2898 | 2809 | void ConvolveHorizontal(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput = false) | ||
2899 | 2810 | { | ||
2900 | 2811 | if (sigmaX > MAX_SIZE_FOR_SINGLE_PRECISION) | ||
2901 | 2812 | ConvolveHorizontal<transposeOut, channels, double, OutType, InType>(out, in, width, height, sigmaX, canOverwriteInput); | ||
2902 | 2813 | else | ||
2903 | 2814 | ConvolveHorizontal<transposeOut, channels, float, OutType, InType>(out, in, width, height, sigmaX, canOverwriteInput); | ||
2904 | 2815 | } | ||
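(Both this wrapper and the vertical one further below pick the intermediate precision from sigma, presumably because the IIR feedback coefficients approach 1 for large radii, where a single-precision recurrence loses accuracy; blurs beyond MAX_SIZE_FOR_SINGLE_PRECISION therefore take the double path.)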
2905 | 2816 | |||
2906 | 2817 | // handles blocking | ||
2907 | 2818 | // input & output are color interleaved | ||
2908 | 2819 | template <typename IntermediateType, typename OutType, typename InType> | ||
2909 | 2820 | void ConvolveVertical(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, double sigmaY) | ||
2910 | 2821 | { | ||
2911 | 2822 | const bool convertOutput = typeid(OutType) != typeid(IntermediateType); | ||
2912 | 2823 | |||
2913 | 2824 | const ssize_t Y_BLOCK_SIZE = 8, | ||
2914 | 2825 | X_BLOCK_SIZE = 40; // must be multiple of SIMD width or else say goodbye to throughput | ||
2915 | 2826 | |||
2916 | 2827 | typedef typename MyTraits<IntermediateType>::SIMDtype SIMDtype; | ||
2917 | 2828 | |||
2918 | 2829 | double bf[N]; | ||
2919 | 2830 | double M[N*N]; // matrix used for initialization procedure (has to be double) | ||
2920 | 2831 | double b[N + 1]; | ||
2921 | 2832 | |||
2922 | 2833 | calcFilter(sigmaY, bf); | ||
2923 | 2834 | |||
2924 | 2835 | for (size_t i = 0; i<N; i++) | ||
2925 | 2836 | bf[i] = -bf[i]; | ||
2926 | 2837 | |||
2927 | 2838 | b[0] = 1; // b[0] == alpha (scaling coefficient) | ||
2928 | 2839 | |||
2929 | 2840 | for (size_t i = 0; i<N; i++) | ||
2930 | 2841 | { | ||
2931 | 2842 | b[i + 1] = bf[i]; | ||
2932 | 2843 | b[0] -= b[i + 1]; | ||
2933 | 2844 | } | ||
2934 | 2845 | b[3] = 1 - (b[0] + b[1] + b[2]); | ||
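// (renormalizing so b[0..3] sum to exactly 1 gives the filter unit DC gain,
// so constant regions pass through unchanged despite rounding in calcFilter)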
2935 | 2846 | // Compute initialization matrix | ||
2936 | 2847 | calcTriggsSdikaM(bf, M); | ||
2937 | 2848 | |||
2938 | 2849 | SIMDtype *vCoefficients = (SIMDtype *)ALIGNED_ALLOCA(sizeof(SIMDtype) * (N + 1), sizeof(SIMDtype)); | ||
2939 | 2850 | |||
2940 | 2851 | for (ssize_t i = 0; i <= N; ++i) | ||
2941 | 2852 | { | ||
2942 | 2853 | BroadcastSIMD(vCoefficients[i], (IntermediateType)b[i]); | ||
2943 | 2854 | } | ||
2944 | 2855 | |||
2945 | 2856 | #pragma omp parallel | ||
2946 | 2857 | { | ||
2947 | 2858 | AlignedImage<IntermediateType, sizeof(SIMDtype)> forwardFiltered; // TODO: can directly output to output buffer if output type is float | ||
2948 | 2859 | forwardFiltered.Resize(X_BLOCK_SIZE, height); | ||
2949 | 2860 | |||
2950 | 2861 | IntermediateType *borderValues = (IntermediateType *)alloca(X_BLOCK_SIZE * sizeof(IntermediateType)); | ||
2951 | 2862 | |||
2952 | 2863 | #pragma omp for | ||
2953 | 2864 | for (ssize_t x0 = 0; x0 < width; x0 += X_BLOCK_SIZE) | ||
2954 | 2865 | { | ||
2955 | 2866 | ssize_t y = 0; | ||
2956 | 2867 | ssize_t xBlockSize = min(width - x0, X_BLOCK_SIZE); | ||
2957 | 2868 | |||
2958 | 2869 | ssize_t i = 0; | ||
2959 | 2870 | do | ||
2960 | 2871 | { | ||
2961 | 2872 | borderValues[i] = in[height - 1][x0 + i]; | ||
2962 | 2873 | ++i; | ||
2963 | 2874 | } while (i < xBlockSize); | ||
2964 | 2875 | |||
2965 | 2876 | ssize_t yBlockSize = min(ssize_t(N), height); | ||
2966 | 2877 | // convolve pixels[0:filterSize - 1] | ||
2967 | 2878 | Convolve1DVertical<true, true>(forwardFiltered, | ||
2968 | 2879 | in.SubImage(x0, 0), | ||
2969 | 2880 | (IntermediateType *)NULL, | ||
2970 | 2881 | y, y + yBlockSize, xBlockSize, height, | ||
2971 | 2882 | vCoefficients, | ||
2972 | 2883 | M); | ||
2973 | 2884 | |||
2974 | 2885 | y += yBlockSize; | ||
2975 | 2886 | while (y < height) | ||
2976 | 2887 | { | ||
2977 | 2888 | yBlockSize = min(height - y, Y_BLOCK_SIZE); | ||
2978 | 2889 | |||
2979 | 2890 | Convolve1DVertical<true, false>(forwardFiltered, | ||
2980 | 2891 | in.SubImage(x0, 0), | ||
2981 | 2892 | (IntermediateType *)NULL, | ||
2982 | 2893 | y, y + yBlockSize, xBlockSize, height, | ||
2983 | 2894 | vCoefficients, | ||
2984 | 2895 | M); | ||
2985 | 2896 | y += yBlockSize; | ||
2986 | 2897 | } | ||
2987 | 2898 | |||
2988 | 2899 | //--------------- backward pass-------------------------- | ||
2989 | 2900 | SimpleImage<IntermediateType> floatOut; | ||
2990 | 2901 | // if output type is fixed point, we still compute an intermediate result as float for better precision | ||
2991 | 2902 | if (convertOutput) | ||
2992 | 2903 | { | ||
2993 | 2904 | floatOut = forwardFiltered; | ||
2994 | 2905 | } | ||
2995 | 2906 | else | ||
2996 | 2907 | { | ||
2997 | 2908 | floatOut = SimpleImage<IntermediateType>((IntermediateType *)&out[0][x0], out.pitch); | ||
2998 | 2909 | } | ||
2999 | 2910 | y = height - 1; | ||
3000 | 2911 | yBlockSize = min(ssize_t(N), height); | ||
3001 | 2912 | |||
3002 | 2913 | // in-place operation (use forwardFiltered as both input & output) is possible due to internal register buffering | ||
3003 | 2914 | Convolve1DVertical<false, true>(floatOut, | ||
3004 | 2915 | forwardFiltered, | ||
3005 | 2916 | borderValues, | ||
3006 | 2917 | y, y - yBlockSize, xBlockSize, height, | ||
3007 | 2918 | vCoefficients, | ||
3008 | 2919 | M); | ||
3009 | 2920 | |||
3010 | 2921 | if (convertOutput) | ||
3011 | 2922 | { | ||
3012 | 2923 | ssize_t outCornerY = y + 1 - yBlockSize; | ||
3013 | 2924 | Copy2D<false, 1>(out.SubImage(x0, outCornerY), floatOut.SubImage(0, outCornerY), xBlockSize, yBlockSize); | ||
3014 | 2925 | } | ||
3015 | 2926 | y -= yBlockSize; | ||
3016 | 2927 | while (y >= 0) | ||
3017 | 2928 | { | ||
3018 | 2929 | yBlockSize = min(Y_BLOCK_SIZE, y + 1); | ||
3019 | 2930 | |||
3020 | 2931 | Convolve1DVertical<false, false>(floatOut, | ||
3021 | 2932 | forwardFiltered, | ||
3022 | 2933 | borderValues, | ||
3023 | 2934 | y, y - yBlockSize, xBlockSize, y, | ||
3024 | 2935 | vCoefficients, | ||
3025 | 2936 | M); | ||
3026 | 2937 | |||
3027 | 2938 | if (convertOutput) | ||
3028 | 2939 | { | ||
3029 | 2940 | ssize_t outCornerY = y + 1 - yBlockSize; | ||
3030 | 2941 | Copy2D<false, 1>(out.SubImage(x0, outCornerY), floatOut.SubImage(0, outCornerY), xBlockSize, yBlockSize); | ||
3031 | 2942 | } | ||
3032 | 2943 | y -= yBlockSize; | ||
3033 | 2944 | } | ||
3034 | 2945 | } | ||
3035 | 2946 | } | ||
3036 | 2947 | } | ||
3037 | 2948 | |||
3038 | 2949 | template <typename OutType, typename InType> | ||
3039 | 2950 | void ConvolveVertical(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaY) | ||
3040 | 2951 | { | ||
3041 | 2952 | if (sigmaY > MAX_SIZE_FOR_SINGLE_PRECISION) | ||
3042 | 2953 | ConvolveVertical<double>(out, in, width, height, sigmaY); | ||
3043 | 2954 | else | ||
3044 | 2955 | ConvolveVertical<float>(out, in, width, height, sigmaY); | ||
3045 | 2956 | } | ||
3046 | 2957 | |||
3047 | 2958 | // 2D | ||
3048 | 2959 | template <int channels, typename OutType, typename InType> | ||
3049 | 2960 | void Convolve(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY) | ||
3050 | 2961 | { | ||
3051 | 2962 | using namespace std::chrono; | ||
3052 | 2963 | typedef uint16_t HorizontalFilteredType; | ||
3053 | 2964 | AlignedImage<HorizontalFilteredType, sizeof(__m256)> horizontalFiltered; | ||
3054 | 2965 | |||
3055 | 2966 | const bool DO_TIMING = false; | ||
3056 | 2967 | high_resolution_clock::time_point t0; | ||
3057 | 2968 | if (DO_TIMING) | ||
3058 | 2969 | t0 = high_resolution_clock::now(); | ||
3059 | 2970 | |||
3060 | 2971 | const bool TRANSPOSE = channels != 1; // means the 1st and 2nd passes transpose their output | ||
3061 | 2972 | |||
3062 | 2973 | if (TRANSPOSE) | ||
3063 | 2974 | { | ||
3064 | 2975 | horizontalFiltered.Resize(height * channels, width); | ||
3065 | 2976 | } | ||
3066 | 2977 | else | ||
3067 | 2978 | { | ||
3068 | 2979 | horizontalFiltered.Resize(width * channels, height); | ||
3069 | 2980 | } | ||
3070 | 2981 | ConvolveHorizontal<TRANSPOSE, channels>(horizontalFiltered, in, width, height, sigmaX); | ||
3071 | 2982 | |||
3072 | 2983 | #if 0 | ||
3073 | 2984 | // save intermediate image | ||
3074 | 2985 | float scale; | ||
3075 | 2986 | if (typeid(InType) == typeid(uint8_t)) | ||
3076 | 2987 | scale = 1.0f; | ||
3077 | 2988 | else if (typeid(InType) == typeid(uint16_t)) | ||
3078 | 2989 | scale = 1.0f; | ||
3079 | 2990 | else | ||
3080 | 2991 | scale = 1.0f; | ||
3081 | 2992 | SaveImage("horizontal_filtered.png", horizontalFiltered, TRANSPOSE ? height : width, TRANSPOSE ? width : height, channels, scale); | ||
3082 | 2993 | #endif | ||
3083 | 2994 | |||
3084 | 2995 | if (DO_TIMING) | ||
3085 | 2996 | cout << "Thoriz=" << duration_cast<milliseconds>(high_resolution_clock::now() - t0).count() << " ms" << endl; | ||
3086 | 2997 | |||
3087 | 2998 | //--------------------------------------------------- | ||
3088 | 2999 | |||
3089 | 3000 | if (DO_TIMING) | ||
3090 | 3001 | t0 = high_resolution_clock::now(); | ||
3091 | 3002 | if (TRANSPOSE) | ||
3092 | 3003 | { | ||
3093 | 3004 | ConvolveHorizontal<true, channels>(out, horizontalFiltered, height, width, sigmaY, true); | ||
3094 | 3005 | } | ||
3095 | 3006 | else | ||
3096 | 3007 | { | ||
3097 | 3008 | ConvolveVertical(out, horizontalFiltered, width * channels, height, sigmaY); | ||
3098 | 3009 | } | ||
3099 | 3010 | if (DO_TIMING) | ||
3100 | 3011 | cout << "Tvert=" << duration_cast<milliseconds>(high_resolution_clock::now() - t0).count() << " ms" << endl; | ||
3101 | 3012 | } | ||
3102 | 3013 | |||
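For orientation, a minimal usage sketch of the 2D entry point above. The constructor arguments follow how SimpleImage is used elsewhere in this patch (raw pointer plus row pitch); treat the sketch as illustrative, not as an API guarantee:

// blur an interleaved 8-bit RGBA image with equal sigma in x and y; in-place
// use is possible because the first pass writes to an internal intermediate
void BlurRGBA_sketch(uint8_t *pixels, ssize_t pitch, ssize_t width, ssize_t height, float sigma)
{
    SimpleImage<uint8_t> img(pixels, pitch);
    Convolve<4>(img, img, width, height, sigma, sigma);
}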
3103 | 3014 | #ifndef __SSSE3__ | ||
3104 | 3015 | #define DO_FIR_IN_FLOAT // without mulhrs_epi16, int16 not competitive | ||
3105 | 3016 | #else | ||
3106 | 3017 | #undef DO_FIR_IN_FLOAT | ||
3107 | 3018 | #endif | ||
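The SSSE3 requirement comes from pmulhrsw (_mm_mulhrs_epi16), which the int16 FIR path relies on for a rounded Q15 fixed-point multiply. A scalar model of what that instruction computes per lane:

#include <cstdint>

// multiply two Q15 values, round, keep the high 16 bits --
// equivalent to _mm_mulhrs_epi16 on a single lane
static inline int16_t mulhrs_model(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b + (1 << 14)) >> 15);
}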
3108 | 3019 | |||
3109 | 3020 | #ifdef DO_FIR_IN_FLOAT | ||
3110 | 3021 | |||
3111 | 3022 | // in-place (out = in) operation not allowed | ||
3112 | 3023 | template <int channels, bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type> | ||
3113 | 3024 | void ConvolveHorizontalFIR(SimpleImage<OutType> out, SimpleImage<InType> in, | ||
3114 | 3025 | ssize_t width, ssize_t height, | ||
3115 | 3026 | ssize_t xStart, ssize_t xEnd, | ||
3116 | 3027 | SIMD_Type *vFilter, int filterSize) // filterSize assumed to be odd | ||
3117 | 3028 | { | ||
3118 | 3029 | if (channels == 4) | ||
3119 | 3030 | { | ||
3120 | 3031 | ssize_t y = 0; | ||
3121 | 3032 | do | ||
3122 | 3033 | { | ||
3123 | 3034 | ssize_t x = xStart; | ||
3124 | 3035 | #ifdef __AVX__ | ||
3125 | 3036 | const ssize_t SIMD_WIDTH = 8, | ||
3126 | 3037 | PIXELS_PER_ITERATION = SIMD_WIDTH / channels; | ||
3127 | 3038 | __m256 vSum, | ||
3128 | 3039 | leftBorderValue, rightBorderValue; | ||
3129 | 3040 | if (onBorder) | ||
3130 | 3041 | { | ||
3131 | 3042 | __m128 temp; | ||
3132 | 3043 | LoadFloats(temp, &in[y][0 * channels]); | ||
3133 | 3044 | leftBorderValue = _mm256_setr_m128(temp, temp); | ||
3134 | 3045 | |||
3135 | 3046 | LoadFloats(temp, &in[y][(width - 1) * channels]); | ||
3136 | 3047 | rightBorderValue = _mm256_setr_m128(temp, temp); | ||
3137 | 3048 | } | ||
3138 | 3049 | goto middle; | ||
3139 | 3050 | |||
3140 | 3051 | do | ||
3141 | 3052 | { | ||
3142 | 3053 | StoreFloats(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum); | ||
3143 | 3054 | middle: | ||
3144 | 3055 | if (symmetric) | ||
3145 | 3056 | { | ||
3146 | 3057 | __m256 vIn; | ||
3147 | 3058 | vSum = vFilter[0] * LoadFloats(vIn, &in[y][x * channels]); | ||
3148 | 3059 | for (ssize_t i = 1; i <= filterSize; ++i) | ||
3149 | 3060 | { | ||
3150 | 3061 | __m256 filter = vFilter[i]; | ||
3151 | 3062 | ssize_t srcX = x - i; | ||
3152 | 3063 | if (onBorder) | ||
3153 | 3064 | srcX = max(-(PIXELS_PER_ITERATION - 1), srcX); // hack: do this for now until LoadFloats is modified to support partial loads | ||
3154 | 3065 | |||
3155 | 3066 | __m256 leftNeighbor, rightNeighbor; | ||
3156 | 3067 | LoadFloats(leftNeighbor, &in[y][srcX * channels]); | ||
3157 | 3068 | srcX = x + i; | ||
3158 | 3069 | if (onBorder) | ||
3159 | 3070 | srcX = min(width - 1, srcX); // hack: do this for now until LoadFloats is modified to support partial loads | ||
3160 | 3071 | |||
3161 | 3072 | LoadFloats(rightNeighbor, &in[y][srcX * channels]); | ||
3162 | 3073 | if (onBorder) | ||
3163 | 3074 | { | ||
3164 | 3075 | __m256i notPastEnd = PartialVectorMask32(min(PIXELS_PER_ITERATION, max(ssize_t(0), width - i - x)) * channels * sizeof(float)), | ||
3165 | 3076 | beforeBeginning = PartialVectorMask32(min(PIXELS_PER_ITERATION, max(ssize_t(0), i - x)) * channels * sizeof(float)); | ||
3166 | 3077 | leftNeighbor = _mm256_blendv_ps(leftNeighbor, leftBorderValue, _mm256_castsi256_ps(beforeBeginning)); | ||
3167 | 3078 | rightNeighbor = _mm256_blendv_ps(rightBorderValue, rightNeighbor, _mm256_castsi256_ps(notPastEnd)); | ||
3168 | 3079 | } | ||
3169 | 3080 | vSum = vSum + filter * (leftNeighbor + rightNeighbor); | ||
3170 | 3081 | } | ||
3171 | 3082 | } | ||
3172 | 3083 | else | ||
3173 | 3084 | { | ||
3174 | 3085 | vSum = _mm256_setzero_ps(); | ||
3175 | 3086 | ssize_t i = 0; | ||
3176 | 3087 | // the smaller & simpler machine code for do-while probably outweighs the cost of the extra add | ||
3177 | 3088 | do | ||
3178 | 3089 | { | ||
3179 | 3090 | ssize_t srcX = x - filterSize / 2 + i; | ||
3180 | 3091 | // todo: border not handled | ||
3181 | 3092 | __m256 vIn; | ||
3182 | 3093 | LoadFloats(vIn, &in[y][srcX * channels]); | ||
3183 | 3094 | vSum = vSum + vFilter[i] * vIn; | ||
3184 | 3095 | ++i; | ||
3185 | 3096 | } while (i < filterSize); | ||
3186 | 3097 | } | ||
3187 | 3098 | x += PIXELS_PER_ITERATION; | ||
3188 | 3099 | } while (x < xEnd); | ||
3189 | 3100 | StoreFloats<true>(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum, (xEnd - (x - PIXELS_PER_ITERATION)) * channels); | ||
3190 | 3101 | #else | ||
3191 | 3102 | do | ||
3192 | 3103 | { | ||
3193 | 3104 | __m128 vSum; | ||
3194 | 3105 | if (symmetric) | ||
3195 | 3106 | { | ||
3196 | 3107 | __m128 vIn; | ||
3197 | 3108 | LoadFloats(vIn, &in[y][x * channels]); | ||
3198 | 3109 | vSum = vFilter[0] * vIn; | ||
3199 | 3110 | for (ssize_t i = 1; i <= filterSize; ++i) | ||
3200 | 3111 | { | ||
3201 | 3112 | ssize_t srcX = x - i; | ||
3202 | 3113 | if (onBorder) | ||
3203 | 3114 | srcX = max(ssize_t(0), srcX); | ||
3204 | 3115 | |||
3205 | 3116 | __m128 leftNeighbor, rightNeighbor; | ||
3206 | 3117 | LoadFloats(leftNeighbor, &in[y][srcX * channels]); | ||
3207 | 3118 | |||
3208 | 3119 | srcX = x + i; | ||
3209 | 3120 | if (onBorder) | ||
3210 | 3121 | srcX = min(width - 1, srcX); | ||
3211 | 3122 | LoadFloats(rightNeighbor, &in[y][srcX * channels]); | ||
3212 | 3123 | |||
3213 | 3124 | vSum = vSum + vFilter[i] * (leftNeighbor + rightNeighbor); | ||
3214 | 3125 | } | ||
3215 | 3126 | } | ||
3216 | 3127 | else | ||
3217 | 3128 | { | ||
3218 | 3129 | vSum = _mm_setzero_ps(); | ||
3219 | 3130 | ssize_t i = 0; | ||
3220 | 3131 | do | ||
3221 | 3132 | { | ||
3222 | 3133 | ssize_t srcX = x - filterSize / 2 + i; | ||
3223 | 3134 | if (onBorder) | ||
3224 | 3135 | srcX = min(width - 1, max(ssize_t(0), srcX)); | ||
3225 | 3136 | |||
3226 | 3137 | __m128 vIn; | ||
3227 | 3138 | LoadFloats(vIn, &in[y][srcX * channels]); | ||
3228 | 3139 | vSum = vSum + vFilter[i] * vIn; | ||
3229 | 3140 | ++i; | ||
3230 | 3141 | } while (i < filterSize); | ||
3231 | 3142 | } | ||
3232 | 3143 | StoreFloats(&out[y][x * channels], vSum); | ||
3233 | 3144 | ++x; | ||
3234 | 3145 | } while (x < xEnd); | ||
3235 | 3146 | |||
3236 | 3147 | #endif | ||
3237 | 3148 | ++y; | ||
3238 | 3149 | } while (y < height); | ||
3239 | 3150 | } | ||
3240 | 3151 | else | ||
3241 | 3152 | { | ||
3242 | 3153 | // 1 channel | ||
3243 | 3154 | // todo: can merge with 4 channel? | ||
3244 | 3155 | ssize_t y = 0; | ||
3245 | 3156 | do | ||
3246 | 3157 | { | ||
3247 | 3158 | ssize_t x = xStart; | ||
3248 | 3159 | #ifdef __AVX__ | ||
3249 | 3160 | const ssize_t SIMD_WIDTH = 8; | ||
3250 | 3161 | |||
3251 | 3162 | __m256 leftBorderValue, rightBorderValue; | ||
3252 | 3163 | if (onBorder) | ||
3253 | 3164 | { | ||
3254 | 3165 | leftBorderValue = _mm256_set1_ps(in[y][0 * channels]); | ||
3255 | 3166 | rightBorderValue = _mm256_set1_ps(in[y][(width - 1) * channels]); | ||
3256 | 3167 | } | ||
3257 | 3168 | __m256 vSum; | ||
3258 | 3169 | // warning: compiler basic block reordering around this goto has been seen to trash performance | ||
3259 | 3170 | goto middle2; | ||
3260 | 3171 | do | ||
3261 | 3172 | { | ||
3262 | 3173 | // write out values from previous iteration | ||
3263 | 3174 | StoreFloats(&out[y][(x - SIMD_WIDTH) * channels], vSum); | ||
3264 | 3175 | middle2: | ||
3265 | 3176 | if (symmetric) | ||
3266 | 3177 | { | ||
3267 | 3178 | __m256 vIn; | ||
3268 | 3179 | vSum = vFilter[0] * LoadFloats(vIn, &in[y][x * channels]); | ||
3269 | 3180 | for (ssize_t i = 1; i <= filterSize; ++i) | ||
3270 | 3181 | { | ||
3271 | 3182 | __m256 filter = vFilter[i]; | ||
3272 | 3183 | |||
3273 | 3184 | ssize_t srcX = x - i; | ||
3274 | 3185 | if (onBorder) | ||
3275 | 3186 | srcX = max(-(SIMD_WIDTH - 1), srcX); // hack: do this for now until LoadFloats is modified to support partial loads | ||
3276 | 3187 | |||
3277 | 3188 | __m256 leftNeighbor, rightNeighbor; | ||
3278 | 3189 | LoadFloats(leftNeighbor, &in[y][srcX * channels]); | ||
3279 | 3190 | srcX = x + i; | ||
3280 | 3191 | if (onBorder) | ||
3281 | 3192 | srcX = min(width - 1, srcX); // hack: do this for now until LoadFloats is modified to support partial loads | ||
3282 | 3193 | LoadFloats(rightNeighbor, &in[y][srcX * channels]); | ||
3283 | 3194 | |||
3284 | 3195 | if (onBorder) | ||
3285 | 3196 | { | ||
3286 | 3197 | __m256i notPastEnd = PartialVectorMask32(min(SIMD_WIDTH, max(ssize_t(0), width - i - x)) * sizeof(float)), | ||
3287 | 3198 | beforeBeginning = PartialVectorMask32(min(SIMD_WIDTH, max(ssize_t(0), i - x)) * sizeof(float)); | ||
3288 | 3199 | leftNeighbor = _mm256_blendv_ps(leftNeighbor, leftBorderValue, _mm256_castsi256_ps(beforeBeginning)); | ||
3289 | 3200 | rightNeighbor = _mm256_blendv_ps(rightBorderValue, rightNeighbor, _mm256_castsi256_ps(notPastEnd)); | ||
3290 | 3201 | } | ||
3291 | 3202 | vSum = vSum + filter * (leftNeighbor + rightNeighbor); | ||
3292 | 3203 | } | ||
3293 | 3204 | } | ||
3294 | 3205 | else | ||
3295 | 3206 | { | ||
3296 | 3207 | vSum = _mm256_setzero_ps(); | ||
3297 | 3208 | ssize_t i = 0; | ||
3298 | 3209 | // the smaller & simpler machine code for do-while probably outweighs the cost of the extra add | ||
3299 | 3210 | do | ||
3300 | 3211 | { | ||
3301 | 3212 | ssize_t srcX = x - filterSize / 2 + i; | ||
3302 | 3213 | // todo: border not handled | ||
3303 | 3214 | __m256 vIn; | ||
3304 | 3215 | LoadFloats(vIn, &in[y][srcX * channels]); | ||
3305 | 3216 | vSum = vSum + vFilter[i] * vIn; | ||
3306 | 3217 | ++i; | ||
3307 | 3218 | } while (i < filterSize); | ||
3308 | 3219 | } | ||
3309 | 3220 | x += SIMD_WIDTH; | ||
3310 | 3221 | } while (x < xEnd); | ||
3311 | 3222 | StoreFloats<true>(&out[y][(x - SIMD_WIDTH) * channels], vSum, xEnd - (x - SIMD_WIDTH)); | ||
3312 | 3223 | #else | ||
3313 | 3224 | // SSE only | ||
3314 | 3225 | const ssize_t SIMD_WIDTH = 4; | ||
3315 | 3226 | |||
3316 | 3227 | __m128 leftBorderValue, rightBorderValue; | ||
3317 | 3228 | if (onBorder) | ||
3318 | 3229 | { | ||
3319 | 3230 | leftBorderValue = _mm_set1_ps(in[y][0 * channels]); | ||
3320 | 3231 | rightBorderValue = _mm_set1_ps(in[y][(width - 1) * channels]); | ||
3321 | 3232 | } | ||
3322 | 3233 | __m128 vSum; | ||
3323 | 3234 | // warning: compiler basic block reordering around this goto has been seen to trash performance | ||
3324 | 3235 | goto middle2; | ||
3325 | 3236 | do | ||
3326 | 3237 | { | ||
3327 | 3238 | // write out values from previous iteration | ||
3328 | 3239 | StoreFloats(&out[y][(x - SIMD_WIDTH) * channels], vSum); | ||
3329 | 3240 | middle2: | ||
3330 | 3241 | if (symmetric) | ||
3331 | 3242 | { | ||
3332 | 3243 | __m128 vIn; | ||
3333 | 3244 | vSum = vFilter[0] * LoadFloats(vIn, &in[y][x * channels]); | ||
3334 | 3245 | for (ssize_t i = 1; i <= filterSize; ++i) | ||
3335 | 3246 | { | ||
3336 | 3247 | __m128 filter = vFilter[i]; | ||
3337 | 3248 | |||
3338 | 3249 | ssize_t srcX = x - i; | ||
3339 | 3250 | if (onBorder) | ||
3340 | 3251 | srcX = max(-(SIMD_WIDTH - 1), srcX); // hack: do this for now until LoadFloats is modified to support partial loads | ||
3341 | 3252 | |||
3342 | 3253 | __m128 leftNeighbor, rightNeighbor; | ||
3343 | 3254 | LoadFloats(leftNeighbor, &in[y][srcX * channels]); | ||
3344 | 3255 | srcX = x + i; | ||
3345 | 3256 | if (onBorder) | ||
3346 | 3257 | srcX = min(width - 1, srcX); // hack: do this for now until LoadFloats is modified to support partial loads | ||
3347 | 3258 | LoadFloats(rightNeighbor, &in[y][srcX * channels]); | ||
3348 | 3259 | |||
3349 | 3260 | if (onBorder) | ||
3350 | 3261 | { | ||
3351 | 3262 | __m128i notPastEnd = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), width - i - x)) * sizeof(float)), | ||
3352 | 3263 | beforeBeginning = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), i - x)) * sizeof(float)); | ||
3353 | 3264 | leftNeighbor = Select(leftNeighbor, leftBorderValue, _mm_castsi128_ps(beforeBeginning)); | ||
3354 | 3265 | rightNeighbor = Select(rightBorderValue, rightNeighbor, _mm_castsi128_ps(notPastEnd)); | ||
3355 | 3266 | } | ||
3356 | 3267 | vSum = vSum + filter * (leftNeighbor + rightNeighbor); | ||
3357 | 3268 | } | ||
3358 | 3269 | } | ||
3359 | 3270 | else | ||
3360 | 3271 | { | ||
3361 | 3272 | vSum = _mm_setzero_ps(); | ||
3362 | 3273 | ssize_t i = 0; | ||
3363 | 3274 | // the smaller & simpler machine code for do-while probably outweighs the cost of the extra add | ||
3364 | 3275 | do | ||
3365 | 3276 | { | ||
3366 | 3277 | ssize_t srcX = x - filterSize / 2 + i; | ||
3367 | 3278 | // todo: border not handled | ||
3368 | 3279 | __m128 vIn; | ||
3369 | 3280 | LoadFloats(vIn, &in[y][srcX * channels]); | ||
3370 | 3281 | vSum = vSum + vFilter[i] * vIn; | ||
3371 | 3282 | ++i; | ||
3372 | 3283 | } while (i < filterSize); | ||
3373 | 3284 | } | ||
3374 | 3285 | x += SIMD_WIDTH; | ||
3375 | 3286 | } while (x < xEnd); | ||
3376 | 3287 | StoreFloats<true>(&out[y][(x - SIMD_WIDTH) * channels], vSum, xEnd - (x - SIMD_WIDTH)); | ||
3377 | 3288 | #endif | ||
3378 | 3289 | ++y; | ||
3379 | 3290 | } while (y < height); | ||
3380 | 3291 | } | ||
3381 | 3292 | } | ||
3382 | 3293 | |||
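The symmetric branches above all evaluate the same folded form of the convolution: because a Gaussian kernel satisfies w[-i] == w[i], the output is w[0]*in[x] plus the sum over i of w[i]*(in[x-i] + in[x+i]), one multiply per mirrored pair of taps instead of one per tap. A minimal scalar sketch of that identity with clamp-to-edge borders, matching the onBorder clamping above (names are illustrative, not part of the patch):

    #include <algorithm>
    #include <cstddef>

    void ConvolveRowSymmetric(float *out, const float *in, std::ptrdiff_t width,
                              const float *w, int halfFilterSize) // w[0..halfFilterSize]
    {
        for (std::ptrdiff_t x = 0; x < width; ++x)
        {
            float sum = w[0] * in[x]; // center tap
            for (int i = 1; i <= halfFilterSize; ++i)
            {
                // clamp-to-edge border handling, as in the onBorder paths above
                std::ptrdiff_t left  = std::max<std::ptrdiff_t>(0, x - i);
                std::ptrdiff_t right = std::min<std::ptrdiff_t>(width - 1, x + i);
                sum += w[i] * (in[left] + in[right]); // one multiply per mirrored pair
            }
            out[x] = sum;
        }
    }

The SIMD loops are this code unrolled across 4 or 8 lanes, with the scalar clamps replaced by masked blends against the cached leftBorderValue/rightBorderValue vectors.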
3383 | 3294 | #else | ||
3384 | 3295 | |||
3385 | 3296 | // DO_FIR_IN_INT16 | ||
3386 | 3297 | |||
3387 | 3298 | // in-place (out = in) operation not allowed | ||
3388 | 3299 | template <int channels, bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type> | ||
3389 | 3300 | void ConvolveHorizontalFIR(SimpleImage<OutType> out, SimpleImage<InType> in, | ||
3390 | 3301 | ssize_t width, ssize_t height, | ||
3391 | 3302 | ssize_t xStart, ssize_t xEnd, | ||
3392 | 3303 | SIMD_Type *vFilter, int filterSize) | ||
3393 | 3304 | { | ||
3394 | 3305 | if (channels == 4) | ||
3395 | 3306 | { | ||
3396 | 3307 | ssize_t y = 0; | ||
3397 | 3308 | do | ||
3398 | 3309 | { | ||
3399 | 3310 | #ifdef __AVX2__ | ||
3400 | 3311 | int16_t *convertedIn; | ||
3401 | 3312 | if (typeid(InType) == typeid(int16_t)) | ||
3402 | 3313 | { | ||
3403 | 3314 | convertedIn = (int16_t *)&in[y][0]; | ||
3404 | 3315 | } | ||
3405 | 3316 | else | ||
3406 | 3317 | { | ||
3407 | 3318 | convertedIn = (int16_t *)ALIGNED_ALLOCA(RoundUp(width * channels, ssize_t(sizeof(__m256i) / sizeof(int16_t))) * sizeof(int16_t), sizeof(__m256i)); | ||
3408 | 3319 | for (ssize_t x = 0; x < width * channels; x += 16) | ||
3409 | 3320 | { | ||
3410 | 3321 | __m128i u8 = _mm_loadu_si128((__m128i *)&in[y][x]); | ||
3411 | 3322 | __m256i i16 = _mm256_slli_epi16(_mm256_cvtepu8_epi16(u8), 6); | ||
3412 | 3323 | _mm256_store_si256((__m256i *)&convertedIn[x], i16); | ||
3413 | 3324 | } | ||
3414 | 3325 | } | ||
3415 | 3326 | ssize_t x = xStart; | ||
3416 | 3327 | const ssize_t SIMD_WIDTH = 16, | ||
3417 | 3328 | PIXELS_PER_ITERATION = SIMD_WIDTH / channels; | ||
3418 | 3329 | __m256i vSum; | ||
3419 | 3330 | __m256i leftBorderValue, rightBorderValue; | ||
3420 | 3331 | if (onBorder) | ||
3421 | 3332 | { | ||
3422 | 3333 | __m128i temp = _mm_set1_epi64x(*(int64_t *)&convertedIn[0 * channels]); | ||
3423 | 3334 | leftBorderValue = _mm256_setr_m128i(temp, temp); | ||
3424 | 3335 | temp = _mm_set1_epi64x(*(int64_t *)&convertedIn[(width - 1) * channels]); | ||
3425 | 3336 | rightBorderValue = _mm256_setr_m128i(temp, temp); | ||
3426 | 3337 | } | ||
3427 | 3338 | goto middle2; | ||
3428 | 3339 | do | ||
3429 | 3340 | { | ||
3430 | 3341 | ScaleAndStoreInt16(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum); | ||
3431 | 3342 | middle2: | ||
3432 | 3343 | if (symmetric) | ||
3433 | 3344 | { | ||
3434 | 3345 | __m256i center = _mm256_loadu_si256((__m256i *)&convertedIn[x * channels]); | ||
3435 | 3346 | vSum = _mm256_mulhrs_epi16(vFilter[0], center); | ||
3436 | 3347 | for (ssize_t i = 1; i <= filterSize; ++i) | ||
3437 | 3348 | { | ||
3438 | 3349 | __m256i filter = vFilter[i]; | ||
3439 | 3350 | |||
3440 | 3351 | ssize_t srcX = x - i; | ||
3441 | 3352 | if (onBorder) | ||
3442 | 3353 | srcX = max(-(PIXELS_PER_ITERATION - 1), srcX); // hack: clamp for now until LoadAndScaleToInt16() supports partial loads | ||
3443 | 3354 | |||
3444 | 3355 | __m256i leftNeighbor, rightNeighbor; | ||
3445 | 3356 | leftNeighbor = _mm256_loadu_si256((__m256i *)&convertedIn[srcX * channels]); | ||
3446 | 3357 | |||
3447 | 3358 | srcX = x + i; | ||
3448 | 3359 | if (onBorder) | ||
3449 | 3360 | srcX = min(width - 1, srcX); | ||
3450 | 3361 | rightNeighbor = _mm256_loadu_si256((__m256i *)&convertedIn[srcX * channels]); | ||
3451 | 3362 | |||
3452 | 3363 | if (onBorder) | ||
3453 | 3364 | { | ||
3454 | 3365 | __m256i leftMask = PartialVectorMask32(min(PIXELS_PER_ITERATION, max(ssize_t(0), i - x)) * channels * sizeof(int16_t)), | ||
3455 | 3366 | rightMask = PartialVectorMask32(min(PIXELS_PER_ITERATION, width - (x + i)) * channels * sizeof(int16_t)); | ||
3456 | 3367 | leftNeighbor = _mm256_blendv_epi8(leftNeighbor, leftBorderValue, leftMask); | ||
3457 | 3368 | rightNeighbor = _mm256_blendv_epi8(rightBorderValue, rightNeighbor, rightMask); | ||
3458 | 3369 | } | ||
3459 | 3370 | vSum = _mm256_adds_epi16(vSum, _mm256_mulhrs_epi16(filter, _mm256_adds_epi16(leftNeighbor, rightNeighbor))); | ||
3460 | 3371 | } | ||
3461 | 3372 | } | ||
3462 | 3373 | else | ||
3463 | 3374 | { | ||
3464 | 3375 | throw 0; // non-symmetric kernels not implemented in the int16 path | ||
3465 | 3376 | } | ||
3466 | 3377 | x += PIXELS_PER_ITERATION; | ||
3467 | 3378 | } while (x < xEnd); | ||
3468 | 3379 | ScaleAndStoreInt16<true>(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum, (xEnd - (x - PIXELS_PER_ITERATION)) * channels); | ||
3469 | 3380 | #else | ||
3470 | 3381 | // SSSE3 only | ||
3471 | 3382 | int16_t *convertedIn; | ||
3472 | 3383 | if (typeid(InType) == typeid(int16_t)) | ||
3473 | 3384 | { | ||
3474 | 3385 | convertedIn = (int16_t *)&in[y][0]; | ||
3475 | 3386 | } | ||
3476 | 3387 | else | ||
3477 | 3388 | { | ||
3478 | 3389 | convertedIn = (int16_t *)ALIGNED_ALLOCA(RoundUp(width * channels, ssize_t(sizeof(__m128i) / sizeof(int16_t))) * sizeof(int16_t), sizeof(__m128i)); | ||
3479 | 3390 | for (ssize_t x = 0; x < width * channels; x += 8) | ||
3480 | 3391 | { | ||
3481 | 3392 | __m128i u8 = _mm_loadl_epi64((__m128i *)&in[y][x]); | ||
3482 | 3393 | __m128i i16 = _mm_slli_epi16(_mm_cvtepu8_epi16(u8), 6); | ||
3483 | 3394 | _mm_store_si128((__m128i *)&convertedIn[x], i16); | ||
3484 | 3395 | } | ||
3485 | 3396 | } | ||
3486 | 3397 | ssize_t x = xStart; | ||
3487 | 3398 | const ssize_t SIMD_WIDTH = 8, | ||
3488 | 3399 | PIXELS_PER_ITERATION = SIMD_WIDTH / channels; | ||
3489 | 3400 | __m128i vSum; | ||
3490 | 3401 | __m128i leftBorderValue, rightBorderValue; | ||
3491 | 3402 | if (onBorder) | ||
3492 | 3403 | { | ||
3493 | 3404 | leftBorderValue = _mm_set1_epi64x(*(int64_t *)&convertedIn[0 * channels]); | ||
3494 | 3405 | rightBorderValue = _mm_set1_epi64x(*(int64_t *)&convertedIn[(width - 1) * channels]); | ||
3495 | 3406 | } | ||
3496 | 3407 | goto middle3; | ||
3497 | 3408 | do | ||
3498 | 3409 | { | ||
3499 | 3410 | ScaleAndStoreInt16(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum); | ||
3500 | 3411 | middle3: | ||
3501 | 3412 | if (symmetric) | ||
3502 | 3413 | { | ||
3503 | 3414 | __m128i center; | ||
3504 | 3415 | vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), LoadAndScaleToInt16(center, &convertedIn[x * channels])); | ||
3505 | 3416 | for (ssize_t i = 1; i <= filterSize; ++i) | ||
3506 | 3417 | { | ||
3507 | 3418 | __m128i filter = Cast256To128(vFilter[i]); | ||
3508 | 3419 | |||
3509 | 3420 | ssize_t srcX = x - i; | ||
3510 | 3421 | if (onBorder) | ||
3511 | 3422 | srcX = max(-(PIXELS_PER_ITERATION - 1), srcX); // hack: clamp for now until LoadAndScaleToInt16() supports partial loads | ||
3512 | 3423 | |||
3513 | 3424 | __m128i leftNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]); | ||
3514 | 3425 | |||
3515 | 3426 | srcX = x + i; | ||
3516 | 3427 | if (onBorder) | ||
3517 | 3428 | srcX = min(width - 1, srcX); | ||
3518 | 3429 | |||
3519 | 3430 | __m128i rightNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]); | ||
3520 | 3431 | |||
3521 | 3432 | if (onBorder) | ||
3522 | 3433 | { | ||
3523 | 3434 | __m128i leftMask = PartialVectorMask(min(PIXELS_PER_ITERATION, max(ssize_t(0), i - x)) * channels * sizeof(int16_t)), | ||
3524 | 3435 | rightMask = PartialVectorMask(min(PIXELS_PER_ITERATION, width - (x + i)) * channels * sizeof(int16_t)); | ||
3525 | 3436 | leftNeighbor = _mm_blendv_epi8(leftNeighbor, leftBorderValue, leftMask); | ||
3526 | 3437 | rightNeighbor = _mm_blendv_epi8(rightBorderValue, rightNeighbor, rightMask); | ||
3527 | 3438 | } | ||
3528 | 3439 | vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(filter, _mm_adds_epi16(leftNeighbor, rightNeighbor))); | ||
3529 | 3440 | } | ||
3530 | 3441 | } | ||
3531 | 3442 | else | ||
3532 | 3443 | { | ||
3533 | 3444 | throw 0; // non-symmetric kernels not implemented in the int16 path | ||
3534 | 3445 | } | ||
3535 | 3446 | x += PIXELS_PER_ITERATION; | ||
3536 | 3447 | } while (x < xEnd); | ||
3537 | 3448 | ScaleAndStoreInt16<true>(&out[y][(x - PIXELS_PER_ITERATION) * channels], vSum, (xEnd - (x - PIXELS_PER_ITERATION)) * channels); | ||
3538 | 3449 | #endif | ||
3539 | 3450 | |||
3540 | 3451 | ++y; | ||
3541 | 3452 | } while (y < height); | ||
3542 | 3453 | } | ||
3543 | 3454 | else | ||
3544 | 3455 | { | ||
3545 | 3456 | #ifdef __GNUC__ | ||
3546 | 3457 | const static void *labels[] = | ||
3547 | 3458 | { | ||
3548 | 3459 | NULL, | ||
3549 | 3460 | &&remainder1, | ||
3550 | 3461 | &&remainder2, | ||
3551 | 3462 | &&remainder3, | ||
3552 | 3463 | &&remainder4, | ||
3553 | 3464 | |||
3554 | 3465 | &&remainder5, | ||
3555 | 3466 | &&remainder6, | ||
3556 | 3467 | &&remainder7, | ||
3557 | 3468 | &&remainder8 | ||
3558 | 3469 | }; | ||
3559 | 3470 | #endif | ||
3560 | 3471 | // 1 channel | ||
3561 | 3472 | // todo: can merge with 4 channel? | ||
3562 | 3473 | ssize_t y = 0; | ||
3563 | 3474 | do | ||
3564 | 3475 | { | ||
3565 | 3476 | int16_t *convertedIn; | ||
3566 | 3477 | if (typeid(InType) == typeid(int16_t)) | ||
3567 | 3478 | { | ||
3568 | 3479 | convertedIn = (int16_t *)&in[y][0]; | ||
3569 | 3480 | } | ||
3570 | 3481 | else | ||
3571 | 3482 | { | ||
3572 | 3483 | convertedIn = (int16_t *)ALIGNED_ALLOCA(RoundUp(width, ssize_t(sizeof(__m256i) / sizeof(int16_t))) * sizeof(int16_t), sizeof(__m256i)); | ||
3573 | 3484 | #ifdef __AVX2__ | ||
3574 | 3485 | for (ssize_t x = 0; x < width; x += 16) | ||
3575 | 3486 | { | ||
3576 | 3487 | __m128i u8 = _mm_loadu_si128((__m128i *)&in[y][x]); | ||
3577 | 3488 | __m256i i16 = _mm256_slli_epi16(_mm256_cvtepu8_epi16(u8), 6); | ||
3578 | 3489 | _mm256_store_si256((__m256i *)&convertedIn[x], i16); | ||
3579 | 3490 | } | ||
3580 | 3491 | #else | ||
3581 | 3492 | for (ssize_t x = 0; x < width; x += 8) | ||
3582 | 3493 | { | ||
3583 | 3494 | __m128i u8 = _mm_loadl_epi64((__m128i *)&in[y][x]); | ||
3584 | 3495 | __m128i i16 = _mm_slli_epi16(_mm_cvtepu8_epi16(u8), 6); | ||
3585 | 3496 | _mm_store_si128((__m128i *)&convertedIn[x], i16); | ||
3586 | 3497 | } | ||
3587 | 3498 | #endif | ||
3588 | 3499 | } | ||
3589 | 3500 | ssize_t x = xStart; | ||
3590 | 3501 | const ssize_t SIMD_WIDTH = 8; | ||
3591 | 3502 | __m128i vSum; | ||
3592 | 3503 | __m128i leftBorderValue, rightBorderValue; | ||
3593 | 3504 | if (onBorder) | ||
3594 | 3505 | { | ||
3595 | 3506 | leftBorderValue = _mm_set1_epi16(convertedIn[0 * channels]); | ||
3596 | 3507 | rightBorderValue = _mm_set1_epi16(convertedIn[(width - 1) * channels]); | ||
3597 | 3508 | } | ||
3598 | 3509 | goto middle; | ||
3599 | 3510 | do | ||
3600 | 3511 | { | ||
3601 | 3512 | ScaleAndStoreInt16(&out[y][x - SIMD_WIDTH], vSum); | ||
3602 | 3513 | middle: | ||
3603 | 3514 | if (symmetric) | ||
3604 | 3515 | { | ||
3605 | 3516 | #ifdef __GNUC__ | ||
3606 | 3517 | // up to 1.2x faster by using palignr instead of unaligned loads! | ||
3607 | 3518 | // the greatest difficulty with using palignr is that the offset must be a compile-time constant | ||
3608 | 3519 | // solution? Duff's device | ||
3609 | 3520 | __m128i leftHalf[2], | ||
3610 | 3521 | rightHalf[2], | ||
3611 | 3522 | center = _mm_loadu_si128((__m128i *)&convertedIn[x]); | ||
3612 | 3523 | |||
3613 | 3524 | if (onBorder) | ||
3614 | 3525 | { | ||
3615 | 3526 | __m128i mask = PartialVectorMask(min(SIMD_WIDTH, width - x) * sizeof(int16_t)); | ||
3616 | 3527 | center = _mm_blendv_epi8(rightBorderValue, center, mask); | ||
3617 | 3528 | } | ||
3618 | 3529 | rightHalf[0] = leftHalf[1] = center; | ||
3619 | 3530 | |||
3620 | 3531 | vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), rightHalf[0]); | ||
3621 | 3532 | |||
3622 | 3533 | ssize_t base = 0; | ||
3623 | 3534 | while (base < filterSize) | ||
3624 | 3535 | { | ||
3625 | 3536 | leftHalf[0] = rightHalf[0]; | ||
3626 | 3537 | rightHalf[1] = leftHalf[1]; | ||
3627 | 3538 | rightHalf[0] = _mm_loadu_si128((__m128i *)&convertedIn[x + base + 8]); | ||
3628 | 3539 | leftHalf[1] = _mm_loadu_si128((__m128i *)&convertedIn[x - base - 8]); | ||
3629 | 3540 | |||
3630 | 3541 | if (onBorder) | ||
3631 | 3542 | { | ||
3632 | 3543 | __m128i leftMask = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), (base + 8) - x)) * sizeof(int16_t)), | ||
3633 | 3544 | rightMask = PartialVectorMask(min(SIMD_WIDTH, width - (x + base + 8)) * sizeof(int16_t)); | ||
3634 | 3545 | leftHalf[1] = _mm_blendv_epi8(leftHalf[1], leftBorderValue, leftMask); | ||
3635 | 3546 | rightHalf[0] = _mm_blendv_epi8(rightBorderValue, rightHalf[0], rightMask); | ||
3636 | 3547 | } | ||
3637 | 3548 | |||
3638 | 3549 | goto *labels[min(ssize_t(8), filterSize - base)]; | ||
3639 | 3550 | __m128i v, v2; | ||
3640 | 3551 | remainder8: | ||
3641 | 3552 | v = rightHalf[0]; // same as palignr(right, left, 16) | ||
3642 | 3553 | v2 = leftHalf[1]; | ||
3643 | 3554 | vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 8]), _mm_adds_epi16(v, v2))); | ||
3644 | 3555 | remainder7: | ||
3645 | 3556 | v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 14); | ||
3646 | 3557 | v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 2); | ||
3647 | 3558 | vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 7]), _mm_adds_epi16(v, v2))); | ||
3648 | 3559 | remainder6: | ||
3649 | 3560 | v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 12); | ||
3650 | 3561 | v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 4); | ||
3651 | 3562 | vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 6]), _mm_adds_epi16(v, v2))); | ||
3652 | 3563 | remainder5: | ||
3653 | 3564 | v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 10); | ||
3654 | 3565 | v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 6); | ||
3655 | 3566 | vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 5]), _mm_adds_epi16(v, v2))); | ||
3656 | 3567 | remainder4: | ||
3657 | 3568 | v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 8); | ||
3658 | 3569 | v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 8); | ||
3659 | 3570 | vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 4]), _mm_adds_epi16(v, v2))); | ||
3660 | 3571 | remainder3: | ||
3661 | 3572 | v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 6); | ||
3662 | 3573 | v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 10); | ||
3663 | 3574 | vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 3]), _mm_adds_epi16(v, v2))); | ||
3664 | 3575 | remainder2: | ||
3665 | 3576 | v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 4); | ||
3666 | 3577 | v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 12); | ||
3667 | 3578 | vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 2]), _mm_adds_epi16(v, v2))); | ||
3668 | 3579 | remainder1: | ||
3669 | 3580 | v = _mm_alignr_epi8(rightHalf[0], leftHalf[0], 2); | ||
3670 | 3581 | v2 = _mm_alignr_epi8(rightHalf[1], leftHalf[1], 14); | ||
3671 | 3582 | vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(Cast256To128(vFilter[base + 1]), _mm_adds_epi16(v, v2))); | ||
3672 | 3583 | base += 8; | ||
3673 | 3584 | } | ||
3674 | 3585 | #else | ||
3675 | 3586 | __m128i center; | ||
3676 | 3587 | vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), LoadAndScaleToInt16(center, &convertedIn[x * channels])); | ||
3677 | 3588 | for (ssize_t i = 1; i <= filterSize; ++i) | ||
3678 | 3589 | { | ||
3679 | 3590 | __m128i filter = Cast256To128(vFilter[i]); | ||
3680 | 3591 | |||
3681 | 3592 | ssize_t srcX = x - i; | ||
3682 | 3593 | if (onBorder) | ||
3683 | 3594 | srcX = max(-(SIMD_WIDTH - 1), srcX); // hack: clamp for now until LoadAndScaleToInt16() supports partial loads | ||
3684 | 3595 | |||
3685 | 3596 | __m128i leftNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]); | ||
3686 | 3597 | |||
3687 | 3598 | srcX = x + i; | ||
3688 | 3599 | if (onBorder) | ||
3689 | 3600 | srcX = min(width - 1, srcX); | ||
3690 | 3601 | |||
3691 | 3602 | __m128i rightNeighbor = _mm_loadu_si128((__m128i *)&convertedIn[srcX * channels]); | ||
3692 | 3603 | |||
3693 | 3604 | if (onBorder) | ||
3694 | 3605 | { | ||
3695 | 3606 | __m128i leftMask = PartialVectorMask(min(SIMD_WIDTH, max(ssize_t(0), i - x)) * sizeof(int16_t)), | ||
3696 | 3607 | rightMask = PartialVectorMask(min(SIMD_WIDTH, width - (x + i)) * sizeof(int16_t)); | ||
3697 | 3608 | leftNeighbor = _mm_blendv_epi8(leftNeighbor, leftBorderValue, leftMask); | ||
3698 | 3609 | rightNeighbor = _mm_blendv_epi8(rightBorderValue, rightNeighbor, rightMask); | ||
3699 | 3610 | } | ||
3700 | 3611 | vSum = _mm_adds_epi16(vSum, _mm_mulhrs_epi16(filter, _mm_adds_epi16(leftNeighbor, rightNeighbor))); | ||
3701 | 3612 | } | ||
3702 | 3613 | #endif | ||
3703 | 3614 | } | ||
3704 | 3615 | else | ||
3705 | 3616 | { | ||
3706 | 3617 | throw 0; // non-symmetric kernels not implemented in the int16 path | ||
3707 | 3618 | } | ||
3708 | 3619 | x += SIMD_WIDTH; | ||
3709 | 3620 | } while (x < xEnd); | ||
3710 | 3621 | ScaleAndStoreInt16<true>(&out[y][x - SIMD_WIDTH], vSum, xEnd - (x - SIMD_WIDTH)); | ||
3711 | 3622 | ++y; | ||
3712 | 3623 | } while (y < height); | ||
3713 | 3624 | } | ||
3714 | 3625 | } | ||
3715 | 3626 | #endif | ||
3716 | 3627 | |||
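The computed-goto table above exists because _mm_alignr_epi8 requires its byte offset to be a compile-time constant, so the loop body spells out all eight offsets and goto *labels[n] jumps into the middle to run exactly n = min(8, filterSize - base) taps: a Duff's device. A portable sketch of the same control flow as a fall-through switch on scalar data (illustrative names only):

    // accumulate the last 'count' taps of an 8-tap block; entering at 'count'
    // replays cases count..1, just like goto *labels[count] above
    static float SumLastTaps(const float tap[8], int count) // 1 <= count <= 8
    {
        float sum = 0.0f;
        switch (count)
        {
        case 8: sum += tap[7]; // fall through
        case 7: sum += tap[6]; // fall through
        case 6: sum += tap[5]; // fall through
        case 5: sum += tap[4]; // fall through
        case 4: sum += tap[3]; // fall through
        case 3: sum += tap[2]; // fall through
        case 2: sum += tap[1]; // fall through
        case 1: sum += tap[0];
        }
        return sum;
    }

GCC's labels-as-values extension is what makes the in-loop version non-portable, hence the #ifdef __GNUC__ guard around the label table.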
3717 | 3628 | // handles blocking | ||
3718 | 3629 | // in-place (out = in) operation not allowed | ||
3719 | 3630 | template <int channels, typename OutType, typename InType> | ||
3720 | 3631 | void ConvolveHorizontalFIR(SimpleImage<OutType> out, | ||
3721 | 3632 | SimpleImage<InType> in, | ||
3722 | 3633 | ssize_t width, ssize_t height, float sigmaX) | ||
3723 | 3634 | { | ||
3724 | 3635 | #ifdef DO_FIR_IN_FLOAT | ||
3725 | 3636 | typedef MyTraits<float>::SIMDtype SIMDtype; | ||
3726 | 3637 | #else | ||
3727 | 3638 | typedef MyTraits<int16_t>::SIMDtype SIMDtype; | ||
3728 | 3639 | #endif | ||
3729 | 3640 | |||
3730 | 3641 | ssize_t halfFilterSize = _effect_area_scr(sigmaX); | ||
3731 | 3642 | float *filter = (float *)alloca((halfFilterSize + 1) * sizeof(float)); | ||
3732 | 3643 | _make_kernel(filter, sigmaX); | ||
3733 | 3644 | |||
3734 | 3645 | SIMDtype *vFilter = (SIMDtype *)ALIGNED_ALLOCA((halfFilterSize + 1) * sizeof(SIMDtype), sizeof(SIMDtype)); | ||
3735 | 3646 | |||
3736 | 3647 | for (ssize_t i = 0; i <= halfFilterSize; ++i) | ||
3737 | 3648 | { | ||
3738 | 3649 | #ifdef DO_FIR_IN_FLOAT | ||
3739 | 3650 | BroadcastSIMD(vFilter[i], filter[i]); | ||
3740 | 3651 | #else | ||
3741 | 3652 | BroadcastSIMD(vFilter[i], clip_round_cast<int16_t>(filter[i] * 32768)); | ||
3742 | 3653 | #endif | ||
3743 | 3654 | } | ||
3744 | 3655 | |||
3745 | 3656 | const ssize_t IDEAL_Y_BLOCK_SIZE = 1; // pointless for now, but might be needed in the future when SIMD code processes 2 rows at a time | ||
3746 | 3657 | |||
3747 | 3658 | #pragma omp parallel | ||
3748 | 3659 | { | ||
3749 | 3660 | #pragma omp for | ||
3750 | 3661 | for (ssize_t y = 0; y < height; y += IDEAL_Y_BLOCK_SIZE) | ||
3751 | 3662 | { | ||
3752 | 3663 | ssize_t yBlockSize = min(height - y, IDEAL_Y_BLOCK_SIZE); | ||
3753 | 3664 | |||
3754 | 3665 | ssize_t nonBorderStart = min(width, RoundUp(halfFilterSize, ssize_t(sizeof(__m256) / channels / sizeof(InType)))); // so that data for non-border region is vector aligned | ||
3755 | 3666 | if (nonBorderStart < width - halfFilterSize) | ||
3756 | 3667 | ConvolveHorizontalFIR<channels, true, false>(out.SubImage(0, y), | ||
3757 | 3668 | in.SubImage(0, y), | ||
3758 | 3669 | width, yBlockSize, | ||
3759 | 3670 | nonBorderStart, width - halfFilterSize, | ||
3760 | 3671 | vFilter, halfFilterSize); | ||
3761 | 3672 | ssize_t xStart = 0, | ||
3762 | 3673 | xEnd = nonBorderStart; | ||
3763 | 3674 | processEnd: | ||
3764 | 3675 | ConvolveHorizontalFIR<channels, true, true>(out.SubImage(0, y), | ||
3765 | 3676 | in.SubImage(0, y), | ||
3766 | 3677 | width, yBlockSize, | ||
3767 | 3678 | xStart, xEnd, | ||
3768 | 3679 | vFilter, halfFilterSize); | ||
3769 | 3680 | if (xStart == 0) | ||
3770 | 3681 | { | ||
3771 | 3682 | // keep an inline-happy compiler from inlining another call to ConvolveHorizontalFIR() | ||
3772 | 3683 | xStart = max(nonBorderStart, width - halfFilterSize); // don't refilter anything in case the 2 border regions overlap | ||
3773 | 3684 | xEnd = width; | ||
3774 | 3685 | goto processEnd; | ||
3775 | 3686 | } | ||
3776 | 3687 | } | ||
3777 | 3688 | } // omp parallel | ||
3778 | 3689 | } | ||
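To keep the hot loop branch-free, the wrapper above splits each row into three ranges: the left border [0, nonBorderStart) and the right border [width - halfFilterSize, width) run the onBorder=true instantiation, everything in between runs the unchecked one, and nonBorderStart is rounded up so the fast region begins on a vector boundary. A sketch of the same split (standalone form, assuming RoundUp rounds to a multiple):

    #include <algorithm>
    #include <cstddef>

    struct RowRegions
    {
        std::ptrdiff_t borderEnd; // [0, borderEnd) needs clamping
        std::ptrdiff_t fastEnd;   // [fastEnd, width) needs clamping
    };

    static RowRegions SplitRow(std::ptrdiff_t width, std::ptrdiff_t halfFilterSize,
                               std::ptrdiff_t vectorPixels) // pixels per SIMD vector
    {
        RowRegions r;
        // round up so the unchecked region starts vector aligned
        r.borderEnd = std::min(width,
            (halfFilterSize + vectorPixels - 1) / vectorPixels * vectorPixels);
        // the two border regions may overlap on images narrower than 2 * halfFilterSize
        r.fastEnd = std::max(r.borderEnd, width - halfFilterSize);
        return r;
    }

The goto processEnd dance is only there to run the border instantiation twice without tempting the compiler into inlining a third copy of ConvolveHorizontalFIR.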
3779 | 3690 | |||
3780 | 3691 | #ifdef DO_FIR_IN_FLOAT | ||
3781 | 3692 | // in-place (out = in) operation not allowed | ||
3782 | 3693 | template <bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type> | ||
3783 | 3694 | void ConvolveVerticalFIR(SimpleImage<OutType> out, SimpleImage<InType> in, | ||
3784 | 3695 | ssize_t width, ssize_t height, | ||
3785 | 3696 | ssize_t yStart, ssize_t yEnd, | ||
3786 | 3697 | SIMD_Type *vFilter, int filterSize) | ||
3787 | 3698 | { | ||
3788 | 3699 | ssize_t y = yStart; | ||
3789 | 3700 | do | ||
3790 | 3701 | { | ||
3791 | 3702 | ssize_t x = 0; | ||
3792 | 3703 | #ifdef __AVX__ | ||
3793 | 3704 | const ssize_t SIMD_WIDTH = 8; | ||
3794 | 3705 | __m256 vSum; | ||
3795 | 3706 | goto middle; | ||
3796 | 3707 | do | ||
3797 | 3708 | { | ||
3798 | 3709 | StoreFloats(&out[y][x - SIMD_WIDTH], vSum); // write out data from previous iteration | ||
3799 | 3710 | middle: | ||
3800 | 3711 | if (symmetric) | ||
3801 | 3712 | { | ||
3802 | 3713 | __m256 vIn; | ||
3803 | 3714 | LoadFloats(vIn, &in[y][x]); | ||
3804 | 3715 | vSum = vFilter[0] * vIn; | ||
3805 | 3716 | for (ssize_t i = 1; i <= filterSize; ++i) | ||
3806 | 3717 | { | ||
3807 | 3718 | ssize_t srcY = y - i; | ||
3808 | 3719 | if (onBorder) | ||
3809 | 3720 | srcY = max(ssize_t(0), srcY); | ||
3810 | 3721 | __m256 bottom, top; | ||
3811 | 3722 | LoadFloats(bottom, &in[srcY][x]); | ||
3812 | 3723 | |||
3813 | 3724 | srcY = y + i; | ||
3814 | 3725 | if (onBorder) | ||
3815 | 3726 | srcY = min(height - 1, srcY); | ||
3816 | 3727 | LoadFloats(top, &in[srcY][x]); | ||
3817 | 3728 | |||
3818 | 3729 | vSum = vSum + vFilter[i] * (bottom + top); | ||
3819 | 3730 | } | ||
3820 | 3731 | } | ||
3821 | 3732 | else | ||
3822 | 3733 | { | ||
3823 | 3734 | // the smaller & simpler do-while machine code probably outweighs the cost of the extra add | ||
3824 | 3735 | vSum = _mm256_setzero_ps(); | ||
3825 | 3736 | ssize_t i = 0; | ||
3826 | 3737 | do | ||
3827 | 3738 | { | ||
3828 | 3739 | ssize_t srcY = y - filterSize / 2 + i; | ||
3829 | 3740 | if (onBorder) | ||
3830 | 3741 | srcY = min(height - 1, max(ssize_t(0), srcY)); | ||
3831 | 3742 | __m256 vIn; | ||
3832 | 3743 | LoadFloats(vIn, &in[srcY][x]); | ||
3833 | 3744 | vSum = vSum + vFilter[i] * vIn; | ||
3834 | 3745 | ++i; | ||
3835 | 3746 | } while (i < filterSize); | ||
3836 | 3747 | } | ||
3837 | 3748 | x += SIMD_WIDTH; | ||
3838 | 3749 | } while (x < width); | ||
3839 | 3750 | StoreFloats<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH)); | ||
3840 | 3751 | #else | ||
3841 | 3752 | // for SSE only | ||
3842 | 3753 | const ssize_t SIMD_WIDTH = 4; | ||
3843 | 3754 | __m128 vSum; | ||
3844 | 3755 | goto middle; | ||
3845 | 3756 | do | ||
3846 | 3757 | { | ||
3847 | 3758 | StoreFloats(&out[y][x - SIMD_WIDTH], vSum); // write out data from previous iteration | ||
3848 | 3759 | middle: | ||
3849 | 3760 | if (symmetric) | ||
3850 | 3761 | { | ||
3851 | 3762 | __m128 vIn; | ||
3852 | 3763 | LoadFloats(vIn, &in[y][x]); | ||
3853 | 3764 | vSum = Cast256To128(vFilter[0]) * vIn; | ||
3854 | 3765 | for (ssize_t i = 1; i <= filterSize; ++i) | ||
3855 | 3766 | { | ||
3856 | 3767 | ssize_t srcY = y - i; | ||
3857 | 3768 | if (onBorder) | ||
3858 | 3769 | srcY = max(ssize_t(0), srcY); | ||
3859 | 3770 | __m128 bottom, top; | ||
3860 | 3771 | LoadFloats(bottom, &in[srcY][x]); | ||
3861 | 3772 | |||
3862 | 3773 | srcY = y + i; | ||
3863 | 3774 | if (onBorder) | ||
3864 | 3775 | srcY = min(height - 1, srcY); | ||
3865 | 3776 | LoadFloats(top, &in[srcY][x]); | ||
3866 | 3777 | |||
3867 | 3778 | vSum = vSum + Cast256To128(vFilter[i]) * (bottom + top); | ||
3868 | 3779 | } | ||
3869 | 3780 | } | ||
3870 | 3781 | else | ||
3871 | 3782 | { | ||
3872 | 3783 | vSum = _mm_setzero_ps(); | ||
3873 | 3784 | ssize_t i = 0; | ||
3874 | 3785 | do | ||
3875 | 3786 | { | ||
3876 | 3787 | ssize_t srcY = y - filterSize / 2 + i; | ||
3877 | 3788 | if (onBorder) | ||
3878 | 3789 | srcY = min(height - 1, max(ssize_t(0), srcY)); | ||
3879 | 3790 | __m128 _vFilter = Cast256To128(vFilter[i]); | ||
3880 | 3791 | |||
3881 | 3792 | __m128 vIn; | ||
3882 | 3793 | LoadFloats(vIn, &in[srcY][x]); | ||
3883 | 3794 | vSum = vSum + _vFilter * vIn; | ||
3884 | 3795 | ++i; | ||
3885 | 3796 | } while (i < filterSize); | ||
3886 | 3797 | } | ||
3887 | 3798 | x += SIMD_WIDTH; | ||
3888 | 3799 | } while (x < width); | ||
3889 | 3800 | StoreFloats<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH)); | ||
3890 | 3801 | #endif | ||
3891 | 3802 | ++y; | ||
3892 | 3803 | } while (y < yEnd); | ||
3893 | 3804 | } | ||
3894 | 3805 | |||
3895 | 3806 | #else // DO_FIR_IN_FLOAT | ||
3896 | 3807 | |||
3897 | 3808 | // in-place (out = in) operation not allowed | ||
3898 | 3809 | template <bool symmetric, bool onBorder, typename OutType, typename InType, typename SIMD_Type> | ||
3899 | 3810 | void ConvolveVerticalFIR(SimpleImage<OutType> out, SimpleImage<InType> in, | ||
3900 | 3811 | ssize_t width, ssize_t height, | ||
3901 | 3812 | ssize_t yStart, ssize_t yEnd, | ||
3902 | 3813 | SIMD_Type *vFilter, int filterSize) | ||
3903 | 3814 | { | ||
3904 | 3815 | ssize_t y = yStart; | ||
3905 | 3816 | do | ||
3906 | 3817 | { | ||
3907 | 3818 | ssize_t x = 0; | ||
3908 | 3819 | #ifdef __AVX2__ | ||
3909 | 3820 | const ssize_t SIMD_WIDTH = 16; | ||
3910 | 3821 | __m256i vSum; | ||
3911 | 3822 | goto middle; | ||
3912 | 3823 | do | ||
3913 | 3824 | { | ||
3914 | 3825 | ScaleAndStoreInt16(&out[y][x - SIMD_WIDTH], vSum); // store data from previous iteration | ||
3915 | 3826 | middle: | ||
3916 | 3827 | if (symmetric) | ||
3917 | 3828 | { | ||
3918 | 3829 | __m256i center; | ||
3919 | 3830 | vSum = _mm256_mulhrs_epi16(vFilter[0], LoadAndScaleToInt16(center, &in[y][x])); | ||
3920 | 3831 | for (ssize_t i = 1; i <= filterSize; ++i) | ||
3921 | 3832 | { | ||
3922 | 3833 | __m256i filter = vFilter[i]; | ||
3923 | 3834 | ssize_t srcY = y + i; | ||
3924 | 3835 | if (onBorder) | ||
3925 | 3836 | srcY = min(srcY, height - 1); | ||
3926 | 3837 | __m256i topNeighbor; | ||
3927 | 3838 | LoadAndScaleToInt16(topNeighbor, &in[srcY][x]); | ||
3928 | 3839 | |||
3929 | 3840 | srcY = y - i; | ||
3930 | 3841 | if (onBorder) | ||
3931 | 3842 | srcY = max(srcY, ssize_t(0)); | ||
3932 | 3843 | __m256i bottomNeighbor; | ||
3933 | 3844 | LoadAndScaleToInt16(bottomNeighbor, &in[srcY][x]); | ||
3934 | 3845 | vSum = _mm256_adds_epi16 | ||
3935 | 3846 | ( | ||
3936 | 3847 | vSum, | ||
3937 | 3848 | _mm256_mulhrs_epi16 | ||
3938 | 3849 | ( | ||
3939 | 3850 | filter, | ||
3940 | 3851 | _mm256_adds_epi16(bottomNeighbor, topNeighbor) | ||
3941 | 3852 | ) | ||
3942 | 3853 | ); | ||
3943 | 3854 | } | ||
3944 | 3855 | } | ||
3945 | 3856 | else | ||
3946 | 3857 | { | ||
3947 | 3858 | throw 0; // non-symmetric kernels not implemented in the int16 path | ||
3948 | 3859 | } | ||
3949 | 3860 | x += SIMD_WIDTH; | ||
3950 | 3861 | } while (x < width); | ||
3951 | 3862 | ScaleAndStoreInt16<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH)); | ||
3952 | 3863 | #else | ||
3953 | 3864 | const ssize_t SIMD_WIDTH = 8; | ||
3954 | 3865 | __m128i vSum; | ||
3955 | 3866 | goto middle; | ||
3956 | 3867 | do | ||
3957 | 3868 | { | ||
3958 | 3869 | ScaleAndStoreInt16(&out[y][x - SIMD_WIDTH], vSum); // store data from previous iteration | ||
3959 | 3870 | middle: | ||
3960 | 3871 | if (symmetric) | ||
3961 | 3872 | { | ||
3962 | 3873 | __m128i center; | ||
3963 | 3874 | vSum = _mm_mulhrs_epi16(Cast256To128(vFilter[0]), LoadAndScaleToInt16(center, &in[y][x])); | ||
3964 | 3875 | for (ssize_t i = 1; i <= filterSize; ++i) | ||
3965 | 3876 | { | ||
3966 | 3877 | __m128i filter = Cast256To128(vFilter[i]); | ||
3967 | 3878 | ssize_t srcY = y + i; | ||
3968 | 3879 | if (onBorder) | ||
3969 | 3880 | srcY = min(srcY, height - 1); | ||
3970 | 3881 | __m128i topNeighbor; | ||
3971 | 3882 | LoadAndScaleToInt16(topNeighbor, &in[srcY][x]); | ||
3972 | 3883 | |||
3973 | 3884 | srcY = y - i; | ||
3974 | 3885 | if (onBorder) | ||
3975 | 3886 | srcY = max(srcY, ssize_t(0)); | ||
3976 | 3887 | __m128i bottomNeighbor; | ||
3977 | 3888 | LoadAndScaleToInt16(bottomNeighbor, &in[srcY][x]); | ||
3978 | 3889 | vSum = _mm_adds_epi16 | ||
3979 | 3890 | ( | ||
3980 | 3891 | vSum, | ||
3981 | 3892 | _mm_mulhrs_epi16 | ||
3982 | 3893 | ( | ||
3983 | 3894 | filter, | ||
3984 | 3895 | _mm_adds_epi16(bottomNeighbor, topNeighbor) | ||
3985 | 3896 | ) | ||
3986 | 3897 | ); | ||
3987 | 3898 | } | ||
3988 | 3899 | } | ||
3989 | 3900 | else | ||
3990 | 3901 | { | ||
3991 | 3902 | throw 0; // non-symmetric kernels not implemented in the int16 path | ||
3992 | 3903 | } | ||
3993 | 3904 | x += SIMD_WIDTH; | ||
3994 | 3905 | } while (x < width); | ||
3995 | 3906 | ScaleAndStoreInt16<true>(&out[y][x - SIMD_WIDTH], vSum, width - (x - SIMD_WIDTH)); | ||
3996 | 3907 | #endif | ||
3997 | 3908 | ++y; | ||
3998 | 3909 | } while (y < yEnd); | ||
3999 | 3910 | } | ||
4000 | 3911 | #endif | ||
4001 | 3912 | |||
4002 | 3913 | // in-place (out = in) operation not allowed | ||
4003 | 3914 | template <typename OutType, typename InType> | ||
4004 | 3915 | void ConvolveVerticalFIR(SimpleImage<OutType> out, | ||
4005 | 3916 | SimpleImage<InType> in, | ||
4006 | 3917 | ssize_t width, ssize_t height, | ||
4007 | 3918 | float sigmaY) | ||
4008 | 3919 | { | ||
4009 | 3920 | #ifdef DO_FIR_IN_FLOAT | ||
4010 | 3921 | typedef MyTraits<float>::SIMDtype SIMDtype; | ||
4011 | 3922 | #else | ||
4012 | 3923 | typedef MyTraits<int16_t>::SIMDtype SIMDtype; | ||
4013 | 3924 | #endif | ||
4014 | 3925 | int halfFilterSize = _effect_area_scr(sigmaY); | ||
4015 | 3926 | |||
4016 | 3927 | float *filter = (float *)alloca((halfFilterSize + 1) * sizeof(float)); | ||
4017 | 3928 | |||
4018 | 3929 | _make_kernel(filter, sigmaY); | ||
4019 | 3930 | |||
4020 | 3931 | SIMDtype *vFilter = (SIMDtype *)ALIGNED_ALLOCA((halfFilterSize + 1) * sizeof(SIMDtype), sizeof(SIMDtype)); | ||
4021 | 3932 | |||
4022 | 3933 | for (ssize_t i = 0; i <= halfFilterSize; ++i) | ||
4023 | 3934 | { | ||
4024 | 3935 | #ifdef DO_FIR_IN_FLOAT | ||
4025 | 3936 | BroadcastSIMD(vFilter[i], filter[i]); | ||
4026 | 3937 | #else | ||
4027 | 3938 | BroadcastSIMD(vFilter[i], clip_round_cast<int16_t>(filter[i] * 32768)); | ||
4028 | 3939 | #endif | ||
4029 | 3940 | } | ||
4030 | 3941 | |||
4031 | 3942 | const ssize_t IDEAL_Y_BLOCK_SIZE = 2; // currently no advantage to making this > 1 | ||
4032 | 3943 | |||
4033 | 3944 | #pragma omp parallel | ||
4034 | 3945 | { | ||
4035 | 3946 | #pragma omp for | ||
4036 | 3947 | for (ssize_t y = 0; y < height; y += IDEAL_Y_BLOCK_SIZE) | ||
4037 | 3948 | { | ||
4038 | 3949 | ssize_t yBlockSize = min(height - y, IDEAL_Y_BLOCK_SIZE); | ||
4039 | 3950 | bool onBorder = y < halfFilterSize || y + IDEAL_Y_BLOCK_SIZE + halfFilterSize > height; | ||
4040 | 3951 | if (onBorder) | ||
4041 | 3952 | { | ||
4042 | 3953 | ConvolveVerticalFIR<true, true>(out, in, | ||
4043 | 3954 | width, height, | ||
4044 | 3955 | y, y + yBlockSize, | ||
4045 | 3956 | vFilter, halfFilterSize); | ||
4046 | 3957 | |||
4047 | 3958 | } | ||
4048 | 3959 | else | ||
4049 | 3960 | { | ||
4050 | 3961 | ConvolveVerticalFIR<true, false>(out, in, | ||
4051 | 3962 | width, height, | ||
4052 | 3963 | y, y + yBlockSize, | ||
4053 | 3964 | vFilter, halfFilterSize); | ||
4054 | 3965 | } | ||
4055 | 3966 | } | ||
4056 | 3967 | } // omp parallel | ||
4057 | 3968 | } | ||
4058 | 3969 | |||
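Both FIR wrappers quantize the float kernel as clip_round_cast<int16_t>(filter[i] * 32768), i.e. Q15 fixed point, and the 8-bit input was pre-shifted left by 6 (x64, so at most 255 * 64 = 16320) so two neighbors can still be summed with the saturating adds below the int16 limit; presumably ScaleAndStoreInt16 shifts that x64 factor back out when writing 8-bit output (the debug scale = 1.0f / 64 below matches). A scalar model of what _mm_mulhrs_epi16 computes per lane, which is why Q15 taps round-trip cleanly; this sketches the intrinsic's semantics, it is not code from the patch:

    #include <stdint.h>

    // round((a * b) / 32768): widen, multiply, add the rounding constant 2^14, shift by 15
    static inline int16_t MulHRS(int16_t a, int16_t b)
    {
        return (int16_t)(((int32_t)a * (int32_t)b + (1 << 14)) >> 15);
    }

The only lossy corner of the real instruction is a == b == -32768, where the true result 32768 wraps; Gaussian taps are positive and below 1.0 in Q15, so that case cannot arise here.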
4059 | 3970 | template <int channels, typename OutType, typename InType> | ||
4060 | 3971 | void ConvolveFIR(SimpleImage<OutType> out, SimpleImage<InType> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY) | ||
4061 | 3972 | { | ||
4062 | 3973 | using namespace std::chrono; | ||
4063 | 3974 | #ifdef DO_FIR_IN_FLOAT | ||
4064 | 3975 | AlignedImage<float, sizeof(__m256)> horizontalFiltered; | ||
4065 | 3976 | #else | ||
4066 | 3977 | AlignedImage<int16_t, sizeof(__m256)> horizontalFiltered; | ||
4067 | 3978 | #endif | ||
4068 | 3979 | horizontalFiltered.Resize(width * channels, height); | ||
4069 | 3980 | |||
4070 | 3981 | const bool DO_TIMING = false; | ||
4071 | 3982 | |||
4072 | 3983 | high_resolution_clock::time_point t0; | ||
4073 | 3984 | if (DO_TIMING) | ||
4074 | 3985 | t0 = high_resolution_clock::now(); | ||
4075 | 3986 | |||
4076 | 3987 | ConvolveHorizontalFIR<channels>(horizontalFiltered, in, width, height, sigmaX); | ||
4077 | 3988 | |||
4078 | 3989 | if (DO_TIMING) | ||
4079 | 3990 | { | ||
4080 | 3991 | auto t1 = high_resolution_clock::now(); | ||
4081 | 3992 | cout << "T_horiz=" << duration_cast<milliseconds>(t1 - t0).count() << " ms" << endl; | ||
4082 | 3993 | t0 = t1; | ||
4083 | 3994 | } | ||
4084 | 3995 | |||
4085 | 3996 | // todo: use sliding window to reduce cache pollution | ||
4086 | 3997 | float scale = 1.0f; | ||
4087 | 3998 | #ifndef DO_FIR_IN_FLOAT | ||
4088 | 3999 | scale = 1.0f / 64; | ||
4089 | 4000 | #endif | ||
4090 | 4001 | |||
4091 | 4002 | //SaveImage("horizontal_filtered.png", horizontalFiltered, width, height, channels, scale); | ||
4092 | 4003 | ConvolveVerticalFIR(out, horizontalFiltered, width * channels, height, sigmaY); | ||
4093 | 4004 | if (DO_TIMING) | ||
4094 | 4005 | cout << "T_vert=" << duration_cast<milliseconds>(high_resolution_clock::now() - t0).count() << " ms" << endl; | ||
4095 | 4006 | } | ||
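ConvolveFIR above is the classic separable decomposition: a 2D Gaussian factors as G(x, y) = g(x) * g(y), so one horizontal and one vertical 1D pass replace the full 2D convolution. With the radius used in this file, r = _effect_area_scr(sigma) = ceil(3 * sigma), the per-pixel cost is

    taps_2D        = (2r + 1)^2
    taps_separable = 2 * (2r + 1)

so for example sigma = 10 gives r = 30: 61 * 61 = 3721 taps for a direct 2D kernel versus 2 * 61 = 122 for the two passes, and the symmetric folding above further cuts each pass to r + 1 = 31 multiplies.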
4096 | 0 | 4007 | ||
4097 | === modified file 'src/display/nr-filter-gaussian.cpp' | |||
4098 | --- src/display/nr-filter-gaussian.cpp 2014-03-27 01:33:44 +0000 | |||
4099 | +++ src/display/nr-filter-gaussian.cpp 2016-12-22 06:18:34 +0000 | |||
4100 | @@ -12,13 +12,13 @@ | |||
4101 | 12 | */ | 12 | */ |
4102 | 13 | 13 | ||
4103 | 14 | #include "config.h" // Needed for HAVE_OPENMP | 14 | #include "config.h" // Needed for HAVE_OPENMP |
4104 | 15 | |||
4105 | 16 | #include <algorithm> | 15 | #include <algorithm> |
4106 | 17 | #include <cmath> | 16 | #include <cmath> |
4107 | 18 | #include <complex> | 17 | #include <complex> |
4108 | 19 | #include <cstdlib> | 18 | #include <cstdlib> |
4109 | 20 | #include <glib.h> | 19 | #include <glib.h> |
4110 | 21 | #include <limits> | 20 | #include <limits> |
4111 | 21 | #include <typeinfo> | ||
4112 | 22 | #if HAVE_OPENMP | 22 | #if HAVE_OPENMP |
4113 | 23 | #include <omp.h> | 23 | #include <omp.h> |
4114 | 24 | #endif //HAVE_OPENMP | 24 | #endif //HAVE_OPENMP |
4115 | @@ -32,11 +32,18 @@ | |||
4116 | 32 | #include <2geom/affine.h> | 32 | #include <2geom/affine.h> |
4117 | 33 | #include "util/fixed_point.h" | 33 | #include "util/fixed_point.h" |
4118 | 34 | #include "preferences.h" | 34 | #include "preferences.h" |
4119 | 35 | #include <fstream> | ||
4120 | 36 | #include <iomanip> | ||
4121 | 37 | #include <cpuid.h> | ||
4122 | 38 | #include <chrono> | ||
4123 | 39 | #include "SimpleImage.h" | ||
4124 | 35 | 40 | ||
4125 | 36 | #ifndef INK_UNUSED | 41 | #ifndef INK_UNUSED |
4126 | 37 | #define INK_UNUSED(x) ((void)(x)) | 42 | #define INK_UNUSED(x) ((void)(x)) |
4127 | 38 | #endif | 43 | #endif |
4128 | 39 | 44 | ||
4129 | 45 | using namespace std; | ||
4130 | 46 | |||
4131 | 40 | // IIR filtering method based on: | 47 | // IIR filtering method based on: |
4132 | 41 | // L.J. van Vliet, I.T. Young, and P.W. Verbeek, Recursive Gaussian Derivative Filters, | 48 | // L.J. van Vliet, I.T. Young, and P.W. Verbeek, Recursive Gaussian Derivative Filters, |
4133 | 42 | // in: A.K. Jain, S. Venkatesh, B.C. Lovell (eds.), | 49 | // in: A.K. Jain, S. Venkatesh, B.C. Lovell (eds.), |
4134 | @@ -54,10 +61,12 @@ | |||
4135 | 54 | // filters are used). | 61 | // filters are used). |
4136 | 55 | static size_t const N = 3; | 62 | static size_t const N = 3; |
4137 | 56 | 63 | ||
4138 | 64 | #if __cplusplus < 201103 | ||
4139 | 57 | template<typename InIt, typename OutIt, typename Size> | 65 | template<typename InIt, typename OutIt, typename Size> |
4140 | 58 | inline void copy_n(InIt beg_in, Size N, OutIt beg_out) { | 66 | inline void copy_n(InIt beg_in, Size N, OutIt beg_out) { |
4141 | 59 | std::copy(beg_in, beg_in+N, beg_out); | 67 | std::copy(beg_in, beg_in+N, beg_out); |
4142 | 60 | } | 68 | } |
4143 | 69 | #endif | ||
4144 | 61 | 70 | ||
4145 | 62 | // Type used for IIR filter coefficients (can be 10.21 signed fixed point, see Anisotropic Gaussian Filtering Using Fixed Point Arithmetic, Christoph H. Lampert & Oliver Wirjadi, 2006) | 71 | // Type used for IIR filter coefficients (can be 10.21 signed fixed point, see Anisotropic Gaussian Filtering Using Fixed Point Arithmetic, Christoph H. Lampert & Oliver Wirjadi, 2006) |
4146 | 63 | typedef double IIRValue; | 72 | typedef double IIRValue; |
4147 | @@ -123,6 +132,11 @@ | |||
4148 | 123 | 132 | ||
4149 | 124 | FilterGaussian::FilterGaussian() | 133 | FilterGaussian::FilterGaussian() |
4150 | 125 | { | 134 | { |
4151 | 135 | extern void (*GaussianBlurIIR_Y8)(SimpleImage<uint8_t>, SimpleImage<uint8_t>, ssize_t, ssize_t, float, float); | ||
4152 | 136 | void InitializeSIMDFunctions(); | ||
4153 | 137 | if (GaussianBlurIIR_Y8 == NULL) | ||
4154 | 138 | InitializeSIMDFunctions(); | ||
4155 | 139 | |||
4156 | 126 | _deviation_x = _deviation_y = 0.0; | 140 | _deviation_x = _deviation_y = 0.0; |
4157 | 127 | } | 141 | } |
4158 | 128 | 142 | ||
4159 | @@ -142,8 +156,8 @@ | |||
4160 | 142 | return (int)std::ceil(std::fabs(deviation) * 3.0); | 156 | return (int)std::ceil(std::fabs(deviation) * 3.0); |
4161 | 143 | } | 157 | } |
4162 | 144 | 158 | ||
4165 | 145 | static void | 159 | template <typename FIRValue> |
4166 | 146 | _make_kernel(FIRValue *const kernel, double const deviation) | 160 | static void _make_kernel(FIRValue *const kernel, double const deviation) |
4167 | 147 | { | 161 | { |
4168 | 148 | int const scr_len = _effect_area_scr(deviation); | 162 | int const scr_len = _effect_area_scr(deviation); |
4169 | 149 | g_assert(scr_len >= 0); | 163 | g_assert(scr_len >= 0); |
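This hunk only turns _make_kernel into a template so one generator can fill both float kernels (for the float FIR path) and kernels destined for Q15 quantization; its body lies outside the diff. A sketch of what a normalized half-kernel generator conventionally computes over the scr_len = ceil(3 * deviation) support used by _effect_area_scr (shape and normalization assumed, not taken from the patch):

    #include <cmath>
    #include <vector>

    template <typename FIRValue>
    static void MakeKernelSketch(FIRValue *kernel, double deviation)
    {
        int scr_len = (int)std::ceil(std::fabs(deviation) * 3.0);
        std::vector<double> w(scr_len + 1);
        double sum = 0.0;
        for (int i = 0; i <= scr_len; ++i)
        {
            w[i] = std::exp(-0.5 * (i / deviation) * (i / deviation));
            sum += (i == 0) ? w[i] : 2.0 * w[i]; // off-center taps are applied twice
        }
        for (int i = 0; i <= scr_len; ++i)
            kernel[i] = FIRValue(w[i] / sum);   // normalized so the full kernel sums to 1
    }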
4170 | @@ -546,6 +560,559 @@ | |||
4171 | 546 | }; | 560 | }; |
4172 | 547 | } | 561 | } |
4173 | 548 | 562 | ||
4174 | 563 | #ifdef _MSC_VER | ||
4175 | 564 | #define FORCE_INLINE __forceinline | ||
4176 | 565 | #define ALIGN(x) __declspec(align(x)) | ||
4177 | 566 | #else | ||
4178 | 567 | #define FORCE_INLINE inline __attribute__((always_inline)) | ||
4179 | 568 | #define ALIGN(x) __attribute__((aligned(x))) | ||
4180 | 569 | #endif | ||
4181 | 570 | |||
4182 | 571 | void (*GaussianBlurIIR_Y8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY); | ||
4183 | 572 | void (*GaussianBlurIIR_R8G8B8A8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY); | ||
4184 | 573 | |||
4185 | 574 | void (*GaussianBlurFIR_Y8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY); | ||
4186 | 575 | void (*GaussianBlurFIR_R8G8B8A8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, float sigmaY); | ||
4187 | 576 | |||
4188 | 577 | void(*GaussianBlurHorizontalIIR_Y8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput); | ||
4189 | 578 | void(*GaussianBlurHorizontalIIR_R8G8B8A8)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaX, bool canOverwriteInput); | ||
4190 | 579 | void(*GaussianBlurVerticalIIR)(SimpleImage<uint8_t> out, SimpleImage<uint8_t> in, ssize_t width, ssize_t height, float sigmaY); // works for grayscale & RGBA | ||
4191 | 580 | |||
4192 | 581 | template <typename AnyType> | ||
4193 | 582 | void SaveImage(const char *path, SimpleImage<AnyType> in, int width, int height, int channels, float scale = 1.0f) | ||
4194 | 583 | { | ||
4195 | 584 | cairo_surface_t *scaled = cairo_image_surface_create(channels == 1 ? CAIRO_FORMAT_A8 : CAIRO_FORMAT_ARGB32, width, height); | ||
4196 | 585 | |||
4197 | 586 | uint8_t *_scaled = cairo_image_surface_get_data(scaled); | ||
4198 | 587 | ssize_t scaledPitch = cairo_image_surface_get_stride(scaled); | ||
4199 | 588 | for (int y = 0; y < height; ++y) | ||
4200 | 589 | { | ||
4201 | 590 | for (int x = 0; x < width * channels; ++x) | ||
4202 | 591 | { | ||
4203 | 592 | _scaled[y * scaledPitch + x] = min(in[y][x] * scale, 255.0f); | ||
4204 | 593 | } | ||
4205 | 594 | } | ||
4206 | 595 | cairo_surface_mark_dirty(scaled); | ||
4207 | 596 | if (cairo_surface_write_to_png(scaled, path) != CAIRO_STATUS_SUCCESS) | ||
4208 | 597 | throw 0; | ||
4209 | 598 | } | ||
4210 | 599 | |||
4211 | 600 | #if defined(__x86_64__) || defined(__i386__) || defined (_M_X64) || defined(_M_IX86) // if x86 processor | ||
4212 | 601 | |||
4213 | 602 | #include <immintrin.h> | ||
4214 | 603 | |||
4215 | 604 | #ifndef __GNUC__ | ||
4216 | 605 | #define __SSE__ | ||
4217 | 606 | #define __SSE2__ | ||
4218 | 607 | #endif | ||
4219 | 608 | |||
4220 | 609 | const float MAX_SIZE_FOR_SINGLE_PRECISION = 30.0f; // switch to double when sigma exceeds this threshold; otherwise round-off error grows so large that the output barely changes and annoying Mach bands appear | ||
4221 | 610 | |||
4222 | 611 | const size_t GUARANTEED_ALIGNMENT = 16; | ||
4223 | 612 | |||
4224 | 613 | #define ALIGNED_ALLOCA(size, alignment) RoundUp((size_t)alloca(size + (((alignment) / GUARANTEED_ALIGNMENT) - 1) * GUARANTEED_ALIGNMENT), alignment) | ||
4225 | 614 | |||
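ALIGNED_ALLOCA leans on the guarantee that alloca() storage is already aligned to GUARANTEED_ALIGNMENT (16): it over-allocates by (alignment / 16 - 1) * 16 bytes and rounds the pointer up, so the aligned block still fits inside the allocation. RoundUp is not defined in this diff; a plausible shape, labeled as an assumption:

    #include <stddef.h>

    // assumed round-to-multiple helper; the macro applies it to the alloca
    // pointer reinterpreted as size_t
    template <typename T>
    static inline T RoundUp(T value, T multiple)
    {
        return (value + multiple - 1) / multiple * multiple;
    }

    // e.g. a 32-byte-aligned scratch row of floats:
    //   float *row = (float *)ALIGNED_ALLOCA(width * sizeof(float), 32);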
4226 | 615 | const ALIGN(32) uint8_t PARTIAL_VECTOR_MASK[64] = | ||
4227 | 616 | { | ||
4228 | 617 | 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, | ||
4229 | 618 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 | ||
4230 | 619 | }; | ||
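PartialVectorMask and PartialVectorMask32 (defined in gaussian_blur_templates.h, not shown in this diff) presumably window into this table: the first 32 bytes are 0xff and the last 32 are 0x00, so an unaligned load that starts n bytes before the boundary yields a vector whose first n bytes are set. Those masks drive the border blends and the partial stores throughout the code above. A sketch of the assumed 128-bit accessor:

    #include <emmintrin.h>
    #include <stddef.h>

    // n = number of valid leading bytes, 0 <= n <= 16; the 256-bit variant
    // would load 32 bytes from the same offset with _mm256_loadu_si256
    static inline __m128i PartialVectorMaskSketch(size_t n)
    {
        return _mm_loadu_si128((const __m128i *)&PARTIAL_VECTOR_MASK[32 - n]);
    }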
4231 | 620 | |||
4232 | 621 | #ifdef _MSC_VER | ||
4233 | 622 | #define __FMA__ | ||
4234 | 623 | #define __AVX2__ | ||
4235 | 624 | #define __AVX__ | ||
4236 | 625 | #define __SSE4_1__ | ||
4237 | 626 | #define __SSSE3__ | ||
4238 | 627 | #include "gaussian_blur_templates.h" | ||
4239 | 628 | |||
4240 | 629 | #else | ||
4241 | 630 | |||
4242 | 631 | #pragma GCC push_options | ||
4243 | 632 | |||
4244 | 633 | namespace AVX2 | ||
4245 | 634 | { | ||
4246 | 635 | #pragma GCC target("fma,avx2") | ||
4247 | 636 | #define __FMA__ | ||
4248 | 637 | #define __AVX2__ | ||
4249 | 638 | #define __AVX__ | ||
4250 | 639 | #define __SSE4_1__ | ||
4251 | 640 | #define __SSSE3__ | ||
4252 | 641 | #include "gaussian_blur_templates.h" | ||
4253 | 642 | } | ||
4254 | 643 | |||
4255 | 644 | namespace AVX | ||
4256 | 645 | { | ||
4257 | 646 | #pragma GCC target("avx") | ||
4258 | 647 | #undef __AVX2__ | ||
4259 | 648 | #undef __FMA__ | ||
4260 | 649 | |||
4261 | 650 | #include "gaussian_blur_templates.h" | ||
4262 | 651 | } | ||
4263 | 652 | |||
4264 | 653 | namespace SSE2 | ||
4265 | 654 | { | ||
4266 | 655 | //#ifdef __x86_64__ | ||
4267 | 656 | // #pragma GCC target("default") // base x86_64 ISA already has SSE2 | ||
4268 | 657 | //#else | ||
4269 | 658 | #pragma GCC target("sse2") | ||
4270 | 659 | //#endif | ||
4271 | 660 | #undef __AVX__ | ||
4272 | 661 | #undef __SSE4_1__ | ||
4273 | 662 | #undef __SSSE3__ | ||
4274 | 663 | #include "gaussian_blur_templates.h" | ||
4275 | 664 | } | ||
4276 | 665 | |||
4277 | 666 | #pragma GCC pop_options | ||
4278 | 667 | |||
4279 | 668 | #endif | ||
4280 | 669 | |||
4281 | 670 | void InitializeSIMDFunctions() | ||
4282 | 671 | { | ||
4283 | 672 | #ifdef __GNUC__ | ||
4284 | 673 | enum SIMDarch { SSE2, AVX, AVX2 }; | ||
4285 | 674 | const char *SIMDarchNames[] = { "SSE2", "AVX", "AVX2" }; | ||
4286 | 675 | SIMDarch arch = SSE2; // default: any x86_64 CPU has at least SSE2 | ||
4287 | 676 | if (getenv("FORCE_SIMD") != NULL) | ||
4288 | 677 | { | ||
4289 | 678 | arch = (SIMDarch)atoi(getenv("FORCE_SIMD")); | ||
4290 | 679 | } | ||
4291 | 680 | else | ||
4292 | 681 | { | ||
4293 | 682 | unsigned cpuInfo[4]; | ||
4294 | 683 | // version in cpuid.h is buggy - forgets to clear ecx | ||
4295 | 684 | #define __cpuid(level, a, b, c, d) \ | ||
4296 | 685 | __asm__("xor %%ecx, %%ecx\n" \ | ||
4297 | 686 | "cpuid\n" \ | ||
4298 | 687 | : "=a"(a), "=b"(b), "=c"(c), "=d"(d) \ | ||
4299 | 688 | : "0"(level)) | ||
4300 | 689 | int maxLevel = __get_cpuid_max(0, 0); | ||
4301 | 690 | |||
4302 | 691 | cpuInfo[1] = 0; | ||
4303 | 692 | if (maxLevel >= 7) | ||
4304 | 693 | __cpuid(7, cpuInfo[0], cpuInfo[1], cpuInfo[2], cpuInfo[3]); | ||
4305 | 694 | |||
4306 | 695 | if (cpuInfo[1] & bit_AVX2) | ||
4307 | 696 | { | ||
4308 | 697 | arch = AVX2; | ||
4309 | 698 | } | ||
4310 | 699 | else | ||
4311 | 700 | { | ||
4312 | 701 | __cpuid(1, cpuInfo[0], cpuInfo[1], cpuInfo[2], cpuInfo[3]); | ||
4313 | 702 | if (cpuInfo[2] & bit_AVX) | ||
4314 | 703 | arch = AVX; | ||
4315 | 704 | else if (cpuInfo[3] & bit_SSE2) | ||
4316 | 705 | arch = SSE2; | ||
4317 | 706 | } | ||
4318 | 707 | } | ||
4319 | 708 | cout << "using " << SIMDarchNames[arch] << " functions" << endl; | ||
4320 | 709 | switch (arch) | ||
4321 | 710 | { | ||
4322 | 711 | case SSE2: | ||
4323 | 712 | GaussianBlurIIR_Y8 = SSE2::Convolve<1>; | ||
4324 | 713 | GaussianBlurIIR_R8G8B8A8 = SSE2::Convolve<4>; | ||
4325 | 714 | GaussianBlurFIR_Y8 = SSE2::ConvolveFIR<1>; | ||
4326 | 715 | GaussianBlurFIR_R8G8B8A8 = SSE2::ConvolveFIR<4>; | ||
4327 | 716 | GaussianBlurHorizontalIIR_Y8 = SSE2::ConvolveHorizontal<false, 1>; | ||
4328 | 717 | GaussianBlurHorizontalIIR_R8G8B8A8 = SSE2::ConvolveHorizontal<false, 4>; | ||
4329 | 718 | GaussianBlurVerticalIIR = SSE2::ConvolveVertical; | ||
4330 | 719 | break; | ||
4331 | 720 | case AVX: | ||
4332 | 721 | GaussianBlurIIR_Y8 = AVX::Convolve<1>; | ||
4333 | 722 | GaussianBlurIIR_R8G8B8A8 = AVX::Convolve<4>; | ||
4334 | 723 | GaussianBlurFIR_Y8 = AVX::ConvolveFIR<1>; | ||
4335 | 724 | GaussianBlurFIR_R8G8B8A8 = AVX::ConvolveFIR<4>; | ||
4336 | 725 | GaussianBlurHorizontalIIR_Y8 = AVX::ConvolveHorizontal<false, 1>; | ||
4337 | 726 | GaussianBlurHorizontalIIR_R8G8B8A8 = AVX::ConvolveHorizontal<false, 4>; | ||
4338 | 727 | GaussianBlurVerticalIIR = AVX::ConvolveVertical; | ||
4339 | 728 | break; | ||
4340 | 729 | case AVX2: | ||
4341 | 730 | GaussianBlurIIR_Y8 = AVX2::Convolve<1>; | ||
4342 | 731 | GaussianBlurIIR_R8G8B8A8 = AVX2::Convolve<4>; | ||
4343 | 732 | GaussianBlurFIR_Y8 = AVX2::ConvolveFIR<1>; | ||
4344 | 733 | GaussianBlurFIR_R8G8B8A8 = AVX2::ConvolveFIR<4>; | ||
4345 | 734 | GaussianBlurHorizontalIIR_Y8 = AVX2::ConvolveHorizontal<false, 1>; | ||
4346 | 735 | GaussianBlurHorizontalIIR_R8G8B8A8 = AVX2::ConvolveHorizontal<false, 4>; | ||
4347 | 736 | GaussianBlurVerticalIIR = AVX2::ConvolveVertical; | ||
4348 | 737 | break; | ||
4349 | 738 | } | ||
4350 | 739 | #else | ||
4351 | 740 | GaussianBlurIIR_Y8 = Convolve<1>; | ||
4352 | 741 | GaussianBlurIIR_R8G8B8A8 = Convolve<4>; | ||
4353 | 742 | GaussianBlurFIR_Y8 = ConvolveFIR<1>; | ||
4354 | 743 | GaussianBlurFIR_R8G8B8A8 = ConvolveFIR<4>; | ||
4355 | 744 | GaussianBlurHorizontalIIR_Y8 = ConvolveHorizontal<false, 1>; | ||
4356 | 745 | GaussianBlurHorizontalIIR_R8G8B8A8 = ConvolveHorizontal<false, 4>; | ||
4357 | 746 | GaussianBlurVerticalIIR = ConvolveVertical; | ||
4358 | 747 | #endif | ||
4359 | 748 | } | ||
4360 | 749 | #else | ||
4361 | 750 | void InitializeSIMDFunctions() | ||
4362 | 751 | { | ||
4363 | 752 | } | ||
4364 | 753 | #endif | ||
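The dispatch above in miniature: the same templates header is compiled several times under different #pragma GCC target options, each copy landing in its own namespace, and a single CPUID probe at startup binds the function pointers. The sketch below shows the shape using GCC's __builtin_cpu_supports for brevity, whereas the patch issues raw cpuid to work around the cpuid.h bug it notes; all names are illustrative:

    static void BlurRowScalar(float *row, int n)
    {
        (void)row; (void)n; // portable baseline body
    }

    #pragma GCC push_options
    #pragma GCC target("avx2")
    static void BlurRowAVX2(float *row, int n)
    {
        (void)row; (void)n; // same source, recompiled with AVX2 enabled
    }
    #pragma GCC pop_options

    void (*BlurRow)(float *, int) = 0;

    void InitBlurRowDispatch()
    {
        __builtin_cpu_init(); // required before the detection builtins on older GCC
        BlurRow = __builtin_cpu_supports("avx2") ? BlurRowAVX2 : BlurRowScalar;
    }

Binding once at startup keeps the per-call cost at one indirect jump, the same trade the GaussianBlur* pointers above make.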

#ifdef UNIT_TEST

template <typename AnyType>
void CompareImages(SimpleImage<AnyType> ref, SimpleImage<AnyType> actual, int w, int h)
{
    double avgDiff = 0,
           maxDiff = 0,
           maxErrorFrac = 0;
    for (int y = 0; y < h; ++y)
    {
        for (int x = 0; x < w; ++x)
        {
            // use fabs(): plain abs() can resolve to the integer overload here
            double diff = fabs((double)actual[y][x] - (double)ref[y][x]);
            maxDiff = max(diff, maxDiff);
            avgDiff += diff;
            if (ref[y][x] != 0)
            {
                double errorFrac = fabs(diff / ref[y][x]);
                maxErrorFrac = max(errorFrac, maxErrorFrac);
            }
        }
    }
    avgDiff /= (w * h);
    cout << "avgDiff=" << setprecision(4) << setw(9) << avgDiff
         << " maxDiff=" << setw(4) << maxDiff
         << " maxErrorFrac=" << setprecision(4) << setw(8) << maxErrorFrac << endl;
}
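CompareImages() only prints statistics, so nothing fails automatically when a SIMD path regresses. A minimal pass/fail wrapper could look like this (a sketch; the tolerance of 2 gray levels is an illustrative assumption, not a tuned value):

    template <typename AnyType>
    bool ImagesMatch(SimpleImage<AnyType> ref, SimpleImage<AnyType> actual,
                     int w, int h, double maxAbsDiff = 2.0)
    {
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                if (fabs((double)actual[y][x] - (double)ref[y][x]) > maxAbsDiff)
                    return false;   // fail fast on the first out-of-tolerance pixel
        return true;
    }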

void RefFilterIIR(cairo_surface_t *out, cairo_surface_t *in,
                  float deviation_x, float deviation_y)
{
    const int MAX_THREADS = 16;
    // clamp so a machine with more than MAX_THREADS cores can't overflow tmpdata[]
    int threads = std::min(omp_get_max_threads(), MAX_THREADS);

    int h = cairo_image_surface_get_height(in),
        w = cairo_image_surface_get_width(in);

    IIRValue * tmpdata[MAX_THREADS];
    for (int i = 0; i < threads; ++i)
        tmpdata[i] = new IIRValue[std::max(w, h)*4];

    gaussian_pass_IIR(Geom::X, deviation_x, in, out, tmpdata, threads);
    gaussian_pass_IIR(Geom::Y, deviation_y, out, out, tmpdata, threads);

    for (int i = 0; i < threads; ++i)
        delete[] tmpdata[i];
}

void RefFilterFIR(cairo_surface_t *out, cairo_surface_t *in,
                  float deviation_x, float deviation_y)
{
    int threads = omp_get_max_threads();
    gaussian_pass_FIR(Geom::X, deviation_x, in, out, threads);
    gaussian_pass_FIR(Geom::Y, deviation_y, out, out, threads);
}


cairo_surface_t *ConvertToGrayscale(cairo_surface_t *s)
{
    int w = cairo_image_surface_get_width(s),
        h = cairo_image_surface_get_height(s);
    cairo_surface_t *temp = cairo_image_surface_create(CAIRO_FORMAT_ARGB32, w, h),
                    *grayScale = cairo_image_surface_create(CAIRO_FORMAT_A8, w, h);

    cairo_t *r = cairo_create(temp);
    // clear to black
    cairo_set_source_rgb(r, 0, 0, 0);
    cairo_paint(r);

    // convert to grayscale: blending with HSL_LUMINOSITY over black leaves just the source luminosity
    cairo_set_operator(r, CAIRO_OPERATOR_HSL_LUMINOSITY);
    cairo_set_source_surface(r, s, 0, 0);
    cairo_paint(r);
    cairo_destroy(r);

    ssize_t inPitch = cairo_image_surface_get_stride(temp),
            outPitch = cairo_image_surface_get_stride(grayScale);
    uint8_t *in = cairo_image_surface_get_data(temp),
            *out = cairo_image_surface_get_data(grayScale);
    for (int y = 0; y < h; ++y)
    {
        for (int x = 0; x < w; ++x)
        {
            // all ARGB32 channels are equal after the luminosity blend; take one byte per pixel
            out[y * outPitch + x] = in[y * inPitch + x * 4];
        }
    }
    cairo_surface_destroy(temp);
    cairo_surface_mark_dirty(grayScale);
    return grayScale;
}
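The HSL_LUMINOSITY trick keeps the conversion inside cairo. For comparison, the same thing could be done directly on the pixels; a sketch using Rec. 709 luma weights in 16.16 fixed point (an alternative formula, not what the patch uses, and it assumes opaque pixels since cairo's ARGB32 is premultiplied):

    // 0.2126*R + 0.7152*G + 0.0722*B; the fixed-point weights sum to exactly 65536
    static inline uint8_t Luma709(uint8_t r, uint8_t g, uint8_t b)
    {
        return (uint8_t)((13933u * r + 46871u * g + 4732u * b) >> 16);
    }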

void CopySurface(cairo_surface_t *out, cairo_surface_t *in)
{
    int pitch = cairo_image_surface_get_stride(in),
        h = cairo_image_surface_get_height(in);

    // assumes both surfaces have the same stride and format
    memcpy(cairo_image_surface_get_data(out), cairo_image_surface_get_data(in), pitch * h);
    cairo_surface_mark_dirty(out);
}
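If the two surfaces ever came from allocators with different row padding, the single memcpy above would be wrong; a hedged row-by-row variant (a sketch, with a hypothetical name, for the case of mismatched strides):

    void CopySurfaceRows(cairo_surface_t *out, cairo_surface_t *in)
    {
        int h = cairo_image_surface_get_height(in);
        ssize_t inPitch = cairo_image_surface_get_stride(in),
                outPitch = cairo_image_surface_get_stride(out);
        uint8_t *src = cairo_image_surface_get_data(in),
                *dst = cairo_image_surface_get_data(out);
        size_t rowBytes = (size_t)std::min(inPitch, outPitch);
        for (int y = 0; y < h; ++y)
            memcpy(dst + y * outPitch, src + y * inPitch, rowBytes);
        cairo_surface_mark_dirty(out);
    }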

extern "C" int main(int argc, char **argv)
{
    using namespace boost::chrono;
    bool compareOrBenchmark = false;
    const char *imagePath = "../drmixx/rasterized/99_showdown_carcass.png";
    // flush denormals to zero; Intel handles denormals in slow microcode, unlike NVIDIA hardware
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    cairo_surface_t *in;
    for (int i = 1; i < argc; ++i)
    {
        if (strcmp(argv[i], "-b") == 0)
            compareOrBenchmark = true;
        else
            imagePath = argv[i];
    }
    in = cairo_image_surface_create_from_png(imagePath);

    // Note: rows that aren't 16-byte aligned can be almost 2x slower!
    //iluScale(RoundUp(ilGetInteger(IL_IMAGE_WIDTH), 4), RoundUp(ilGetInteger(IL_IMAGE_HEIGHT), 4), 0);

    if (cairo_surface_status(in) != CAIRO_STATUS_SUCCESS)
    {
        cerr << "error loading " << imagePath << endl;
        return -1;
    }

    //if (!iluScale(38, 44, 1))
    //    cerr << "error scaling" << endl;
    int originalHeight = cairo_image_surface_get_height(in),
        originalWidth = cairo_image_surface_get_width(in);

    cairo_surface_t *grayScaleIn = ConvertToGrayscale(in);
    if (GaussianBlurIIR_Y8 == NULL)
        InitializeSIMDFunctions();

    auto IterateCombinations = [&](int adjustedWidth0, int adjustedWidth1, int adjustedHeight0, int adjustedHeight1, auto callback)
    {
        for (int adjustedHeight = adjustedHeight0; adjustedHeight <= adjustedHeight1; ++adjustedHeight)
        {
            for (int adjustedWidth = adjustedWidth0; adjustedWidth <= adjustedWidth1; ++adjustedWidth)
            {
                for (int channels = 1; channels <= 4; channels += 3)
                {
                    cairo_surface_t *modifiedIn = cairo_surface_create_similar_image(in, channels == 1 ? CAIRO_FORMAT_A8 : CAIRO_FORMAT_ARGB32, adjustedWidth, adjustedHeight);
                    if (cairo_surface_status(modifiedIn) != CAIRO_STATUS_SUCCESS)
                    {
                        cerr << "error creating surface" << endl;
                        continue;
                    }
                    cairo_t *ct = cairo_create(modifiedIn);
                    cairo_scale(ct, double(adjustedWidth) / originalWidth, double(adjustedHeight) / originalHeight);
                    // scale/convert to the given size/format
                    if (channels == 1)
                    {
                        cairo_set_source_rgb(ct, 1, 1, 1);
                        cairo_mask_surface(ct, grayScaleIn, 0, 0);
                    }
                    else
                    {
                        cairo_set_source_surface(ct, in, 0, 0);
                        cairo_paint(ct);
                    }
                    cairo_destroy(ct);

                    cairo_surface_t *out = cairo_surface_create_similar_image(modifiedIn, cairo_image_surface_get_format(modifiedIn), adjustedWidth, adjustedHeight);

                    for (int FIRorIIR = 0; FIRorIIR < 2; ++FIRorIIR)
                    {
                        for (int inPlace = 0; inPlace < 2; ++inPlace)
                        {
                            if (inPlace)
                                CopySurface(out, modifiedIn); // restore the overwritten input image

                            callback(out, inPlace ? out : modifiedIn, modifiedIn, inPlace, FIRorIIR);
                        }
                    }
                    cairo_surface_destroy(modifiedIn);
                    cairo_surface_destroy(out);
                }
            }
        }
    };
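The RoundDown/RoundUp helpers used below are not part of this excerpt; presumably they are simple integer alignment utilities along these lines (a sketch under that assumption):

    static inline int RoundDown(int x, int multiple) { return x - x % multiple; }
    static inline int RoundUp(int x, int multiple)   { return RoundDown(x + multiple - 1, multiple); }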
    if (!compareOrBenchmark)
    {
        auto CompareFunction = [&](cairo_surface_t *out, cairo_surface_t *in, cairo_surface_t *backupIn, bool inPlace, bool FIRorIIR)
        {
            // here, we assume input & output have the same format
            int channels = cairo_image_surface_get_format(in) == CAIRO_FORMAT_ARGB32 ? 4 : 1;

            SimpleImage<uint8_t> _in(cairo_image_surface_get_data(in), cairo_image_surface_get_stride(in)),
                                 _out(cairo_image_surface_get_data(out), cairo_image_surface_get_stride(out));
            int width = cairo_image_surface_get_width(in),
                height = cairo_image_surface_get_height(in);

            cairo_surface_t *refOut = cairo_surface_create_similar_image(in, channels == 1 ? CAIRO_FORMAT_A8 : CAIRO_FORMAT_ARGB32, width, height);

            SimpleImage<uint8_t> _refOut(cairo_image_surface_get_data(refOut), cairo_image_surface_get_stride(refOut));
            bool originalSize = width == originalWidth && height == originalHeight;

            // test the correctness of different sigmas only for the original image size;
            // testing multiple sigmas for scaled image sizes is a waste
            float DEFAULT_SIGMA = 5.0f;

            float sigmaX0 = originalSize ? 0.5f : DEFAULT_SIGMA,
                  sigmaX1 = originalSize ? (FIRorIIR ? 64 : 4) : DEFAULT_SIGMA;
            for (float sigmaX = sigmaX0; sigmaX <= sigmaX1; sigmaX *= 2)
            {
                float sigmaY = 1.2f * sigmaX;

                cout << width << "x" << height
                     << setw(10) << (channels == 4 ? " RGBA" : " grayscale")
                     << (FIRorIIR ? " IIR" : " FIR")
                     << setw(15) << (inPlace ? " in-place" : " out-of-place")
                     << " sigmaX=" << setw(3) << sigmaX << " ";

                if (inPlace)
                {
                    CopySurface(out, backupIn);
                }

                if (FIRorIIR)
                {
                    if (channels == 1)
                        GaussianBlurIIR_Y8(_out, _in, width, height, sigmaX, sigmaY);
                    else
                        GaussianBlurIIR_R8G8B8A8(_out, _in, width, height, sigmaX, sigmaY);
                }
                else
                {
                    if (channels == 1)
                        GaussianBlurFIR_Y8(_out, _in, width, height, sigmaX, sigmaY);
                    else
                        GaussianBlurFIR_R8G8B8A8(_out, _in, width, height, sigmaX, sigmaY);
                }

                // --------------------- reference
                cairo_surface_t *refIn;
                if (inPlace)
                {
                    refIn = refOut;
                    CopySurface(refIn, backupIn);
                }
                else
                {
                    refIn = in;
                }

                if (FIRorIIR)
                    RefFilterIIR(refOut, refIn, sigmaX, sigmaY);
                else
                    RefFilterFIR(refOut, refIn, sigmaX, sigmaY);

                if (0) // debug dump, e.g. FIRorIIR && width == 1466 && sigmaX == 0.5f
                {
                    cout << " dumping ";
                    cairo_surface_write_to_png(refOut, "filtered_ref.png");
                    cairo_surface_write_to_png(out, "filtered_opt.png");
                    exit(1);
                }

                CompareImages(_refOut, _out, width, height);
            }
            cairo_surface_destroy(refOut);
        };

        const int SIMD_Y_BLOCK_SIZE = 2, // for checking SIMD remainder handling issues
                  SIMD_X_BLOCK_SIZE = 4;
        int h0 = RoundDown(originalHeight, SIMD_Y_BLOCK_SIZE),
            w0 = RoundDown(originalWidth, SIMD_X_BLOCK_SIZE);
        IterateCombinations(w0, w0 + SIMD_X_BLOCK_SIZE, h0, h0 + SIMD_Y_BLOCK_SIZE, CompareFunction);
        //IterateCombinations(4, 8, 4, 8, CompareFunction);
    }
    else
    {
        // benchmark
#if defined(_WIN32) && defined(_DEBUG)
        const int REPEATS = 1;
#else
        // assume the parallel speedup is ~sqrt(#cores) and scale the repeat count to match
        const int REPEATS = int(50 * max(1.0f, sqrtf(float(omp_get_max_threads()))));
#endif
        auto BenchmarkFunction = [&](cairo_surface_t *out, cairo_surface_t *in, cairo_surface_t *backupIn, bool inPlace, bool FIRorIIR)
        {
            // here, we assume input & output have the same format
            int channels = cairo_image_surface_get_format(in) == CAIRO_FORMAT_ARGB32 ? 4 : 1;

            SimpleImage<uint8_t> _in(cairo_image_surface_get_data(in), cairo_image_surface_get_stride(in)),
                                 _out(cairo_image_surface_get_data(out), cairo_image_surface_get_stride(out));
            int width = cairo_image_surface_get_width(in),
                height = cairo_image_surface_get_height(in);
            const bool useRefCode = false;

            float sigma0, sigma1;
            if (FIRorIIR)
            {
                // test both single precision & double precision throughput
                sigma0 = MAX_SIZE_FOR_SINGLE_PRECISION * 0.75f;
                sigma1 = sigma0 * 2.0f;
            }
            else
            {
                sigma0 = 0.5f;
                sigma1 = 4;
            }

            for (float sigma = sigma0; sigma <= sigma1; sigma *= 2)
            {
                cout << width << "x" << height
                     << setw(10) << (channels == 4 ? " RGBA" : " grayscale")
                     << (FIRorIIR ? " IIR" : " FIR")
                     << setw(15) << (inPlace ? " in-place" : " out-of-place")
                     << " sigma=" << setw(3) << sigma;
                high_resolution_clock::duration dt(0);
                for (int i = 0; i < REPEATS; ++i)
                {
                    if (inPlace)
                        CopySurface(out, backupIn); // copy backup to input/output

                    auto t0 = high_resolution_clock::now();
                    if (useRefCode)
                    {
                        if (FIRorIIR)
                            RefFilterIIR(out, in, sigma, sigma);
                        else
                            RefFilterFIR(out, in, sigma, sigma);
                    }
                    else
                    {
                        if (FIRorIIR)
                        {
                            if (channels == 1)
                                GaussianBlurIIR_Y8(_out, _in, width, height, sigma, sigma);
                            else
                                GaussianBlurIIR_R8G8B8A8(_out, _in, width, height, sigma, sigma);
                        }
                        else
                        {
                            if (channels == 1)
                                GaussianBlurFIR_Y8(_out, _in, width, height, sigma, sigma);
                            else
                                GaussianBlurFIR_R8G8B8A8(_out, _in, width, height, sigma, sigma);
                        }
                    }
                    dt += high_resolution_clock::now() - t0;
                }
                // cast to double before multiplying to avoid 32-bit overflow on large images
                cout << setw(9) << setprecision(3) << double(width) * height * REPEATS / duration_cast<microseconds>(dt).count() << " Mpix/s" << endl;
            }
        };
        int roundedWidth = RoundUp(originalWidth, 4),
            roundedHeight = RoundUp(originalHeight, 4);
        IterateCombinations(roundedWidth, roundedWidth, roundedHeight, roundedHeight, BenchmarkFunction);
    }
    cairo_surface_destroy(in);
    cairo_surface_destroy(grayScaleIn);
    return 0;
}

#endif

void FilterGaussian::render_cairo(FilterSlot &slot)
{
    cairo_surface_t *in = slot.getcairo(_input);
@@ -645,22 +1212,97 @@
    }
    cairo_surface_flush(downsampled);

    SimpleImage<uint8_t> im((uint8_t *)cairo_image_surface_get_data(downsampled), cairo_image_surface_get_stride(downsampled));

    // Benefits of the fused 2D filter:
    // 1. the intermediate image can be kept at higher precision than uint8
    // 2. less cache pollution, since the intermediate image is never flushed to memory between the passes
    if (scr_len_x > 0 && scr_len_y > 0 && use_IIR_x == use_IIR_y && GaussianBlurIIR_Y8 != NULL)
    {
        if (fmt == CAIRO_FORMAT_ARGB32) {
            if (use_IIR_x) {
                GaussianBlurIIR_R8G8B8A8(im, // out
                                         im, // in
                                         cairo_image_surface_get_width(downsampled),
                                         cairo_image_surface_get_height(downsampled),
                                         deviation_x, deviation_y);
            }
            else {
                GaussianBlurFIR_R8G8B8A8(im, // out
                                         im, // in
                                         cairo_image_surface_get_width(downsampled),
                                         cairo_image_surface_get_height(downsampled),
                                         deviation_x, deviation_y);
            }
        }
        else {
            if (use_IIR_x) {
                GaussianBlurIIR_Y8(im,
                                   im,
                                   cairo_image_surface_get_width(downsampled),
                                   cairo_image_surface_get_height(downsampled),
                                   deviation_x, deviation_y);
            }
            else {
                GaussianBlurFIR_Y8(im,
                                   im,
                                   cairo_image_surface_get_width(downsampled),
                                   cairo_image_surface_get_height(downsampled),
                                   deviation_x, deviation_y);
            }
        }
    }
    else
    {
        if (scr_len_x > 0) {
            if (use_IIR_x && GaussianBlurIIR_Y8 != NULL) {
                if (fmt == CAIRO_FORMAT_ARGB32)
                {
                    GaussianBlurHorizontalIIR_R8G8B8A8(im, // out
                                                       im, // in
                                                       cairo_image_surface_get_width(downsampled),
                                                       cairo_image_surface_get_height(downsampled),
                                                       deviation_x, false);
                }
                else {
                    GaussianBlurHorizontalIIR_Y8(im, // out
                                                 im, // in
                                                 cairo_image_surface_get_width(downsampled),
                                                 cairo_image_surface_get_height(downsampled),
                                                 deviation_x, false);
                    //gaussian_pass_IIR(Geom::X, deviation_x, downsampled, downsampled, tmpdata, threads);
                }
            } else {
                // the optimized 1D FIR filter can't work in-place, so keep the original pass here
                gaussian_pass_FIR(Geom::X, deviation_x, downsampled, downsampled, threads);
            }
        }

        if (scr_len_y > 0) {
            if (use_IIR_y && GaussianBlurIIR_Y8 != NULL) {
                if (fmt == CAIRO_FORMAT_ARGB32)
                {
                    // the vertical pass is channel-agnostic: an RGBA image is just 4*width independent byte columns
                    GaussianBlurVerticalIIR(im, // out
                                            im, // in
                                            cairo_image_surface_get_width(downsampled) * 4,
                                            cairo_image_surface_get_height(downsampled),
                                            deviation_y);
                }
                else
                {
                    GaussianBlurVerticalIIR(im, // out
                                            im, // in
                                            cairo_image_surface_get_width(downsampled),
                                            cairo_image_surface_get_height(downsampled),
                                            deviation_y);
                    //gaussian_pass_IIR(Geom::Y, deviation_y, downsampled, downsampled, tmpdata, threads);
                }
            } else {
                // the optimized 1D FIR filter can't work in-place, so keep the original pass here
                gaussian_pass_FIR(Geom::Y, deviation_y, downsampled, downsampled, threads);
            }
        }
    }
    // free the temporary data
    if ( use_IIR_x || use_IIR_y ) {
        for(int i = 0; i < threads; ++i) {
Hi Yale, could you fix up the coding style? I tried to do a review, but this is complex code and my skills in this area are limited. What about having it coexist with the current code, with a preferences check to enable it?
Some comments are inline in the diff.
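For reference, a minimal sketch of the suggested preferences gate (it assumes Inkscape's Inkscape::Preferences API and a hypothetical /options/simdfilters/value key; not part of the proposed branch). If the preference is off, the function pointers stay NULL and render_cairo() already falls back to the existing scalar paths:

    #include "preferences.h"  // Inkscape::Preferences

    void MaybeInitializeSIMDFunctions()
    {
        Inkscape::Preferences *prefs = Inkscape::Preferences::get();
        // hypothetical preference key; default on
        if (prefs->getBool("/options/simdfilters/value", true)) {
            InitializeSIMDFunctions();   // sets the GaussianBlur* pointers per CPU arch
        }
        // else: GaussianBlurIIR_Y8 et al. remain NULL, and render_cairo()
        // takes the original gaussian_pass_IIR/gaussian_pass_FIR code paths
    }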