Merge lp:~adamreichold/qpdfview/extended-text-selection into lp:qpdfview

Proposed by Adam Reichold
Status: Rejected
Rejected by: Adam Reichold
Proposed branch: lp:~adamreichold/qpdfview/extended-text-selection
Merge into: lp:qpdfview
Diff against target: 331 lines (+138/-46)
6 files modified
sources/documentview.cpp (+2/-2)
sources/model.h (+10/-1)
sources/pageitem.cpp (+39/-2)
sources/pageitem.h (+3/-0)
sources/pdfmodel.cpp (+83/-40)
sources/pdfmodel.h (+1/-1)
To merge this branch: bzr merge lp:~adamreichold/qpdfview/extended-text-selection
Reviewer Review Type Date Requested Status
Benjamin Eltzner Pending
Razi Alavizadeh Pending
Review via email: mp+264322@code.launchpad.net

Description of the change

This branch extends the model to allow for extended text selections, i.e. an arbitrary boundary and the text contained within it, and implements this within the PDF model. It also modifies the PageItem so that it makes use of this selection if available, preferring it over the previous method of text extraction and also highlighting the boundary during the selection process.

The extended text extraction itself uses the same Poppler API as the cached text extraction and relies on the fact that the text boxes are provided in "approximately reading order" by Poppler already. Hence this might be much more complicated to implement in the other backends, e.g. DjVuLibre, but therefore it should stay optional IMHO.

Some of the points that need to be discussed IMHO are:

* Do we want to always use it if made available by the model, or maybe extend the copy-to-clipboard pop-up menu with additional options to explicitly select the method of text extraction instead?

* Is the performance sufficient or do we need to devise a more incremental computation? It feels alright on my machine but I did come to wrong conclusions based on this in the past. This also affects whether this should be put behind a configuration setting or into a separate rubber-band-mode.

* This currently works on a per-text-box, i.e. approximately per-word, level and could be extended to work on a per-character level but I suspect with significant overhead for either tracking character runs or already aggregating runs into a boundary and contained text.

* Should the boundary be processed further to provide more connected highlighting and if so how, e.g. computing (an approximation of) the convex hull?

To post a comment you must log in.
Revision history for this message
Adam Reichold (adamreichold) wrote :

Also note that this does not make any use of tagged PDF features and hence its reliability w.r.t. complicated layouts is probably questionable...

1935. By Adam Reichold

Since PdfPage::text is now a fallback interface we can remove the separate Page::cachedText entry point which all other plugins forwarded to Page::text anyway.

1936. By Adam Reichold

Some minor cosmetic clean-ups to PDF text extraction and selection.

Revision history for this message
Razi Alavizadeh (srazi) wrote :

Hello Adam,

Also note that this does not make any use of tagged PDF features and hence
> its reliability w.r.t. complicated layouts is probably questionable...
>

Is related? With this branch it selects text of RTL documents in reversed
order.

CORRECT: این عشق خانمانسوز
INCORRECT: زوسنامناخ قشع نیا
I attached a PDF in Persian for testing.

Best regards,
Razi.

2015-07-09 22:46 GMT+04:30 Adam Reichold <email address hidden>:

> Also note that this does not make any use of tagged PDF features and hence
> its reliability w.r.t. complicated layouts is probably questionable...
> --
>
> https://code.launchpad.net/~adamreichold/qpdfview/extended-text-selection/+merge/264322
> You are requested to review the proposed merge of
> lp:~adamreichold/qpdfview/extended-text-selection into lp:qpdfview.
>

--
Alavizadeh, Sayed Razi
My Blog: http://pozh.org
Saaghar (نرم‌افزار شعر): http://saaghar.pozh.org/
Saaghar Fan Page: http://www.facebook.com/saaghar.p
Saaghar Mailing List: http://groups.google.com/group/saaghar

Revision history for this message
Adam Reichold (adamreichold) wrote :

Hello,

> Is related? With this branch it selects text of RTL documents in reversed
> order.
>
> CORRECT: این عشق خانمانسوز
> INCORRECT: زوسنامناخ قشع نیا
> I attached a PDF in Persian for testing.

I don't think this is related to tagged PDF since the previous method of text extraction did not use it as well. But it is obviously incorrect and a result of loosing some of the heuristics which where implemented within Poppler::Page::text and Poppler::Page::search.

What I wonder is though, this should affect the extended search dock in the same way as it uses the same method of text extraction? At least I don't understand how there could be a difference in behaviour...

From how I understand these things, "fixing" this by just calling QString::reverse if Document::wantsRightToLeft() will not be correct?

Best regards, Adam.

Revision history for this message
Benjamin Eltzner (b-eltzner) wrote :

Hi Adam,

I noticed the following:

* When selecting several lines of text, line breaks are not preserved
and not replaced by blank spaces, which leads to word concatenations.

* The selection tool is still a "frame", so when the text portion to be
selected starts to the right of the position where it ends one will
still have to copy unwanted text bits. Maybe the selection tool could be
adjusted to select "line wise" from the "button push position" to the
"button release position".

* I did not have any trouble concerning performance.

I agree that it would be favorable to be able to use the tagged PDF
features, however it seems that the poppler project maintainers are not
particularly interested in these. So I am really glad you propose this
implementation and are willing to take the burden of maintenance.

Cheers, Benjamin

Revision history for this message
Razi Alavizadeh (srazi) wrote :

Hello Adam,

What I wonder is though, this should affect the extended search dock in the
> same way as it uses the same method of text extraction? At least I don't
> understand how there could be a difference in behaviour...
>

Yes, indeed the extended search dock is affected, too. But I didn't report
it, because it is the bug of Poppler, and also the issue is bigger when we
speak about searching:

If you want search for RTL phrase "ABC" by Poppler then you have to reverse
it, i.e. you have to search for "CBA" and if it's a mixed phrase i.e "ABC
123" that "123" is an LTR phrase then you have to search for "CBA 123". see
the long standing bug [1].

From how I understand these things, "fixing" this by just calling
> QString::reverse if Document::wantsRightToLeft() will not be correct?
>
Well I have some problem and points here:
1- QString::reverse() (that we should implement it) is a solution when text
is RTL only.

2- What is reterned by "Document::wantsRightToLeft()" if there is both RTL
and LTR content?

3 And my problem: I compiled Poppler 0.29 and 0.34 but when linking with
both of them there is error about symbol not found :| for both of them it
says "m_document->textDirection()" is not found and for version 0.34 it
says "
Poppler::Page::SearchFlags" is not found, maybe its because I disable
"libcairo" when compiling? or Do I have set some flags?

Btw, I wrote a patch that fixes problem with extracting/searching [2].
Don't you think we can port this patch to Poppler's code as a patch for [1]?

[1] https://bugs.freedesktop.org/show_bug.cgi?id=2981
[2] https://code.launchpad.net/~srazi/qpdfview/extended-text-selection

Best Regards,
Razi.

2015-07-11 11:13 GMT+04:30 Adam Reichold <email address hidden>:

> Hello,
>
> > Is related? With this branch it selects text of RTL documents in reversed
> > order.
> >
> > CORRECT: این عشق خانمانسوز
> > INCORRECT: زوسنامناخ قشع نیا
> > I attached a PDF in Persian for testing.
>
> I don't think this is related to tagged PDF since the previous method of
> text extraction did not use it as well. But it is obviously incorrect and a
> result of loosing some of the heuristics which where implemented within
> Poppler::Page::text and Poppler::Page::search.
>
> What I wonder is though, this should affect the extended search dock in
> the same way as it uses the same method of text extraction? At least I
> don't understand how there could be a difference in behaviour...
>
> >From how I understand these things, "fixing" this by just calling
> QString::reverse if Document::wantsRightToLeft() will not be correct?
>
> Best regards, Adam.
> --
>
> https://code.launchpad.net/~adamreichold/qpdfview/extended-text-selection/+merge/264322
> You are requested to review the proposed merge of
> lp:~adamreichold/qpdfview/extended-text-selection into lp:qpdfview.
>

--
Alavizadeh, Sayed Razi
My Blog: http://pozh.org
Saaghar (نرم‌افزار شعر): http://saaghar.pozh.org/
Saaghar Fan Page: http://www.facebook.com/saaghar.p
Saaghar Mailing List: http://groups.google.com/group/saaghar

Revision history for this message
Adam Reichold (adamreichold) wrote :

Hello Razi,

Am 13.07.2015 um 17:33 schrieb S. Razi Alavizadeh:
> Yes, indeed the extended search dock is affected, too. But I didn't report
> it, because it is the bug of Poppler, and also the issue is bigger when we
> speak about searching:
>
> If you want search for RTL phrase "ABC" by Poppler then you have to reverse
> it, i.e. you have to search for "CBA" and if it's a mixed phrase i.e "ABC
> 123" that "123" is an LTR phrase then you have to search for "CBA 123". see
> the long standing bug [1].

> Btw, I wrote a patch that fixes problem with extracting/searching [2].
> Don't you think we can port this patch to Poppler's code as a patch for [1]?

I will look into what it would take to get this into Poppler, especially
reading up on the history of [1] seems necessary before submitting
anything. But could you rebase your work on qpdfview's trunk as a
separate merge request? (Because I will reject this merge and abandon
the branch for which I'll give an explanation in a separate message.)

>>From how I understand these things, "fixing" this by just calling
>> QString::reverse if Document::wantsRightToLeft() will not be correct?
>>
> Well I have some problem and points here:
> 1- QString::reverse() (that we should implement it) is a solution when text
> is RTL only.
>
> 2- What is reterned by "Document::wantsRightToLeft()" if there is both RTL
> and LTR content?

This is exactly what I was thinking of. So your bidirectional method
seems like the way to go. Is this a full implementation of the BiDi
algorithm?

> 3 And my problem: I compiled Poppler 0.29 and 0.34 but when linking with
> both of them there is error about symbol not found :| for both of them it
> says "m_document->textDirection()" is not found and for version 0.34 it
> says "
> Poppler::Page::SearchFlags" is not found, maybe its because I disable
> "libcairo" when compiling? or Do I have set some flags?

Have you checked that qpdfview is building against the correct Poppler
include path? This sounds like you try to build with "HAS_POPPLER_31"
but without adjustment of "INCLUDEPATH"? It should definitely be
independent of whether you enable Cario integration or not.

Best regards, Adam.

Revision history for this message
Adam Reichold (adamreichold) wrote :

Hello Benjamin,

Am 12.07.2015 um 13:30 schrieb Benjamin Eltzner:
> * When selecting several lines of text, line breaks are not preserved
> and not replaced by blank spaces, which leads to word concatenations.

Looking into this, Poppler::Page::textList gives us absolutely no
information other than the text (in whatever granularity) and its
bounding boxes. Everything else has to be synthesized from that. From
looking at how for example Okular does that, I do definitely not want to
implement comparable heuristics and will rather try to work on the
tagged PDF support in the Qt frontend of Poppler instead. Hence I will
reject this merge and abandon the related branch.

> * The selection tool is still a "frame", so when the text portion to be
> selected starts to the right of the position where it ends one will
> still have to copy unwanted text bits. Maybe the selection tool could be
> adjusted to select "line wise" from the "button push position" to the
> "button release position".

I agree, but it does not matter IMHO because of the above.

> I agree that it would be favorable to be able to use the tagged PDF
> features, however it seems that the poppler project maintainers are not
> particularly interested in these. So I am really glad you propose this
> implementation and are willing to take the burden of maintenance.

I do not think that favour has anything to do with it, but rather a
simple shortage of contributed labour. Tagged PDF support is a
significant project that puts considerable strain on the resources of
the Poppler project.

Best regards, Adam.

Revision history for this message
Razi Alavizadeh (srazi) wrote :

Hello Adam,

> I will look into what it would take to get this into Poppler, especially
> reading up on the history of [1] seems necessary before submitting
> anything. But could you rebase your work on qpdfview's trunk as a
> separate merge request? (Because I will reject this merge and abandon
> the branch for which I'll give an explanation in a separate message.)
>
Because you find existing patch for [1] a good start point then I don't
send the patch, feel free to ask it on your need.

This is exactly what I was thinking of. So your bidirectional method
> seems like the way to go. Is this a full implementation of the BiDi
> algorithm?
>
I think you already know that my patch is not bidi-algorithm and
bidi-algorithm is a very complicated thing, you can find its implementation
in QTSRC/gui/text/qtextengine.cpp.

Have you checked that qpdfview is building against the correct Poppler
> include path? This sounds like you try to build with "HAS_POPPLER_31"
> but without adjustment of "INCLUDEPATH"? It should definitely be
> independent of whether you enable Cario integration or not.
>
Interestingly it was disappeared when compiling against Qt5.

Best Regards,
Razi.

2015-07-13 21:41 GMT+04:30 Adam Reichold <email address hidden>:

> The proposal to merge lp:~adamreichold/qpdfview/extended-text-selection
> into lp:qpdfview has been updated.
>
> Status: Needs review => Rejected
>
> For more details, see:
>
> https://code.launchpad.net/~adamreichold/qpdfview/extended-text-selection/+merge/264322
> --
> You are requested to review the proposed merge of
> lp:~adamreichold/qpdfview/extended-text-selection into lp:qpdfview.
>

--
Alavizadeh, Sayed Razi
My Blog: http://pozh.org
Saaghar (نرم‌افزار شعر): http://saaghar.pozh.org/
Saaghar Fan Page: http://www.facebook.com/saaghar.p
Saaghar Mailing List: http://groups.google.com/group/saaghar

Unmerged revisions

1936. By Adam Reichold

Some minor cosmetic clean-ups to PDF text extraction and selection.

1935. By Adam Reichold

Since PdfPage::text is now a fallback interface we can remove the separate Page::cachedText entry point which all other plugins forwarded to Page::text anyway.

1934. By Adam Reichold

At least simplify the boundary and text of extended PDF selections.

1933. By Adam Reichold

Use proper text selection if made available by the model.

1932. By Adam Reichold

Extend model to allow for proper text selections

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'sources/documentview.cpp'
2--- sources/documentview.cpp 2015-07-05 14:37:58 +0000
3+++ sources/documentview.cpp 2015-07-09 19:23:47 +0000
4@@ -936,8 +936,8 @@
5
6 const QRectF surroundingRect(x, rect.top(), width, rect.height());
7
8- const QString& matchedText = m_pages.at(page - 1)->cachedText(rect);
9- const QString& surroundingText = m_pages.at(page - 1)->cachedText(surroundingRect);
10+ const QString& matchedText = m_pages.at(page - 1)->text(rect);
11+ const QString& surroundingText = m_pages.at(page - 1)->text(surroundingRect);
12
13 return qMakePair(matchedText, surroundingText);
14 }
15
16=== modified file 'sources/model.h'
17--- sources/model.h 2015-06-27 06:35:40 +0000
18+++ sources/model.h 2015-07-09 19:23:47 +0000
19@@ -65,6 +65,15 @@
20
21 };
22
23+ struct Selection
24+ {
25+ QPainterPath boundary;
26+ QString text;
27+
28+ Selection() : boundary(), text() {}
29+
30+ };
31+
32 class Annotation : public QObject
33 {
34 Q_OBJECT
35@@ -115,7 +124,7 @@
36 virtual QList< Link* > links() const { return QList< Link* >(); }
37
38 virtual QString text(const QRectF& rect) const { Q_UNUSED(rect); return QString(); }
39- virtual QString cachedText(const QRectF& rect) const { return text(rect); }
40+ virtual Selection* selectedText(const QRectF& rect) const { Q_UNUSED(rect); return 0; }
41
42 virtual QList< QRectF > search(const QString& text, bool matchCase, bool wholeWords) const { Q_UNUSED(text); Q_UNUSED(matchCase); Q_UNUSED(wholeWords); return QList< QRectF >(); }
43
44
45=== modified file 'sources/pageitem.cpp'
46--- sources/pageitem.cpp 2015-06-02 18:15:29 +0000
47+++ sources/pageitem.cpp 2015-07-09 19:23:47 +0000
48@@ -75,6 +75,7 @@
49 m_formFields(),
50 m_rubberBandMode(ModifiersMode),
51 m_rubberBand(),
52+ m_selection(),
53 m_annotationOverlay(),
54 m_formFieldOverlay(),
55 m_renderParam(),
56@@ -457,6 +458,12 @@
57 {
58 m_rubberBand = QRectF(event->pos(), QSizeF());
59
60+ if(m_rubberBandMode == CopyToClipboardMode)
61+ {
62+ const QRectF rect = m_transform.inverted().mapRect(m_rubberBand).normalized();
63+ m_selection.reset(m_page->selectedText(rect));
64+ }
65+
66 emit rubberBandStarted();
67
68 update();
69@@ -561,6 +568,12 @@
70 {
71 m_rubberBand.setBottomRight(event->pos());
72
73+ if(m_rubberBandMode == CopyToClipboardMode)
74+ {
75+ const QRectF rect = m_transform.inverted().mapRect(m_rubberBand).normalized();
76+ m_selection.reset(m_page->selectedText(rect));
77+ }
78+
79 update();
80
81 event->accept();
82@@ -595,6 +608,8 @@
83 m_rubberBandMode = ModifiersMode;
84 m_rubberBand = QRectF();
85
86+ m_selection.reset();
87+
88 emit rubberBandFinished();
89
90 update();
91@@ -730,6 +745,18 @@
92
93 void PageItem::copyToClipboard(const QPoint& screenPos)
94 {
95+ QString text;
96+
97+ if(m_selection)
98+ {
99+ text = m_selection->text;
100+ }
101+ else
102+ {
103+ const QRectF rect = m_transform.inverted().mapRect(m_rubberBand).normalized();
104+ text = m_page->text(rect);
105+ }
106+
107 QMenu menu;
108
109 QAction* copyTextAction = menu.addAction(tr("Copy &text"));
110@@ -737,8 +764,6 @@
111 const QAction* copyImageAction = menu.addAction(tr("Copy &image"));
112 const QAction* saveImageToFileAction = menu.addAction(tr("Save image to &file..."));
113
114- const QString text = m_page->text(m_transform.inverted().mapRect(m_rubberBand));
115-
116 copyTextAction->setVisible(!text.isEmpty());
117 selectTextAction->setVisible(!text.isEmpty() && QApplication::clipboard()->supportsSelection());
118
119@@ -1271,6 +1296,18 @@
120
121 painter->restore();
122 }
123+
124+ if(m_selection)
125+ {
126+ painter->save();
127+
128+ painter->setTransform(m_transform, true);
129+ painter->setCompositionMode(QPainter::CompositionMode_Multiply);
130+
131+ painter->fillPath(m_selection->boundary, QBrush(s_settings->pageItem().highlightColor()));
132+
133+ painter->restore();
134+ }
135 }
136
137 } // qpdfview
138
139=== modified file 'sources/pageitem.h'
140--- sources/pageitem.h 2015-06-01 19:28:55 +0000
141+++ sources/pageitem.h 2015-07-09 19:23:47 +0000
142@@ -37,6 +37,7 @@
143 namespace Model
144 {
145 struct Link;
146+struct Selection;
147 class Annotation;
148 class FormField;
149 class Page;
150@@ -166,6 +167,8 @@
151 RubberBandMode m_rubberBandMode;
152 QRectF m_rubberBand;
153
154+ QScopedPointer< Model::Selection > m_selection;
155+
156 void copyToClipboard(const QPoint& screenPos);
157 void addAnnotation(const QPoint& screenPos);
158
159
160=== modified file 'sources/pdfmodel.cpp'
161--- sources/pdfmodel.cpp 2015-06-27 06:35:40 +0000
162+++ sources/pdfmodel.cpp 2015-07-09 19:23:47 +0000
163@@ -150,11 +150,59 @@
164 typedef QSharedPointer< Poppler::TextBox > TextBox;
165 typedef QList< TextBox > TextBoxList;
166
167-QCache< const PdfPage*, TextBoxList > textCache(1 << 12);
168+QCache< const Poppler::Page*, TextBoxList > textCache(1 << 12);
169 QMutex textCacheMutex;
170
171 #define LOCK_TEXT_CACHE QMutexLocker mutexLocker(&textCacheMutex);
172
173+TextBoxList textBoxes(QMutex* mutex, const Poppler::Page* page)
174+{
175+ {
176+ LOCK_TEXT_CACHE
177+
178+ if(const TextBoxList* object = textCache.object(page))
179+ {
180+ return *object;
181+ }
182+ }
183+
184+ TextBoxList textBoxes;
185+
186+ {
187+ QMutexLocker mutexLocker(mutex);
188+
189+ foreach(Poppler::TextBox* textBox, page->textList())
190+ {
191+ textBoxes.append(TextBox(textBox));
192+ }
193+ }
194+
195+ {
196+ LOCK_TEXT_CACHE
197+
198+ textCache.insert(page, new TextBoxList(textBoxes), textBoxes.count());
199+ }
200+
201+ return textBoxes;
202+}
203+
204+inline void extendSelection(Selection* selection, const TextBox& textBox)
205+{
206+ selection->boundary.addRect(textBox->boundingBox());
207+ selection->text.append(textBox->text());
208+
209+ if(textBox->hasSpaceAfter())
210+ {
211+ selection->text.append(QLatin1Char(' '));
212+ }
213+}
214+
215+inline void simplifySelection(Selection* selection)
216+{
217+ selection->boundary = selection->boundary.simplified();
218+ selection->text = selection->text.simplified();
219+}
220+
221 } // anonymous
222
223 namespace qpdfview
224@@ -293,7 +341,7 @@
225 {
226 LOCK_TEXT_CACHE
227
228- textCache.remove(this);
229+ textCache.remove(m_page);
230 }
231
232 delete m_page;
233@@ -420,46 +468,9 @@
234
235 QString PdfPage::text(const QRectF& rect) const
236 {
237- LOCK_PAGE
238-
239- return m_page->text(rect).simplified();
240-}
241-
242-QString PdfPage::cachedText(const QRectF& rect) const
243-{
244- bool wasCached = false;
245- TextBoxList textBoxes;
246-
247- {
248- LOCK_TEXT_CACHE
249-
250- if(const TextBoxList* object = textCache.object(this))
251- {
252- wasCached = true;
253-
254- textBoxes = *object;
255- }
256- }
257-
258- if(!wasCached)
259- {
260- {
261- LOCK_PAGE
262-
263- foreach(Poppler::TextBox* textBox, m_page->textList())
264- {
265- textBoxes.append(TextBox(textBox));
266- }
267- }
268-
269- LOCK_TEXT_CACHE
270-
271- textCache.insert(this, new TextBoxList(textBoxes), textBoxes.count());
272- }
273-
274 QString text;
275
276- foreach(const TextBox& textBox, textBoxes)
277+ foreach(const TextBox& textBox, textBoxes(m_mutex, m_page))
278 {
279 if(!rect.intersects(textBox->boundingBox()))
280 {
281@@ -485,6 +496,38 @@
282 return text.simplified();
283 }
284
285+Selection* PdfPage::selectedText(const QRectF& rect) const
286+{
287+ Selection* selection = new Selection();
288+
289+ bool firstRun = false;
290+ TextBoxList currentRun;
291+
292+ foreach(const TextBox& textBox, textBoxes(m_mutex, m_page))
293+ {
294+ if(textBox->boundingBox().intersects(rect))
295+ {
296+ foreach(const TextBox& textBox, currentRun)
297+ {
298+ extendSelection(selection, textBox);
299+ }
300+
301+ extendSelection(selection, textBox);
302+
303+ firstRun = true;
304+ currentRun.clear();
305+ }
306+ else if(firstRun)
307+ {
308+ currentRun.append(textBox);
309+ }
310+ }
311+
312+ simplifySelection(selection);
313+
314+ return selection;
315+}
316+
317 QList< QRectF > PdfPage::search(const QString& text, bool matchCase, bool wholeWords) const
318 {
319 LOCK_PAGE
320
321=== modified file 'sources/pdfmodel.h'
322--- sources/pdfmodel.h 2015-01-27 20:41:38 +0000
323+++ sources/pdfmodel.h 2015-07-09 19:23:47 +0000
324@@ -115,7 +115,7 @@
325 QList< Link* > links() const;
326
327 QString text(const QRectF& rect) const;
328- QString cachedText(const QRectF& rect) const;
329+ Selection* selectedText(const QRectF& rect) const;
330
331 QList< QRectF > search(const QString& text, bool matchCase, bool wholeWords) const;
332

Subscribers

People subscribed via source and target branches

to all changes: