[15.10 REGRESSION] Randomly wrongly detects files as binary

Bug #1535458 reported by Teo
This bug affects 3 people
Affects         Status         Importance   Assigned to   Milestone
grep (Debian)   Fix Released   Unknown
grep (Ubuntu)   Confirmed      High         Unassigned
Nominated for Wily by Mathew Hodson
Nominated for Xenial by Alberto Salvia Novella

Bug Description

I have a folder with a bunch of subfolders and several hundred or thousand files, most of them PHP files (obviously text).

I often use grep recursively, like this:
  $ grep -R somepattern *

Since the upgrade from 15.04 to 15.10, it often happens that a lot of text files are wrongly treated as binary. That means that, when a match is found, instead of getting the normal output which would show the file name and the matching line (with the matching substring highlighted), I get the bogus message:
  binary file whatever.php matches

Just to be clear: in one invocation of grep -R, I get mixed output with a lot of matches shown in the expected way and quite a few matches shown in the wrong way, even though ALL matching files are text files.
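
As a stopgap, here is a minimal sketch assuming GNU grep. The `-a` option (long form `--binary-files=text`) makes grep treat every file as text, so the matching line is printed instead of the "binary file ... matches" notice. The directory and file names below are made up for illustration.

```shell
# Hypothetical demo tree: one PHP file containing a single ISO-8859-1 byte (0xF6).
demo_dir=$(mktemp -d)
printf '<?php // l\366sning somepattern\n' > "$demo_dir/legacy.php"

# -a (same as --binary-files=text) prints the matching line
# even when grep would otherwise classify the file as binary.
result=$(cd "$demo_dir" && grep -R -a somepattern .)
printf '%s\n' "$result"
```

This only hides the symptom: real binary files in the tree will also be printed as text.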

This worked as expected before the upgrade from 15.04 to 15.10.

This is a HUGE issue that makes it impossible to do everyday developing work.

Until you fix it, please roll back grep to the previous version, because it is unusable.

Thanks for your attention

ProblemType: Bug
DistroRelease: Ubuntu 15.10
Package: grep 2.21-2
ProcVersionSignature: Ubuntu 4.2.0-23.28-generic 4.2.6
Uname: Linux 4.2.0-23-generic x86_64
NonfreeKernelModules: nvidia
ApportVersion: 2.19.1-0ubuntu5
Architecture: amd64
CurrentDesktop: Unity
Date: Mon Jan 18 22:43:51 2016
InstallationDate: Installed on 2013-10-11 (829 days ago)
InstallationMedia: Ubuntu 13.04 "Raring Ringtail" - Release amd64 (20130424)
SourcePackage: grep
UpgradeStatus: Upgraded to wily on 2016-01-18 (0 days ago)

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in grep (Ubuntu):
status: New → Confirmed
Revision history for this message
php4fan (php4fan) wrote :

For god's sake, this regression is HUGE.
It makes the life of any developer a hell and it's been MONTHS.

Can't you at least release an update rolling back to a non-broken version until this is fixed??

Revision history for this message
Brian Murray (brian-murray) wrote :

Is this perhaps related to the following debian bug?

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=799956

Revision history for this message
teo1978 (teo8976) wrote :

Yep, looks like it's that one.
So it's fixed upstream. Will it take long to land on Ubuntu?
Otherwise, an update rolling back to the grep version preceding the regression (which appears to be known) should be urgently released in the meantime. This is a pretty critical bug.

It's astonishing how poor Ubuntu's QA is.

Revision history for this message
Brian Murray (brian-murray) wrote :

Actually, the following bug seems more likely the cause of the problem.

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=22838

Revision history for this message
teo1978 (teo8976) wrote :

Are you sure this is the same as 1547466 ??

1547466 describes mode switching to binary in the middle of the file.

What I observe is that some text files are treated as binary. I never get an output like the one described in 1547466, where you get some matches as expected and then the output gets interrupted in the middle of the file with the phrase "binary file xxx matches". I always get either the expected text-mode output for the whole file or only the "binary match" phrase. The same files always produce the same output, consistently.

To me this looks a lot like https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=799956 and not much like issue 1547466.

Every single file that exhibits the issue that I have examined is ISO-8859-15. I have only looked at a few randomly. And a few that don't exhibit the issue which I have looked at randomly are all UTF-8. (to look at it I use Gedit and do a Save As; the encoding the file is in is, or is supposed to be, selected by default).
However, I have created an ISO-8859-15 file from scratch from Gedit and it does not reproduce the issue.

Revision history for this message
teo1978 (teo8976) wrote :

However,

$ LANG=C.UTF-8 grep somestring some_iso8859_file.txt # reproduces the issue ("binary file matches")
$ LANG=C grep somestring some_iso8859_file.txt # expected (text) output

$ LANG=C.UTF-8 grep somestring some_utf8_file.txt # expected (text) output
$ LANG=C grep somestring some_utf8_file.txt # expected (text) output

I don't know if this is consistent with issue 1547466
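
To find which files in a tree are not valid UTF-8 (the ones that seem to trigger the message under a UTF-8 locale), something like the following rough sketch could be used. It assumes an `iconv` that exits non-zero on an invalid input sequence. The function names are hypothetical.

```shell
# Succeeds only if the whole file decodes as UTF-8
# (iconv exits non-zero on an invalid byte sequence).
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Print every regular file under directory $1 that is NOT valid UTF-8.
list_non_utf8() {
    find "$1" -type f | while IFS= read -r path; do
        is_valid_utf8 "$path" || printf '%s\n' "$path"
    done
}

# Example: list_non_utf8 .
```

Pure ASCII files pass the check, since ASCII is a subset of UTF-8.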

Revision history for this message
teo1978 (teo8976) wrote :

@Brian Murray, I resubscribed you because you marked this issue as duplicate of #1547466, I asked you if you could confirm because that seems doubtful and you didn't reply, and now at 1535458 they say it only affects xenial, while this one I am observing on wily.

Revision history for this message
teo1978 (teo8976) wrote :

(btw sorry for subscribing you to the other bug by mistake)

Revision history for this message
Alberto Salvia Novella (es20490446e) wrote :

@ teo

If you see that a bug is affecting a release, it is important to enter its first name into the tag list. So I will notice it and promote it.

Changed in grep (Ubuntu):
importance: Undecided → High
tags: added: xenial
Revision history for this message
teo1978 (teo8976) wrote :

Sorry, not sure what you mean exactly by "affecting a release".

This issue appeared at some point on wily (with some update, NOT at dist-upgrade) and I wonder how it could be ignored for so long, when it was first reported AND after it turned out that it had wrongly been marked as duplicate.

Revision history for this message
Mathew Hodson (mhodson) wrote :

I don't think this particular bug should affect Xenial. The Debian bug says this was fixed in grep/2.23-1

The related issue in Xenial should be fixed by bug 1547466

tags: added: regression-release
tags: removed: xenial
Mathew Hodson (mhodson)
Changed in grep (Ubuntu):
status: Confirmed → Fix Committed
Revision history for this message
teo1978 (teo8976) wrote :

But it definitely affects Wily

Revision history for this message
teo1978 (teo8976) wrote :

And again, I don't think this is the same as bug 1547466 at all, as per comment https://bugs.launchpad.net/ubuntu/+source/grep/+bug/1547466/comments/19

Revision history for this message
sudodus (nio-wiklund) wrote :

No, it *should not*, but it *does*, at least for me in an updated & dist-upgraded 32-bit version (up to date today).

Using the attached file:

nio@xenial32 ~ $ cat seen-binary-by-grep.txt
vMgs ingen l�sning - the Swedish character o-umlaut
osmak ingen l�sning
smak lösningen - the solution
nio@xenial32 ~ $ grep ning seen-binary-by-grep.txt
Binär fil seen-binary-by-grep.txt matchar
nio@xenial32 ~ $ grep -a ning seen-binary-by-grep.txt
vMgs ingen l�sning - the Swedish character o-umlaut
osmak ingen l�sning
smak lösningen - the solution
nio@xenial32 ~ $ tail -n1 seen-binary-by-grep.txt|grep ning
smak lösningen - the solution
nio@xenial32 ~ $ head -n2 seen-binary-by-grep.txt|grep ning
Binär fil (standard in) matchar
nio@xenial32 ~ $ head -n2 seen-binary-by-grep.txt|grep -a ning
vMgs ingen l�sning - the Swedish character o-umlaut
osmak ingen l�sning
nio@xenial32 ~ $
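
For anyone without the attachment, a file like this can be recreated with `printf`. This is a sketch under an assumed byte layout: the first two o-umlauts are the single ISO-8859-1 byte 0xF6 and the last one is the UTF-8 pair 0xC3 0xB6. The output path is arbitrary.

```shell
# Recreate a mixed-encoding test file (assumed byte layout, see above).
sample="${TMPDIR:-/tmp}/seen-binary-by-grep.txt"
{
    printf 'vMgs ingen l\366sning - the Swedish character o-umlaut\n'  # 0xF6: ISO-8859-1
    printf 'osmak ingen l\366sning\n'                                   # 0xF6: ISO-8859-1
    printf 'smak l\303\266sningen - the solution\n'                     # 0xC3 0xB6: UTF-8
} > "$sample"

# Count matching lines with binary detection switched off (-a).
match_count=$(grep -a -c ning "$sample")
echo "$match_count"
```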

Revision history for this message
sudodus (nio-wiklund) wrote :

Adding to the previous post: This bug does affect Xenial alias 16.04 LTS, at least with typical text files that contain characters that belong to my country's standard characters but are not part of the English standard characters.

I think grep should not be sensitive to such characters and conclude from them that the file is binary.

Changed in grep (Debian):
status: Unknown → Fix Released
teo1978 (teo8976)
Changed in grep (Ubuntu):
status: Fix Committed → Confirmed
Revision history for this message
Mathew Hodson (mhodson) wrote :

sudodus, what version of grep are you using? grep 2.25-1~16.04.1 was released in bug 1547466 and it may fix your issue.

Revision history for this message
sudodus (nio-wiklund) wrote :

nio@xenial32 ~ $ grep -V
grep (GNU grep) 2.24
Copyright © 2016 Free Software Foundation, Inc.
Licens GPLv3+: GNU GPL version 3 eller senare <http://gnu.org/licenses/gpl.html>
Det här är fri programvara: du får ändra och distribuera den.
Det finns INGEN GARANTI, så långt som tillåts enligt lag.

Skriven av Mike Haertel och andra, se <http://git.sv.gnu.org/cgit/grep.git/tree/AUTHORS>.
nio@xenial32 ~ $ apt-cache policy grep
grep:
  Installerad: 2.24-1
  Kandidat: 2.24-1
  Versionstabell:
 *** 2.24-1 500
        500 http://se.archive.ubuntu.com/ubuntu xenial/main i386 Packages
        100 /var/lib/dpkg/status
nio@xenial32 ~ $

How can I get grep 2.25-1~16.04.1 without messing up my system? Maybe I should test it in a separate system, not my production system.

Revision history for this message
sudodus (nio-wiklund) wrote :

No, I still have problems with a file with Swedish characters. In a live Ubuntu 16.04 LTS I installed grep version 2.25-1~16.04.1, but it did not help, as you can see from the following dialogue in a terminal window.

ubuntu@ubuntu:~$ apt-cache policy grep
grep:
  Installerad: 2.25-1~16.04.1
  Kandidat: 2.25-1~16.04.1
  Versionstabell:
 *** 2.25-1~16.04.1 500
        500 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     2.24-1 500
        500 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages
ubuntu@ubuntu:~$ grep ning seen-binary-by-grep.txt
Binary file seen-binary-by-grep.txt matches
ubuntu@ubuntu:~$ grep -a ning seen-binary-by-grep.txt
vMgs ingen l�sning - the Swedish character o-umlaut
osmak ingen l�sning
smak lösningen - the solution
ubuntu@ubuntu:~$ tail -n1 seen-binary-by-grep.txt |grep ning
smak lösningen - the solution
ubuntu@ubuntu:~$ tail -n2 seen-binary-by-grep.txt |grep ning
Binary file (standard input) matches
ubuntu@ubuntu:~$ tail -n2 seen-binary-by-grep.txt |grep -a ning
osmak ingen l�sning
smak lösningen - the solution
ubuntu@ubuntu:~$

-o-

This works with older versions of grep, so I would call it a regression.

Revision history for this message
Martin Pitt (pitti) wrote :

grep 2.25 only stopped treating files as binary under the "C" locale, as that commonly means "I don't care about the encoding". AFAIK the behaviour did not change if you call it under a proper locale such as sv_SE.UTF-8. If you look at the file when it's encoded in a proper locale, it works:

$ iconv -f iso8859-1 -t utf8 /tmp/seen-binary-by-grep.txt
vMgs ingen lösning - the Swedish character o-umlaut
osmak ingen lösning
smak lösningen - the solution

(Note that the file is broken -- the first two non-ASCII ö characters are encoded in ISO-8859-1, and the last one is UTF-8).

And as described above, binary detection also is disabled under C:

$ LC_CTYPE=C grep ning /tmp/seen-binary-by-grep.txt
vMgs ingen l�sning - the Swedish character o-umlaut
osmak ingen l�sning
smak lösningen - the solution

So this is indeed not the same bug as bug 1547466, I retitled that one to clarify. It's much closer to http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19985.
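
If a file is known to be consistently in one legacy encoding, one way to search it under a UTF-8 locale is to decode it first and grep the result. A minimal sketch follows; `grep_as` is a hypothetical helper name, not part of grep.

```shell
# grep_as: hypothetical helper that decodes FILE from ENCODING to UTF-8
# and searches the converted text.
# usage: grep_as ENCODING PATTERN FILE
grep_as() {
    iconv -f "$1" -t UTF-8 "$3" | grep -- "$2"
}

# Example: grep_as ISO-8859-1 ning some_iso8859_file.txt
```

This does not suit a mixed-encoding file like the test file here, because bytes that are already UTF-8 would be decoded a second time.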

Revision history for this message
sudodus (nio-wiklund) wrote :

I see, so I should use the "C" locale, as that commonly means "I don't care about the encoding".

Thank you for this information :-)

ps/ The 'real' file was not broken, I made this test file with two encodings on purpose for testing. The real file was encoded as iso8859-1, while the default Swedish locale is sv_SE.UTF-8.

I understand that grep works in another way now. one could say that it is more picky, but one could also consider this an improvement, that the file warns, that the file is somehow 'non-standard' /ds

Revision history for this message
teo1978 (teo8976) wrote :

> I see, so I should use the "C" locale,
> as that commonly means "I don't care about the encoding".

That is JUST A WORKAROUND that seems to work, but that shouldn't be needed. This is NOT the expected behavior.

> I understand that grep works in another way now.

Yeah, a wrong one because it has a bug. Which by the way appears to be fixed upstream.

Note that grepping an ISO-8859 file (containing non-ASCII characters) with a UTF-8 locale will usually *but not always* reproduce the issue, so the behavior is not even consistent or predictable.

> one could say that it is more picky, but one could also consider this an improvement,

NOPE NOPE NOPE

> that the file warns, that the file is somehow 'non-standard'

If the intent was to warn about something, the program would print a warning (and then do its job correctly). Treating the file as BINARY, when it is not, is not a sane way to "warn" about anything.

Revision history for this message
Martin Pitt (pitti) wrote :

> Yeah, a wrong one because it has a bug. Which by the way appears to be fixed upstream.

... and apparently regressed again in 2.24 or 2.25..

Revision history for this message
Mathew Hodson (mhodson) wrote :

I am running a Trusty system with grep/2.16-1 and the problem doesn't show up there.

I manually installed grep/2.25-1~16.04.1 and grep/2.25-3 and tested them with the test case in Debian #799956, and they both exhibit the same issue as in the Debian bug.

It does seem that it either regressed again or it wasn't completely fixed the first time.

tags: added: xenial yakkety
description: updated
Mathew Hodson (mhodson)
description: updated
Mathew Hodson (mhodson)
tags: added: testcase