gawk: Odd regexp matching problem if locale's mb_cur_max > 1

Bug #9026 reported by Debian Bug Importer
Affects          Status         Importance   Assigned to   Milestone
gawk (Debian)    Fix Released   Unknown
gawk (Ubuntu)    Invalid        High         Unassigned

Bug Description

Automatically imported from Debian bug report #266519 http://bugs.debian.org/266519

Revision history for this message
Tatsuya Kinoshita (tats) wrote on Mon, 11 Oct 2004 23:29:15 +0900 (JST):

On August 18, 2004 at 2:57PM +0900,
miles (at lsi.nec.co.jp) wrote:

> Package: gawk
> Version: 1:3.1.4-1

> Executing the following line in a shell:
>
> echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP gawk '/[Cc]hangeLog/ { print }'
>
> yields not the expected two lines of output, but instead only the first one:
>
> --- orig/lisp/ChangeLog
>
>
> If the LANG-setting portion is changed to use C, then it works as
> expected (others such as "de" seem to work too):
>
> echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=C gawk '/[Cc]hangeLog/ { print }'
>
> yields:
>
> --- orig/lisp/ChangeLog
> +++ mod/lisp/ChangeLog
>
>
> I'm not sure if the actual encoding has any impact -- ja_JP, ja_JP.utf8,
> and ja_JP.eucjp all exhibit the same problem.

ko_KR, zh_CN, and zh_TW exhibit the same problem. On CJK
locales, this bug makes gawk scripts unusable.

Downgrading gawk to version 1:3.1.3-3 prevents the problem.

Could anyone fix this bug?

Thanks,
--
Tatsuya Kinoshita

Revision history for this message
Fumitoshi UKAI (ukai) wrote on Tue, 12 Oct 2004 01:16:54 +0900:

At Mon, 11 Oct 2004 23:29:15 +0900 (JST),
Tatsuya Kinoshita wrote:

> > Package: gawk
> > Version: 1:3.1.4-1
>
> > Executing the following line in a shell:
> >
> > echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP gawk '/[Cc]hangeLog/ { print }'
> >
> > yields not the expected two lines of output, but instead only the first one:
> >
> > --- orig/lisp/ChangeLog
> >
> >
> > If the LANG-setting portion is changed to use C, then it works as
> > expected (others such as "de" seem to work too):
> >
> > echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=C gawk '/[Cc]hangeLog/ { print }'
> >
> > yields:
> >
> > --- orig/lisp/ChangeLog
> > +++ mod/lisp/ChangeLog
> >
> >
> > I'm not sure if the actual encoding has any impact -- ja_JP, ja_JP.utf8,
> > and ja_JP.eucjp all exhibit the same problem.
>
> ko_KR, zh_CN, and zh_TW exhibit the same problem. On CJK
> locales, this bug makes gawk scripts unusable.
>
> Downgrading gawk to version 1:3.1.3-3 prevents the problem.
>
> Could anyone fix this bug?

One possible workaround is to set GAWK_NO_DFA=1:

 % echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP.eucJP GAWK_NO_DFA=1 gawk '/[Cc]hangeLog/ { print }'
 --- orig/lisp/ChangeLog
 +++ mod/lisp/ChangeLog

I may have found the cause of this bug. The string being matched has
changed, but begin and end still point to the same addresses, so
mblen_buf and inputwcs are not updated.
For example, this patch fixes the problem, but it may slow matching down,
so I think a better fix should be made.

--- dfa.c~ 2004-07-26 23:11:41.000000000 +0900
+++ dfa.c 2004-10-12 01:05:14.000000000 +0900
@@ -2872,13 +2872,14 @@
     {
       int remain_bytes, i;
       buf_begin -= buf_offset;
+#if 0
       if (buf_begin <= (unsigned char const *)begin && (unsigned char const *) end <= buf_end) {
  buf_offset = (unsigned char const *)begin - buf_begin;
  buf_begin = begin;
  buf_end = end;
  goto go_fast;
       }
-
+#endif
       buf_offset = 0;
       buf_begin = begin;
       buf_end = end;

Regards,
Fumitoshi UKAI <email address hidden> / <email address hidden>
Hewlett-Packard Laboratories Japan http://ecardfile.com/id/ukai

Revision history for this message
Fumitoshi UKAI (ukai) wrote on Wed, 13 Oct 2004 00:40:39 +0900:

severity 266519 grave
retitle 266519 gawk: Odd regexp matching problem if locale's mb_cur_max > 1
tags 266519 + patch
thanks

This affects not only CJK locales but every locale whose mb_cur_max is
greater than 1. That includes all UTF-8 locales, such as en_US.UTF-8.
So I think this bug should be considered release-critical.
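
Whether a given locale falls into the affected group can be checked from C. The following is a minimal sketch, not code from gawk: the helper name max_bytes_per_char is hypothetical, and it relies only on the standard setlocale() and MB_CUR_MAX.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of gawk): returns MB_CUR_MAX under the
   named locale, or -1 if that locale cannot be selected. The locale that
   was active on entry is restored before returning. */
int max_bytes_per_char(const char *locale_name)
{
    char saved[256];
    const char *current = setlocale(LC_ALL, NULL);
    int result;

    /* Copy the current locale name: a later setlocale() call may
       overwrite the string that the query returned. */
    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    result = (int) MB_CUR_MAX;
    setlocale(LC_ALL, saved);
    return result;
}
```

Under the C locale this returns 1. For a UTF-8 locale such as en_US.UTF-8 it should return a value greater than 1, provided that locale is installed on the system.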

This patch solves the problem.
(Explanation:
 begin and end delimit the input string, and the portion disabled below
 checks whether the input string is the same as the previous one; if so,
 it skips updating the multibyte-related buffers. However, gawk reuses one
 buffer for every input line, so begin and end can point to the same
 addresses while the contents differ from the previous line.)

--- dfa.c~ 2004-07-26 23:11:41.000000000 +0900
+++ dfa.c 2004-10-12 01:05:14.000000000 +0900
@@ -2872,13 +2872,14 @@
     {
       int remain_bytes, i;
       buf_begin -= buf_offset;
+#if 0
       if (buf_begin <= (unsigned char const *)begin && (unsigned char const *) end <= buf_end) {
  buf_offset = (unsigned char const *)begin - buf_begin;
  buf_begin = begin;
  buf_end = end;
  goto go_fast;
       }
-
+#endif
       buf_offset = 0;
       buf_begin = begin;
       buf_end = end;

Regards,
Fumitoshi UKAI
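
The failure mode described in this message can be reproduced in miniature. The sketch below is not the dfa.c code. It assumes a simplified cache that requires an exact address match (dfa.c tests range containment), and count_letters, struct addr_cache and letters_cached_by_address are hypothetical names standing in for the multibyte preprocessing and its cached buffers.

```c
#include <ctype.h>
#include <stddef.h>

/* Stand-in for expensive per-line preprocessing (an assumption for this
   sketch): counts the ASCII letters in [begin, end). */
size_t count_letters(const char *begin, const char *end)
{
    size_t n = 0;
    const char *p;

    for (p = begin; p < end; p++)
        if (isalpha((unsigned char) *p))
            n++;
    return n;
}

/* Cache keyed only by the address range of the last input. */
struct addr_cache {
    const char *begin;
    const char *end;
    size_t value;
    int valid;
};

/* Flawed on purpose: the cached value is trusted whenever the addresses
   match, even if the caller has refilled the buffer in the meantime. */
size_t letters_cached_by_address(struct addr_cache *c,
                                 const char *begin, const char *end)
{
    if (c->valid && c->begin == begin && c->end == end)
        return c->value;              /* fast path, possibly stale */

    c->begin = begin;
    c->end = end;
    c->value = count_letters(begin, end);
    c->valid = 1;
    return c->value;
}
```

If one buffer is filled with "abc---", queried, then overwritten with "++++xy" and queried again over the same range, the second call returns the stale count 3, while a direct call to count_letters gives 2. That is the same kind of mismatch as the second input line failing to match in the original report.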

Revision history for this message
Debian Bug Importer (debzilla) wrote :

Message-Id: <20040818055735.77AC7431@mctpc71>
Date: Wed, 18 Aug 2004 14:57:35 +0900
From: Miles Bader <email address hidden>
To: Debian Bug Tracking System <email address hidden>
Subject: gawk: Odd regexp matching problem if LANG=ja_JP

Package: gawk
Version: 1:3.1.4-1
Severity: normal

Executing the following line in a shell:

   echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP gawk '/[Cc]hangeLog/ { print }'

yields not the expected two lines of output, but instead only the first one:

   --- orig/lisp/ChangeLog

If the LANG-setting portion is changed to use C, then it works as
expected (others such as "de" seem to work too):

   echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=C gawk '/[Cc]hangeLog/ { print }'

yields:

   --- orig/lisp/ChangeLog
   +++ mod/lisp/ChangeLog

I'm not sure if the actual encoding has any impact -- ja_JP, ja_JP.utf8,
and ja_JP.eucjp all exhibit the same problem.

Thanks,

-Miles

-- System Information:
Debian Release: 3.1
  APT prefers unstable
  APT policy: (500, 'unstable'), (101, 'experimental')
Architecture: i386 (i686)
Kernel: Linux 2.6.8.1
Locale: LANG=ja_JP.UTF-8, LC_CTYPE=ja_JP.UTF-8

Versions of packages gawk depends on:
ii libc6 2.3.2.ds1-16 GNU C Library: Shared libraries an

-- no debconf information

Revision history for this message
Martin Pitt (pitti) wrote :

(In reply to comment #2)
> Downgrading gawk to version 1:3.1.3-3 prevents the problem.

This is exactly the version that Warty ships. I also checked it:

$ echo -e '--- orig/lisp/ChangeLog\n+++ mod/lisp/ChangeLog' | LANG=ja_JP gawk '/[Cc]hangeLog/ { print }'
--- orig/lisp/ChangeLog
+++ mod/lisp/ChangeLog

Closing as NOTWARTY. The Debian version already has a patch and certainly will
be fixed soon, too.

Revision history for this message
Fumitoshi UKAI (ukai) wrote on Mon, 18 Oct 2004 12:47:03 -0400: Fixed in NMU of gawk 1:3.1.4-1.1

tag 266519 + fixed
tag 276201 + fixed

quit

This message was generated automatically in response to a
non-maintainer upload. The .changes file follows.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Format: 1.7
Date: Tue, 19 Oct 2004 01:16:27 +0900
Source: gawk
Binary: gawk
Architecture: source i386
Version: 1:3.1.4-1.1
Distribution: unstable
Urgency: low
Maintainer: James Troup <email address hidden>
Changed-By: Fumitoshi UKAI <email address hidden>
Description:
 gawk - GNU awk, a pattern scanning and processing language
Closes: 266519 276201
Changes:
 gawk (1:3.1.4-1.1) unstable; urgency=low
 .
   * NMU to fix RC bugs
   * 10_dfa.c-no-go_fast.dpatch: new patch by Fumitoshi UKAI
      to fix odd regexp matching in multibyte locales (UTF-8, CJK, ..)
      closes: Bug#266519
   * 11_dfa.c-ignorecase.dpatch: new patch by Fumitoshi UKAI
      to fix CASEIGNORE match on [:upper:] and [:lower:] in
      multibyte locales (UTF-8, CJK, ...)
      closes: Bug#276201
Files:
 47cdd14a4532a07d540cb6be156f0e22 557 interpreters optional gawk_3.1.4-1.1.dsc
 0e16583a1390c72b8ba73929466ce6df 9225 interpreters optional gawk_3.1.4-1.1.diff.gz
 a1a43961a3154a311aded33168c6cb1a 983300 interpreters optional gawk_3.1.4-1.1_i386.deb

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFBc+029D5yZjzIjAkRAvxHAKC05uoZgw8msEe73szYw9FU12nxrgCgkWCe
B8rEeS5lv/Mw5rIPLqXfPWo=
=urId
-----END PGP SIGNATURE-----

Revision history for this message
Fumitoshi UKAI (ukai) wrote on Thu, 28 Oct 2004 12:04:42 +0900: rc bug for sarge

# grep
tags 249245 - fixed
tags 249245 + sarge
tags 274352 - fixed
tags 274352 + sarge
tags 276202 - fixed
tags 276202 + sarge
tags 276209 - fixed
tags 276209 + sarge
# gawk
tags 266519 - fixed
tags 266519 + sarge
tags 276201 - fixed
tags 276201 + sarge
tags 276206 - fixed
tags 276206 + sarge
tags 277122 - fixed
tags 277122 + sarge
tags 264829 - fixed
tags 264829 + sarge
tags 266043 - fixed
tags 266043 + sarge
tags 271231 - fixed
tags 271231 + sarge

Revision history for this message
Oded Shimon (ods15) wrote on Fri, 29 Oct 2004 12:15:17 +0200: Patch: Odd regexp matching problem if locale's mb_cur_max > 1

Package: gawk
Version: 1:3.1.4-1
Followup-For: Bug #266519

I have a patch for this bug which does not involve removing go_fast, but
it does involve adding a check loop. I believe this is still faster than
the previous patch, and it was the best I could do with my programming
knowledge.
I checked, this patch actually compiles and fixes the bug. :)

Regards,
- ods15

diff -u dfa.c ~/sources/debian/gawk-3.1.4/

--- dfa.c 2004-10-29 11:58:47.000000000 +0200
+++ /home/ods15/sources/debian/gawk-3.1.4/dfa.c 2004-10-29 12:00:15.000000000 +0200
@@ -2895,6 +2895,10 @@
   register unsigned char eol = eolbyte; /* Likewise for eolbyte. */
   static int sbit[NOTCHAR]; /* Table for anding with d->success. */
   static int sbit_init;
+ static unsigned char * sameas; /* a simple check that the content
+ between begin and end are indeed
+ what they used to be */
+ static int sizesameas;

   if (! sbit_init)
     {
@@ -2918,14 +2922,31 @@
   if (MB_CUR_MAX > 1)
     {
       int remain_bytes, i;
+
+ if (!sameas) {
+ MALLOC(sameas, unsigned char, end - begin + 2);
+ memset(sameas, 0, sizeof(unsigned char) * (end - begin + 1));
+ sizesameas = end - begin + 1;
+ }
+
       buf_begin -= buf_offset;
       if (buf_begin <= (unsigned char const *)begin && (unsigned char const *) end <= buf_end) {
+ int yesgood = sizesameas == end - begin + 1;
+ for (i = 0; i < sizesameas && yesgood; i++) {
+ if (sameas[i] != begin[i]) yesgood = 0;
+ }
+ if (yesgood) {
  buf_offset = (unsigned char const *)begin - buf_begin;
  buf_begin = begin;
  buf_end = end;
  goto go_fast;
+ }
       }

+ REALLOC(sameas, unsigned char, end - begin + 2);
+ for (i = 0; i < end - begin + 1; i++) sameas[i] = begin[i];
+ sizesameas = end - begin + 1;
+
       buf_offset = 0;
       buf_begin = begin;
       buf_end = end;

-- System Information:
Debian Release: 3.1
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)
Kernel: Linux 2.6.6
Locale: LANG=C, LC_CTYPE=C

Versions of packages gawk depends on:
ii libc6 2.3.2.ds1-18 GNU C Library: Shared libraries an

-- no debconf information
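
The idea behind this patch, keeping the fast path but confirming that the bytes are unchanged before taking it, can be sketched in isolation. This is a simplified illustration rather than the patch itself: it uses memcmp() and realloc() instead of the patch's comparison loop and gawk's MALLOC/REALLOC macros, and count_letters, struct content_cache and letters_cached_by_content are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for expensive per-line preprocessing (an assumption for this
   sketch): counts the ASCII letters in [begin, end). */
size_t count_letters(const char *begin, const char *end)
{
    size_t n = 0;
    const char *p;

    for (p = begin; p < end; p++)
        if (isalpha((unsigned char) *p))
            n++;
    return n;
}

struct content_cache {
    const char *begin;   /* address range seen on the last miss */
    const char *end;
    char *copy;          /* private copy of the bytes seen then */
    size_t value;        /* result computed for those bytes */
    unsigned misses;     /* number of recomputations, for inspection */
};

/* The fast path is taken only if both the addresses and the bytes
   match what was seen last time. */
size_t letters_cached_by_content(struct content_cache *c,
                                 const char *begin, const char *end)
{
    size_t len = (size_t) (end - begin);
    char *copy;

    if (c->copy != NULL && c->begin == begin && c->end == end
        && memcmp(c->copy, begin, len) == 0)
        return c->value;

    c->misses++;
    c->value = count_letters(begin, end);

    /* Remember the bytes so that the next call can validate them. */
    copy = realloc(c->copy, len > 0 ? len : 1);
    if (copy == NULL) {
        /* Out of memory: drop the cache, the result is still correct. */
        free(c->copy);
        c->copy = NULL;
        return c->value;
    }
    memcpy(copy, begin, len);
    c->copy = copy;
    c->begin = begin;
    c->end = end;
    return c->value;
}
```

The comparison costs time proportional to the line length on every call, which is the trade-off the message mentions: the check loop is extra work, but a hit still skips the recomputation. The caller is expected to free the copy member when the cache is no longer needed.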

Revision history for this message
Tatsuya Kinoshita (tats) wrote on Wed, 03 Nov 2004 20:44:55 +0900 (JST): Re: Bug#266519: Patch: Odd regexp matching problem if locale's mb_cur_max > 1

Hi, Fumitoshi,

Thanks for the NMU.

BTW, how about the following patch?

On October 29, 2004 at 12:15PM +0200,
ods15 (at ods15.dyndns.org) wrote:

> Package: gawk
> Version: 1:3.1.4-1
> Followup-For: Bug #266519

> I have a patch for this bug which does not involve removing go_fast, but
> it does involve adding a check loop. I believe this is still faster than
> the previous patch, and it was the best I could do with my programming
> knowledge.
> I checked, this patch actually compiles and fixes the bug. :)
>
> Regards,
> - ods15
>
>
> diff -u dfa.c ~/sources/debian/gawk-3.1.4/
>
> --- dfa.c 2004-10-29 11:58:47.000000000 +0200
> +++ /home/ods15/sources/debian/gawk-3.1.4/dfa.c 2004-10-29 12:00:15.000000000 +0200
> @@ -2895,6 +2895,10 @@
> register unsigned char eol = eolbyte; /* Likewise for eolbyte. */
> static int sbit[NOTCHAR]; /* Table for anding with d->success. */
> static int sbit_init;
> + static unsigned char * sameas; /* a simple check that the content
> + between begin and end are indeed
> + what they used to be */
> + static int sizesameas;
>
> if (! sbit_init)
> {
> @@ -2918,14 +2922,31 @@
> if (MB_CUR_MAX > 1)
> {
> int remain_bytes, i;
> +
> + if (!sameas) {
> + MALLOC(sameas, unsigned char, end - begin + 2);
> + memset(sameas, 0, sizeof(unsigned char) * (end - begin + 1));
> + sizesameas = end - begin + 1;
> + }
> +
> buf_begin -= buf_offset;
> if (buf_begin <= (unsigned char const *)begin && (unsigned char const *) end <= buf_end) {
> + int yesgood = sizesameas == end - begin + 1;
> + for (i = 0; i < sizesameas && yesgood; i++) {
> + if (sameas[i] != begin[i]) yesgood = 0;
> + }
> + if (yesgood) {
> buf_offset = (unsigned char const *)begin - buf_begin;
> buf_begin = begin;
> buf_end = end;
> goto go_fast;
> + }
> }
>
> + REALLOC(sameas, unsigned char, end - begin + 2);
> + for (i = 0; i < end - begin + 1; i++) sameas[i] = begin[i];
> + sizesameas = end - begin + 1;
> +
> buf_offset = 0;
> buf_begin = begin;
> buf_end = end;

--
Tatsuya Kinoshita

Revision history for this message
James Troup (james-nocrew) wrote on Fri, 26 Nov 2004 14:02:14 -0500: Bug#266519: fixed in gawk 1:3.1.4-2

Source: gawk
Source-Version: 1:3.1.4-2

We believe that the bug you reported is fixed in the latest version of
gawk, which is due to be installed in the Debian FTP archive:

gawk_3.1.4-2.diff.gz
  to pool/main/g/gawk/gawk_3.1.4-2.diff.gz
gawk_3.1.4-2.dsc
  to pool/main/g/gawk/gawk_3.1.4-2.dsc
gawk_3.1.4-2_i386.deb
  to pool/main/g/gawk/gawk_3.1.4-2_i386.deb

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed. If you
have further comments please address them to <email address hidden>,
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
James Troup <email address hidden> (supplier of updated gawk package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing <email address hidden>)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Format: 1.7
Date: Fri, 26 Nov 2004 18:30:42 +0000
Source: gawk
Binary: gawk
Architecture: source i386
Version: 1:3.1.4-2
Distribution: unstable
Urgency: low
Maintainer: James Troup <email address hidden>
Changed-By: James Troup <email address hidden>
Description:
 gawk - GNU awk, a pattern scanning and processing language
Closes: 263964 266519 276201 276206 277122 278135
Changes:
 gawk (1:3.1.4-2) unstable; urgency=low
 .
   * 14_io.c-fix-redirect-hang.dpatch: new patch which reverts io.c changes
     that wait() when a redirect hits EOF without checking whether or not
     this is the kind of redirect which would have an orphan to wait() on.
     Closes: #263964
 .
   * debian/control (Build-Depends): Add a versioned build-depends on a
     fixed binutils for m68k. Closes: #278135
 .
   * Merge in NMU changes. Many thanks to Fumitoshi UKAI. Closes:
     #276206, #277122, #266519, #276201
 .
   * 11_dfa.c-ignorecase.dpatch, 12_dfa.c-ignorecase-range.dpath,
     13_dfa.c-charclass-bracket.dpatch: revert to old-style dpatch patch so
     that it works for me.
 .
   * 10_dfa.c-no-go_fast.dpatch: replaced...
   * 10_dfa.c-disable-cache.dpatch: ... with this. Which is upstream's fix
     for the same problem.
 .
   * 15_builtin.c-fix-wide-char.dpatch: new patch by Stephen Kasal to fix
     wide-char to{lower,upper}() handling.
 .
   * 16_awkgram.y-stop-at-eof.dpatch: new patch by Andreas Schwab to stop
     gawk reading past the end of the file for an awk script that is big
     enough to fill more than a buffer's worth and does not end with a
     newline.
 .
   * 17_fix-non-numeric-constants.dpatch: new patch by Aharon Robbins to
     improve handling of non-numeric constants so that numbers like 00.34
     don't get confused as being octal.
Files:
 492e13079781d176c5b589d64bcaaedb 1221 interpreters optional gawk_3.1.4-2.dsc
 a175a8e9572d74150d3ff6072b4f64df 14896 interpreters optional gawk_3.1.4-2.diff.gz
 262ea208b69d0fb65d71b5cbb1708881 995324 interpreters optional gawk_3.1.4-2_i386.deb

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iQIVAwUBQad8NNfD8TGrKpH1AQIN2g/+PvhVX2LyNwKzjZK6q5gW2dZqyj+sgkHS
6YsNJPlGlnroFGnRi/mQwKPv0B2orTjRCbYrE4ROuuiEY8zl05S9jKGP...


Changed in gawk:
status: Unknown → Fix Released