Re: interaction between tr and s (was Re: tr question -- probably wrong list to ask, but ...)

by Joel Rees, Dec 3, 2007 at 06:41 PM

For the record --

> Is UTF-8 input coming from the likes of Apache a possible source of  
> failure? Pack may need to allow for endian-ness of a specific machine.

Well, it depends on how one looks at things, perhaps. I think one of  
the probable reasons for the failure in the DWIM machinery was that I  
am insisting on using Shift-JIS characters in the source file instead  
of utf-8 in strings and comments. But, no, Apache wasn't filtering  
Shift-JIS to utf-8 for me. Byte order also was not the problem.

After several hours of analysis (using more of the stuff that made  
the original posting of the source somewhat opaque), I determined  
that the problem derived from perl sometimes being stricter about  
Shift-JIS than I wanted it to be.

I don't know why the '+' substitute for space would switch to strict  
character interpretation, but it seems to have been doing so.

Shift-JIS is a variable byte width encoding, one or two bytes. Lead  
bytes are inherently not valid as single-byte characters. Trailing  
bytes are sometimes valid as single-byte characters and sometimes  
not. If the regular expression engine is not checking for valid  
bytes, all you have to do is string the decoded bytes together. But  
if it is checking for valid bytes, you have to put the decoded bytes  
into something other than a char. (Blame C for folding the type of a  
byte onto the type of a character.)
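
To make the lead/trailing byte distinction concrete, here is a minimal sketch that classifies byte values using the commonly documented Shift-JIS ranges (lead bytes 0x81-0x9F and 0xE0-0xFC, trailing bytes 0x40-0x7E and 0x80-0xFC). The helper names sjis_byte_class and sjis_is_trail are hypothetical and not part of the code in this post, and vendor extensions may use slightly different ranges.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a byte value (0..255) as it would be
# read at the START of a character, using the common Shift-JIS ranges.
sub sjis_byte_class {
    my ($byte) = @_;
    return 'lead'
        if ( $byte >= 0x81 && $byte <= 0x9F )
        || ( $byte >= 0xE0 && $byte <= 0xFC );
    return 'single'
        if $byte <= 0x7F || ( $byte >= 0xA1 && $byte <= 0xDF );
    return 'invalid';    # 0x80, 0xA0, 0xFD-0xFF
}

# Hypothetical helper: true if the byte may FOLLOW a lead byte.
sub sjis_is_trail {
    my ($byte) = @_;
    my $ok = ( $byte >= 0x40 && $byte <= 0x7E )
          || ( $byte >= 0x80 && $byte <= 0xFC );
    return $ok ? 1 : 0;
}

# 0x83 0x41 is one double-byte character whose trailing byte is also
# ASCII 'A'; 0x80 is a legal trailing byte but not a legal single byte.
for my $byte ( 0x83, 0x41, 0x80 ) {
    printf "0x%02X: start=%s trail=%d\n",
        $byte, sjis_byte_class($byte), sjis_is_trail($byte);
}
```

The overlap between the trailing-byte range and ASCII is why a decoder has to track whether it is in the middle of a character rather than judging each byte on its own.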

But if you are collecting into 16-bit words, you have to actually  
check for the lead bytes yourself. I'm sure someone could put an RE  
together that would do it, but I just decided it was going to be  
simpler to check and build the string by hand.
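
For comparison, a regular expression along the lines mentioned above might look like the following sketch. It is not what the code in this post does. The pattern uses the commonly documented Shift-JIS byte ranges, and the names $SJIS_CHAR and sjis_chars are hypothetical. It assumes the input has already been percent-decoded into a plain byte string.

```perl
use strict;
use warnings;

# One Shift-JIS character: a single byte, or a lead byte plus a trail byte.
my $SJIS_CHAR = qr{
      [\x00-\x7F\xA1-\xDF]                      # ASCII or half-width katakana
    | [\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]  # lead byte, then trail byte
}x;

# Hypothetical helper: split a byte string into Shift-JIS characters.
# Returns an array reference, or undef if some byte does not fit.
sub sjis_chars {
    my ($bytes) = @_;
    my @chars;
    pos($bytes) = 0;
    # \G anchors each match where the previous one ended; /c keeps
    # pos() in place when the match finally fails.
    while ( $bytes =~ /\G($SJIS_CHAR)/gc ) {
        push @chars, $1;
    }
    return undef if pos($bytes) != length($bytes);    # leftover bad bytes
    return \@chars;
}

my $chars = sjis_chars("\x83\x41A");    # one double-byte char, then 'A'
print scalar( @{$chars} ), " characters\n" if $chars;
```

Anchoring with \G matters here: without it the engine could skip a stray lead byte and resynchronize in the middle of a character.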

So, for anybody who's curious, here's what I'm doing for now:

-----------------------------------------
my $qString = $ENV{'QUERY_STRING'};
# The array's name was lost to address munging in the archive;
# @pairs is a stand-in.
my @pairs = split( '&', $qString, 10 );
my %queries = ();
foreach my $pair ( @pairs )
{	my ( $key, $value ) = split( '=', $pair, 2 );
	# Really should just give in and use CGI.
	# $key =~ tr/+/ /;	# You don't expect space in identifiers, but, ...
	$key =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;

	# $queries{ $key . '_' } = $value; # dbg
	
	$value =~ tr/+/ /;
	
	# $hexAccm is defined only while a lead byte is waiting for its trailing byte.
	my ( $byteAccm, $hexAccm, $conv ) = ( 0, undef, '' );
	while ( $value =~ m/%([\dA-Fa-f][\dA-Fa-f])|(.)/g )
	{	if ( defined ( $1 ) )	# percent-encoded byte
		{	my $hexValue = $1;
			my $decValue = hex ( $hexValue );
			if ( ! defined ( $hexAccm ) )
			{	if ( $decValue <= 0x80 || ( $decValue >= 0xa0 && $decValue < 0xe0 ) || $decValue >= 0xfd )
				{	$conv .= pack( 'C', $decValue );	# single-byte character
				}
				else	# Lead byte -- loose checks all around.
				{	$byteAccm = $decValue;
					$hexAccm = $hexValue;
				}
			}
			else
			{	# Trailing-byte check, disabled for now:
				# if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $decValue < 0xe0 ) )
				# 'n' packs big-endian on any host; the original 'S' is host order.
				$conv .= pack( 'n', ( $byteAccm << 8 ) + $decValue );
				$byteAccm = 0;
				$hexAccm = undef;
			}
		}
		else	# literal (unencoded) character
		{	my $cValue = $2;
			my $decValue = ord ( $cValue );
			if ( ! defined ( $hexAccm ) )
			{	$conv .= $cValue;
			}
			else
			{	# Trailing-byte check, disabled for now:
				# if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $decValue < 0xe0 ) )
				$conv .= pack( 'n', ( $byteAccm << 8 ) + $decValue );
				$byteAccm = 0;
				$hexAccm = undef;
			}
		}
	}

	$queries{ $key } = $conv;
}
-----------------------------------------

If this were production code, I should check some more gaps in the  
lead byte (and check where the newest JIS adds the extra several  
thousand characters) and uncomment the checks on the trailing bytes  
(and add some trailing byte checks specific to certain lead bytes,  
geagh). But then I have to figure out what to do with bad bytes.
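
One possible answer to the bad-byte question, offered as a sketch and not as what the code above does: hand the assembled bytes to the Encode module (bundled with perl since 5.8) and let a strict decode reject anything malformed. The function name decode_sjis_strict is hypothetical. 'shiftjis' is Encode's name for the JIS-based table, and Encode also provides 'cp932' for the Microsoft variant. Which one fits depends on what the browsers actually send.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the decoded character string, or undef
# if the bytes are not valid under Encode's 'shiftjis' table.
sub decode_sjis_strict {
    my ($bytes) = @_;
    my $text = eval {
        # FB_CROAK makes decode() die on bad input instead of substituting;
        # LEAVE_SRC stops it from modifying $bytes in place.
        Encode::decode( 'shiftjis', $bytes,
            Encode::FB_CROAK | Encode::LEAVE_SRC );
    };
    return $text;    # undef when decode() died
}

# A lead byte followed by a byte outside the trailing range is rejected.
for my $bytes ( "\x83\x41", "\x83\x20" ) {
    my $status = defined decode_sjis_strict($bytes) ? 'ok' : 'rejected';
    printf "%s: %s\n", unpack( 'H*', $bytes ), $status;
}
```

Rejecting the whole value is the simplest policy. Substituting a replacement character (Encode's default behavior when no check flag is given) is the other obvious choice.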


Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)
 



