Fog Creek Software
g
Discussion Board




Frustated by (Near) Impossible Algorithm Challenge

Am I the only one who feels like this (I sort of suspect not but I never see it talked about on programming web sites)...

I find most programming routine (which is not say that I don't enjoy it).  But you put together a GUI, query a database, optimize for performance, or hunt bugs ... you may have to think, may have to look stuff up or work round bugs in your environment etc., - so it can be a challenge, but it's a challenge you know you can solve eventually given the time and effort.

But...

Once in (rare) while, I hit a problem of a qualitively different nature.  Figure out an algorithm to solve this question.  And the problem is often too large, too complex or too many variables to work out on pen and paper - nor is it in a book, although sometimes you find similar, problems in books, but it ain't quite the same.

Last year, for example, I had some math-related problems of this nature.  I looked in a lot of programming and maths books/sites, but this didn't really help.  I used probably a ream of paper trying to work answers, no help.  Eventually I got solutions to part from a maths expert guy, brute-forced others, and "good-enough" approximated the rest, to eventually over-come the obstacle - but it took me several months.

The last few days, I have hit a similar kind of problem, although this time it's sort of compression related.  It's not something that I have ever seen discussed anywhere. nor a standard algorithm.  It's wierd enough, that in its fully glory, I don't even know if it would be on-topic on compression newsgroups.  I do have a partial answer, but I'm not sure the problem is solveable in the general case -- a information-theory expert could probably answer this part, atleast to 1st order approximation.  I'll keep plugging away at this, but I'm getting frustrated now :-)


Anyway, I am the only one who experiences the "near impossible algorithm challenge" periodicially?  If am not, how do you overcome it? 

This is my current frustration talking - but I really don't fancy spending several months on my current problem.

S. Tanna
Tuesday, January 27, 2004

All the time. Its usually overcome by something called
"Junk coding"

Karthik 
Tuesday, January 27, 2004

What's the problem, specifically?

bob
Tuesday, January 27, 2004

Damn, I would love to have challenges like that to work on instead of the usual "why doesn't this API for which I don't have the source code work right and what braindead thing must I do to work around it?

Joel Spolsky
Tuesday, January 27, 2004

Sounds fantastic! What's the problem again?

Mr Jack
Tuesday, January 27, 2004

If anyone of you has the time, inclination and the skill to solve complex mathematical probs., here's one I've been banging on for about a year now.

Rectangle PQRS with lengths A & B
PQ || SR = A
QR || PS = B
Area = AB = C
Diagonal = SQ
t = Angle PSQ = 45 deg.
tan(t) = A/B = 1
As t reduces to 0, PS increases & PQ decreases. So the function has limits of 45 deg. -> 0.deg.

Problem: Generate the function. And solve it for all N, where N is a Natural number.

Can it be solved? Or can it be _proven_ unsolvable?

Regards

KayJay

Indian Developer in India
Tuesday, January 27, 2004

Joel: I have those problems too sometimes :-) However the grass with my kind of problem is a lot lesser green than you think.  A lot of satisfaction from development comes from achieving stuff. 


This particular problem:

It's a compression type problem. This is slightly simplied explanation:

1. Given, Input = Is a stream of printable and non-printable ASCII characters (ASCII 0 to ASCII 255), can be any length

2. Create Output, that is all of:

(i) Contains only printable characters (ASCII 32 to 127) - (or rather certain printable characters - see note)
The note - There a few printable characters that are best to exclude from this set  [or you get a whole different set of complications], specifically: " \ / <

(ii) Is itself a program (in source form) to generate

(iii) Is shorter in bytes than the input stream.

You can ignore (iii), within sensible limits, if the input is very short.

S. Tanna
Tuesday, January 27, 2004

Typo:

(ii) Is itself a program (in source form) to generate

Should be:

(ii) Is itself a program (in source form) to generate the exact original input, non-printable characters and all.

S. Tanna
Tuesday, January 27, 2004

Details! Details! What are friends for?

Alex.ro
Tuesday, January 27, 2004

Which function?

A function for the lengths of A and B in terms of the angle t?

Unless I'm missing in something here, it would be:

A = SQ * sine(t)
B = SQ * cosine(t)

Rob VH
Tuesday, January 27, 2004

and you've got a sample of the kind of data stream you see?  Played at RLE etc?

Maybe if it is compressible enough, just base64 (with a substitute for the /) the binary output of gzip?

i like i
Tuesday, January 27, 2004

Sorry. Forgot to state that my reply was for KayJay...

Rob VH
Tuesday, January 27, 2004

Not too particuar of hi-jacking this thread, I'll start a new one.

Rob, take into consideration ".......for all N, where N is a Natural number." An algebraic definition of N is beyond me.

To give you a hint. The area "C" is the primary input.

Indian Developer in India
Tuesday, January 27, 2004

S. Tanna, perhaps the problem is theoretically impossible to solve if the input stream is random?

Christopher Wells
Tuesday, January 27, 2004

Darn forgot a bit

(iii)(a) The output must not only be as much shorter than the input, but as much shorter as possible


Some other bits that are unstated, but should be implied from what I said in the original definitions

(iv) As the decompression source code counts (not just the compressed data) towards the output's total length. There is an obvious trade off here between
- using a more complex algorithm (potentially achieving greater compression)
- versus -  the extra source code required to decompress more complexly compressed data (remember decompression source code counts towards the output's total length)

(v) The input has 8 bits per byte (printable and non-printable characters), the output has approx 6.5 bits per byte (printable characters only). 

If you imagine a niave conversion with no compression (and forget the source code being in the output for a minute) between a 8 bit/byte and 6.5 bit/byte format, the output would be more bytes in the input.   

So, even before you get to consider the source code overhead added to the output, I have to achieve a significant amount of compression, just to keep the output to the same length as the input.

S. Tanna
Tuesday, January 27, 2004

""why doesn't this API for which I don't have the source code work right and what braindead thing must I do to work around it?"

Well, there is an answer for that - Linux.  Not that it won't contain other brain deadness

Mike
Tuesday, January 27, 2004

Christopher Wells,

You are right, if the data was truly random, it would not be solveable

Look at my previous post about 8 bits/byte vs 6.5 bits/byte.  This indicates it is only solveable (even ignoring the output's source code overhead) if a significant amount of compression can be achieved by one method or another.

In short, for some cases, it's not solveable to make the output shorter than the input.

In these cases, we want to make the output, as little longer than the input as possible.

S. Tanna
Tuesday, January 27, 2004

> You are right, if the data was truly random, it would not be solveable

The thing is, then, that to solve it you must know in what sense it's non-random. For example, if some letters appear more frequently than others then consider huffman encoding (like morse code); or if some letter-combinations appear more frequently than others then consider dictionary encoding; or if the bytes change slowly (as in an picture) then encode the run-length or the delta. The key to compression is knowing the characteristics of (what's non-random about) the data.

Christopher Wells
Tuesday, January 27, 2004

If the input stream is already compressed to death, there is no way to squeeze a single extra byte.

So, with the source code overhead, you can't do it.

On the other hand, if it's compressible, and large enough that the few K (?) of decompression source code won't matter, your only worry is how to handle the non-printable characters.

Would the source code be C? Because a lot would depend on the language.

I'm really very curious -- where on earth would this be needed?

Alex.ro
Tuesday, January 27, 2004

Presuming you get the entire input stream before you have to compress it, you can simply LOOK for "all possible" ways to compress it (okay, or just have a table and try each one). Not a very efficient method, granted, but its a start.

Interesting problem S. Tanna. I'll have to think about that a bit. I admit to some curiousity as tyo where this problem came up, if you want to share.

My biggest situation like this was when I was trying to do some manipulation of a data structure to achieve a particular end, and it was taking a really long time for some reason. After a couple hours of effort, I realized I was trying to solve a variant of "greatest common ancestor of two graphs", which is N-P complete. Made me sad. :P

Steve C.
Tuesday, January 27, 2004

While I agree that this is an interesting discussion, I don't think it's an "Algorithm Challenge" as much as it's a challenge of dealing with very narrow technical specifications.

To me, an **algorithm*challenge** typically means that I've encountered a scenario when brute-forcing the solution to a problem would take an impossible amount of processing power. So, I have to find a short-cut method for arriving at the exact same solution, without brute-forcing it.

Benji Smith
Tuesday, January 27, 2004

Language of source is really important. For myself, I'd suggest doing it in a new language I've invented which has syntax:

1) If the source file contains any non-printing character, then the program echoes the source file

2) Under any other circumstances it behaves like C.

I'd happily write you a compiler for you, as long as I can have a callable C compiler.

Also, if this is real app, find out how much compression is REALLY required. Theoretically maximal or just 'as much as we can do under the circumstances'?

David Clayworth
Tuesday, January 27, 2004

S. Tanna,

Here's a python implementation  (I hope the quotes come through correctly.)

I'm able to get about 50% compression on some files, even with the ASCII encoding

#!/usr/bin/python

import base64
import zlib
import sys
import string

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "usage: %s <input file> <output file>" % sys.argv[0]
        sys.exit(-1)

    rawdata = file(sys.argv[1]).read()
    encdata = base64.encodestring(zlib.compress(rawdata))

    program = """#!/usr/bin/python
import zlib
import base64
   
filename="%s.new"
data=\"\"\"%s\"\"\"

comdata=base64.decodestring(data)
output=zlib.decompress(comdata)
file(filename, 'w').write(output)
""" % (sys.argv[1], encdata)

    file(sys.argv[2], 'w').write(program)

Jay Monkman
Tuesday, January 27, 2004

Look into arithmetic compression - it compresses a stream to the theoretical minimum length suggested by its information content.  The information content is subject to your modeling, or how well you can predict what the next character will be - if you just go by the character distribution (order 0 modeling), it will be ok, if you use the character or characters before the current one to determine the likelihood of a given character being the current one, you can reduce your theoretical information content.

Basically, you can't get better compression than arithmetic compression, although you also can't ignore the modeling.  The downside of arithmetic compression is that it's generally slower than other methods, but you probably don't care for this particular problem.  (And the downside of higher-order modeling is that it takes lots more memory, which you also may not care about.)

I'd imagine it would come down to writing a very small arithmetic decompression engine.

schmoe
Tuesday, January 27, 2004

I was thinking -- it's really expensive to express non-printing characters in source code.

How do you express a stream of bytes 0, 1, 17, 255?

You can go

out(1);
out(1);
out(17);
etc.

or a={0,1,17,255};

either way, that's a lot more than four bytes.

More to the point, say you have a sequence of bytes repeating 5 times.

Any winzip will tell you that's very good compressible material. But representing that sequence of bytes in source code, and then adding "repeat 5 times;" would be very expensive.

So empirically, I don't think your problem can be solved, unless you have something like a million zero bytes.

Alex.ro
Tuesday, January 27, 2004

The question of "which language" is essentially irrelevant.  Tell you what, let's define this batch file as "codegen.bat":

pkzip -add tmp.zip %1
uuencode < tmp.zip > %2

and this one as "runprogram.bat"

uudecode < %1 > tmp.zip
pkzip -extract tmp.zip

Now you have a code generator and a runtime engine, therefore you have a language.  Its an excellent language for this problem space -- its very concise, though not very readable. 

But it is a language -- the code generator takes as its input arbitrary files and produces "programs", the runtime engine "runs" those programs and the semantics of the program is to produce the file described by the program.

That seems to satisfy all your constraints, assuming that pkzip can compress the file by 20%. 

If it can't, well, you are aware you're asking for the impossible, right?  There are no compression algorithms that compress EVERYTHING.

Eric Lippert
Tuesday, January 27, 2004

Eric,

Thanks, but I think you misunderstand the problem (or I didn't explain it right)

From what I understand, is your solution is:
1. program #1 "codegen.bat" - get the input, compress into (and create) a file, let say file "F" (for examle)
2. program #2 "runprogram.bat" - decompress the file, e.g. file F

Unfortunately that is *not* the problem.  The problem is F and program #2 are supposed to be the same thing. i.e. a correct solution would

1. Program #1 *creates* program #2. There is no external file F.


I realize I was vague about the language etc. But it is relevant.

What I should have said also, is program #2 is self-contained.  It can do strings/math type functions in my code.

But I can not go off and rely on say having a copy of  pkzip (or zip library) or uudecode (or library version of it) in program #2 - unless I include the **source** code for that as part of program #2 (which of course increases it's size).


BTW I do have a working solution now, but I'm sure it's suboptimal in terms of compression. 

S. Tanna
Tuesday, January 27, 2004

Compression is something that people study for decades, and tens of thousands of pages have been written on it.  You are not going to get an "optimal" solution in a week, month, or a year.  Get the customer to agree to a certain threshold (X% of the time it must be compressed to Y% of the original size), and stop when you get there.

Any why bother to go to such extremes to reinvent the wheel?  Just use regular zip compression followed by something like a binhex encoding of the result and you're done.

T. Norman
Tuesday, January 27, 2004

*  Recent Topics

*  Fog Creek Home