Fog Creek Software
Discussion Board




extensive memory allocation causing run-time delay

I have a whole bunch of text, (20,000 bytes), and for each unique word in the text, I am putting that word and the count of it's occurances in a HashMap.
to put it in a hash map of key-value pairs with the count as value,
I need put the primitive data type count,which is an int into an
Integer object which is created, and then I can put it in the map.
SO I do Integer I =new Integer(count);
each time.The profiler says I am using a whole amount of time there, I guess due to memory allocation of so many Integer objects for the unique words.
is there any way to cast an int to an Object?
anyway I could avoid using the heap? and optimize?
I am almost using 50% of run-time in this

anon
Thursday, July 03, 2003

Unfortunately, I don't think there is any way to get around creating all those Integer objects if you are using built in Java collections.  Creating objects in Java should be very fast though.  My guess is that the speed hits you're seeing are a result of triggering the garbage collector.  Try null-ing out any unnecessary references before you do this, and then explicitly calling the collector.  This may allow you to get through the insertions with fewer collections.

As a last resort you could always write your own hash table class that handles ints as the value type.

Mike McNertney
Thursday, July 03, 2003

Java?


Thursday, July 03, 2003

try TObjectIntHashMap from http://trove4j.sourceforge.net/. It will let you use a primitive.

jq
Thursday, July 03, 2003

I'm mostly a C/C++ type of programmer, but maybe this makes sense for whatever language you're using.

Allocate a large pool of objects, and eat from that until it's empty.  (And then refill the pool if necessary.)

SG
Thursday, July 03, 2003

I don't think it's a useful suggestion to your particular problem, but it might be interesting to someone. As Integers are immutable value objects they can be freely shared rather than creating new instances all the time.

SaranWrap
Thursday, July 03, 2003

since Integers are immutable, you can share them.  You can do something like:
Integer[] cachedInts = new Integer[MAGIC_NUMBER];

Integer getInteger(int i)
{
  if (i < cachedInts)
  {
      if (cachedInts[i] == null)
        cachedInts[i] = new Integer(i);
      return cachedInts[i];
  }
  return new Integer(i);
}

then later:
theHash.put(word, getInteger(count));

Tweak MAGIC_NUMBER to suit your needs.

Brian
Thursday, July 03, 2003

Mike McNertney, where exactly should I call the garbage collector?

anon
Thursday, July 03, 2003

Well if you have some sort of loop that is adding elements to the HashTable, before that loop would be a good spot.

Obviously there are no guarantees that this will help.  It will probably only be significant if you can identify a lot of object references that you can get rid of and NULL them out before calling the GC.  If this doesn't work, there have been a number of other good suggestions in the thread that should help you out.

Mike McNertney
Thursday, July 03, 2003

To be more specific, the TObjectIntHashMap and the idea of caching Integer objects based on the int value both have a lot of merit and either would probably speed your application significantly.  Which you try is up to you.  The Integer cache sounds very promising for your application since I imagine the vast majority of your words have a relatively small count (say less than 10 or 20) and thus you would have a lot of re-use of Integer objects

Mike McNertney
Thursday, July 03, 2003

You never said if your application is too slow. If it's not just build the map, damn the runtime, and move on.

valraven
Thursday, July 03, 2003

The simplest approach is to create a WordCount object with getCount() and IncreaseCount() properties, and an int private member.

When you create a new hash entry, then you create a new WordCount object and put it with the words as an index. 

When you encounter an existing entry, you just retrieve the corresponding WordCount object and call increase.

You may also find that creating your own class of string with a more efficient equals method will speed things up.

This type of thing is so much quicker using Jython.

Ged Byrne
Thursday, July 03, 2003

DO NOT MANUALLY CALL THE GARBAGE COLLECTOR!

Modern GC's are well tuned these days, and manually kicking them off will just waste more time than letting it do its thing automatically.

For more detail, read this:

http://www.neward.net/ted/weblog/index.jsp?date=20030413#1050273570331

Chris Tavares
Thursday, July 03, 2003

would setting object references to null without explicitely calling the garbage collector help?

anon
Thursday, July 03, 2003

Yes it could.  That would mean you'd be able to reclaim more memory the first time the GC is called.

It's far more likely though that some of the other suggestions mentioned will benefit you.

Mike McNertney
Thursday, July 03, 2003

Chris has a good point.  It is probably best not to explicitly call the GC.  As I mentioned before the other ideas expressed here should substantially cut down on the number of allocations required and should probably help your execution speed

Mike McNertney
Thursday, July 03, 2003

Do it Ged's way.  I was too busy answering the question to see the big picture.
From your description, it doesn't sound like calling GC or nulling references would help any.  You need to make less garbage in the first place.

Brian
Thursday, July 03, 2003

Anon here is a quick demo.  It takes a reader so you should have no trouble adapting it to take a file.

It will need a bit of work to filter out any punctuation, but not much.

Notice that you don't even have to put the WordCount objects back in the map.  This is because the map returns a pointer to the existing object.  Both your object reference and the map reference point to the same object.

Hope this helps.

__________________________________________
import java.util.*;
import java.io.*;

class HashCount {
    private StreamTokenizer st;
    private HashMap hm = new HashMap();
   
    static void main(String[] args) throws Exception {
        HashCount hc = new HashCount(new StringReader("All together now\nI am he and we are he and we are all together\nAre we all singing"));
            hc.tally();
    }
   
    public HashCount(Reader in) {
        st = new StreamTokenizer(in);
    }

    public void tally() throws Exception {
            String word;
            while (st.nextToken()!=st.TT_EOF) {
                word = st.sval.toLowerCase();
                if (hm.containsKey(word)) {
                    // Increase Count
                    WordCount wc = (WordCount) hm.get(word);
                    wc.increaseCount();
                } else {
                    // New Counter
                    hm.put(word, new WordCount(1));
                }
            }
            System.out.println(hm);
    }

    private class WordCount {
        private int i;
       
        public WordCount(int startValue) {
            i = startValue;
        }
       
        public int getCount() {
            return i;
        }
       
        public void increaseCount() {
            i++;
        }
       
        public String toString() {
            return Integer.toString(i);
        }
    }
}

Ged Byrne
Thursday, July 03, 2003

Should give the result:

{are=3, singing=1, now=1, we=3, he=2, i=1, together=2, and=2, am=1, all=3}

Ged Byrne
Thursday, July 03, 2003

Type-Specific Collections Library:
http://www.sosnoski.com/opensrc/tclib/


ObjectIntHashMap
http://www.sosnoski.com/opensrc/tclib/docs/com/sosnoski/util/hashmap/ObjectIntHashMap.html

"Hash map using Object values as keys mapped to primitive int values"

Evgeny Goldin
Thursday, July 03, 2003

I'll second the point about don't call the garbage collector.  Additionally, using a pool of Integers is the wrong way to go.  Unless creating an object is extremely heavyweight (IE database connections), you'll want to just create a new object.  Lightweight objects such as Integers are not candidates for pooling.  One possible suggestion is to instantiate the HashMap with predefined size.  The default for java.util.HashMap is 16.

mph
Thursday, July 03, 2003

Integers and Chars are the PERFECT examples of the Flyweight pattern (see GoF).

ice
Friday, July 04, 2003

*  Recent Topics

*  Fog Creek Home