Frustrating day. I'm working on an algorithm to compare HTML template similarity using the HTML DOM tree. It's been nearly 80 hours of thinking and brainstorming, and the approach looks less promising than ever.
What I'm trying to build is a search engine, from the crawler to the results page. I'm kind of stuck at the data processing stage. Stripping HTML, structuring groups of data, and identifying the key topic of a page are not as easy as they seem. Every one of them is a PROBLEM!
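To make the DOM-tree idea a bit more concrete, here is a rough sketch of one possible way to score template similarity. It is not the algorithm I'm stuck on, just an assumed baseline: collect the set of root-to-element tag paths from each page and take the Jaccard overlap of the two sets. The names `TagPathCollector`, `tag_paths`, and `template_similarity` are made up for illustration, and it uses only Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects every root-to-element tag path, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.paths = set()  # distinct tag paths seen so far

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate sloppy markup: unwind to the matching open tag, if any.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def template_similarity(html_a, html_b):
    """Jaccard overlap of tag-path sets: 1.0 means identical structure."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page_a = "<html><body><div><p>Hello</p></div></body></html>"
page_b = "<html><body><div><p>Another post</p></div></body></html>"
page_c = "<html><body><table><tr><td>x</td></tr></table></body></html>"

print(template_similarity(page_a, page_b))
print(template_similarity(page_a, page_c))
```

Text content is ignored on purpose, so two pages built from the same template score high even when the posts differ. The obvious weakness is that a set of paths throws away repetition and sibling order, which may matter for pages like thread listings.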

kowkaybin · September 3, 2006 at 11:46 pm
I have ~20k pages from a vBulletin board as data samples, and I'm now running a word analysis program on them to sample word frequencies. From this data I hope to get:
1. Template words (which should not be evaluated at query time)
2. Groupings of words (synonyms, you could call them)
3. *Spelling correction based on some algorithm that I have not come up with yet
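
For item 1, a minimal sketch of how template words might be picked out, assuming a word counts as "template" when it shows up on nearly every page. It counts document frequency (on how many pages a word appears) rather than raw frequency. The function names, the tokenizing regex, and the `min_ratio` cutoff are all assumptions for illustration, and the sample pages are invented, not taken from my data.

```python
import re
from collections import Counter


def document_frequency(pages):
    """Count, for each word, the number of pages it appears on."""
    df = Counter()
    for text in pages:
        # set() so a word is counted once per page, however often it repeats
        df.update(set(re.findall(r"[a-z0-9']+", text.lower())))
    return df


def template_words(pages, min_ratio=0.9):
    """Words present on at least min_ratio of all pages."""
    df = document_frequency(pages)
    threshold = min_ratio * len(pages)
    return {word for word, count in df.items() if count >= threshold}


# Invented sample pages, already stripped of HTML.
pages = [
    "Reply Quote Posted by alice about python",
    "Reply Quote Posted by bob about crawlers",
    "Reply Quote Posted by carol about search",
]

print(sorted(template_words(pages, min_ratio=1.0)))
```

Common stop words will also pass this test, so the result is really "words too common to be useful in a query", which is probably fine for the purpose.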