Monday, December 6, 2010

The 2011 model

While working on the code for the search engine, I kept hitting limits where Internet Explorer would not work at all. Search was done by having the JavaScript read the HTML. The problem is that some browser plugins, notably Skype, modify the HTML on the fly, and the search engine needed a very strict HTML representation and could not operate on modified HTML.

So came the idea to recode everything from scratch. In November I started thinking about using JSON exclusively for generating the HTML. In the new version, the engine reads exclusively from the JSON, so it is insensitive to changes made to the HTML by third-party scripts such as ads or browser plugins.
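To make the idea concrete, here is a minimal sketch in modern JavaScript syntax, not the actual engine code. The record fields (`id`, `title`, `author`) and the function names are assumptions made up for illustration; the real JSON schema is not shown in this post. The point is only that the HTML is produced from the JSON and never read back.

```javascript
// Hypothetical records; the real JSON schema is not shown in the post.
const records = [
  { id: 1, title: "Biochemistry", author: "A. Martin" },
  { id: 2, title: "Agrobiology", author: "C. Dupont" },
  { id: 3, title: "Economics", author: "E. Roux" },
];

// Escape text before placing it inside markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the HTML from the JSON. The page is output only.
function renderRecords(list) {
  return list
    .map(function (r) {
      return '<li data-id="' + r.id + '">' +
        escapeHtml(r.title) + " (" + escapeHtml(r.author) + ")</li>";
    })
    .join("\n");
}

// Search reads the JSON records, never the DOM, so markup injected
// by plugins or ad scripts cannot change the result.
function searchRecords(list, term) {
  const needle = term.toLowerCase();
  return list.filter(function (r) {
    return r.title.toLowerCase().indexOf(needle) !== -1;
  });
}
```

In a browser, the string returned by `renderRecords(records)` could be assigned to a container's `innerHTML`, while `searchRecords` keeps working on the in-memory array whatever happens to the page afterwards.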

I also wanted to make search more powerful. Before, if you searched for "bio" you would find "biological" and "biochemistry" but not "agrobiology". If the search term had accents such as é, ô or ï, there were no matches either. To fix that I needed to write a very fast access index of the words, which took two weeks to figure out. One of the first things to understand is that building an index for a mini search engine is very different from building one for a giant search engine like Google. The big players have to deal with an enormous amount of loosely typed or untyped data (every web page is different). By contrast, a small search engine using JSON as a database relies on very structured data, so accessing the data is very different.
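One way such an index could look is sketched below. This is an assumption about the general approach, not the index the engine actually uses: the `catalog` data and the names `fold`, `buildIndex` and `search` are hypothetical. Words are stored with accents stripped and in lower case, and a search term matches any indexed word that contains it, so "bio" also finds "agrobiology".

```javascript
// Hypothetical data; only the title field is indexed in this sketch.
const catalog = [
  { id: 1, title: "Biological sciences" },
  { id: 2, title: "Agrobiology" },
  { id: 3, title: "Économie" },
  { id: 4, title: "Biochemistry" },
];

// Strip accents and lower-case: "Économie" becomes "economie".
// Letters with no Unicode decomposition (such as "ø") are left as-is.
function fold(text) {
  return text
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase();
}

// Map each folded word to the set of record ids that contain it.
function buildIndex(list) {
  const index = new Map();
  for (const record of list) {
    for (const word of fold(record.title).split(/[^a-z0-9]+/)) {
      if (!word) continue;
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(record.id);
    }
  }
  return index;
}

// Return the sorted ids of records having a word that contains the term.
function search(index, term) {
  const needle = fold(term);
  if (!needle) return [];
  const ids = new Set();
  for (const [word, wordIds] of index) {
    if (word.indexOf(needle) !== -1) {
      for (const id of wordIds) ids.add(id);
    }
  }
  return Array.from(ids).sort(function (a, b) { return a - b; });
}
```

Because the data is structured, the index only has to scan the distinct words of known fields. That scan is linear in the number of distinct words, which is acceptable for a small catalog but is exactly where a large search engine would need something heavier.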

Another problem that is not obvious is multilingual search and result highlighting when using regular expressions. Regular expressions, when you can find a good one, are a very powerful way of modifying text or HTML, but they need exact search terms. So I also had to send the regular expression the accented forms of the search terms for result highlighting. It took me a few days to get this right.
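The highlighting problem can be sketched like this. It is one possible technique, not necessarily the one used here: each letter of the search term is expanded into a character class holding its accented variants, so a term typed without accents still highlights the accented text. The `VARIANTS` table is a small hypothetical sample covering a few French letters, and the sketch assumes the text uses precomposed characters.

```javascript
// Hypothetical, incomplete table of accented variants per base letter.
const VARIANTS = {
  a: "aàâä",
  c: "cç",
  e: "eéèêë",
  i: "iîï",
  o: "oôö",
  u: "uùûü",
};

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// "eco" becomes /[eéèêë][cç][oôö]/gi
function buildHighlightRegExp(term) {
  const folded = term
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase();
  const pattern = Array.from(folded, function (ch) {
    return VARIANTS[ch] ? "[" + VARIANTS[ch] + "]" : escapeRegExp(ch);
  }).join("");
  return new RegExp(pattern, "gi");
}

// Wrap every match in <b> tags, keeping the original accents and case.
function highlight(text, term) {
  if (!term) return text;
  return text.replace(buildHighlightRegExp(term), "<b>$&</b>");
}
```

The function should be given plain text rather than existing markup, otherwise a term such as "li" would also match inside tag names.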

Now search is fast even in all versions of Internet Explorer.

