Wednesday, March 29, 2006

Google Calendar on its way

Also, for people who believe that 8 hours of sleep is necessary, or that less sleep leads to a short life or problems in the near future (go to hell!!). Others, read these:
1) SleepLess
2) Is Getting Less Sleep Better Than using Sleeping Pill?

Wednesday, February 08, 2006

Kosmix - search against Vivisimo

and Google as some are saying.
Kosmix is a world-class search engine that lets people search less and discover more great stuff. There are billions of pages on the web that are useful, but never see the light of day through a standard search engine. It wants to help you find those great pages, and make it easy and fun to do in the process.
Kosmix (currently providing search results only for Health, Travel and Politics) is a search engine built along the lines of Vivisimo (except that the categories into which results are clustered are predefined).
[In the rest of this article, whenever Kosmix is mentioned, it means Kosmix Health.]
The first page of results is quite similar for both of them. While Kosmix only shows results that are highly relevant to a category (sometimes with poor categorization), Vivisimo's clustering algorithm is much better and more dynamic, since all the pages are assigned to "appropriate categories" and the user can drop (neglect ;-)) the irrelevant categories.
Kosmix's categories are predefined; e.g., in the case of health, the category listing is:
  • All Results
  • Basic Information
    • Causes
    • Symptoms
    • Treatments
    • Definition
  • Fitness
  • Message Boards
  • Alternative Medicine
  • Expert Information
    • Journals
    • Clinical Trials
    • Guidelines
    • Case Studies
  • Babies & Kids
  • Blogs
  • Medical Organizations
  • Diet & Nutrition
  • Women's Health
  • Men's Health
Now, each result will (if possible) be placed into one of the above categories, and the categorization is highly accurate with respect to the search term.
The same holds across the whole result set: the results are highly relevant to the search term. With relevance plus good categorization, finding the relevant information becomes quite easy.
Also, two categories that I couldn't find in most of the Vivisimo results were medicines and blogs (the latter is quite understandable, since the blogosphere was not so popular and vast at that time).

But the results of Vivisimo are better in the end. I searched for these terms:
  1. Horner Syndrome - Kosmix provided only 391 results in just 3 categories, while Vivisimo returned 81,339 results with categories including congenital, pain, diagnosis etc.
  2. Dysentery - Kosmix returned 1,823 results, the categories being basic info, alternative medicine, message boards, fitness and expert info, while Vivisimo returned 362,400 results, categorizing them into different types of dysentery, conditions, outbreak, definition etc.
Also, the definitions for the searched terms in Kosmix always come from fixed URLs (as it seemed to me), some being:
which are highly technical in medical terms, while the Vivisimo results even contain definitions from (if possible, of course), which makes them easy to understand even for a layman.

My conclusion: Kosmix is better only in terms of the relevancy of the documents returned, and when one wants to study a topic in a properly organized way, which is how it presents its results. Otherwise, specific information can just as well be looked up in Google :-)

Tuesday, January 31, 2006

Categories in Blogger -- labelr

labelr (though a beta release) is here ..
Finally, you can have categories in Blogger too. Amit has developed the labelr application, which lets you integrate categories into your Blogger template with nearly no effort ("nearly" because you've got to add some lines somewhere in your Blogger template :-) ).

Once registered, you can add any number of categories to your blog. You can also categorize your existing posts into the new categories. And whoa!! You are done. You can look at the sample cases either here or at Amit's site.

labelr is easy to use and even provides categorized RSS feeds. Unlike WordPress, labelr provides an Ajax-enabled, quick view of the posts in a given category, making going back quite easy ;-)

Check out this much-awaited, necessary feature for Blogger. Since labelr is in its beta release, to use it you gotta contact here.

Monday, January 16, 2006

Network Traffic Analysis

Some time back I read the book "Intrusion Signatures and Analysis" (by Mark Cooper, Stephen Northcutt, Matt Fearnow, Karen Frederick). I was very much impressed by the approach the authors have devised, which addresses crucial loopholes in typical analysis techniques. They divide the work into the following parts:
  1. Probability the source address was spoofed.
  2. Description of attack
  3. Attack Mechanism
  4. Correlations
  5. Evidence of Active Targeting
  6. Severity
  7. Defence Recommendations
There are plenty of examples, with analysis of the most crucial threats. A must-read for network security guys.
Some time later, I was thinking that one could even run an IDS engine over these traffic dumps to get alert data and then work on the alerts. BUT with the complete Snort ruleset, the alert log file is so huge that the situation is as cumbersome as analyzing the traffic dump itself [with the additional chance of missing new attacks, if any, since Snort is a pattern-matching engine]. Some say: why not use correlation engines to reduce the amount of work to be done on the alert data? But what about the attacks/intrusions that are new (not caught by Snort; of course they are very few, but new vulnerabilities always keep coming :-()?
Of course there are anomaly detection tools like Lancope etc. (even Snort has a preprocessor plugin for that, namely Spade), but some issues exist with these tools too.

Recently, I read the article "Structured Traffic Analysis" by Richard Bejtlich in (IN)SECURE magazine (October 2005 issue). The article is simply superb, describing a 13-step procedure to analyze a traffic dump using a lot of "simple" open-source tools, including tcpdstat, argus etc. These steps mostly involve generating traffic statistics from various perspectives, including protocol distribution, total number of packets, session analysis, IP information etc. In the last (optional) step, Snort is used for further analysis of alerts.
But what was really nice about it is that it is an approach of "unsupervised anomaly detection" with simple tools in simple steps. Maybe you can add more tools like tcptrace etc. to get more information, but at the abstract level this is a kind of "offline (passive) traffic analysis" to detect anomalous traffic in a captured dump. Anomaly detection techniques apply machine learning/data mining approaches to the traffic dump, computing stats for a "feature set" from the data. Commonly used "features" are:
  1. no. of distinct sessions created in a time window.
  2. protocol distribution
  3. no. of ack/syn packets
etc. There's a large list of features that can be found in research papers. The author here has analyzed a few of them very simply, with easily available, (mostly) well-known tools.
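
To make the feature list above a bit more concrete, here is a minimal Python sketch that computes the three features per time window. It is not taken from the article: the packet tuple layout, the `PACKETS` sample and the `window_features` name are all assumptions for illustration, and a "session" is simplified to an unordered host pair plus protocol (no ports).

```python
from collections import Counter, defaultdict

# Hypothetical packet records: (timestamp_seconds, src, dst, protocol, tcp_flags).
# In practice these would come from a parsed traffic dump; this layout is an
# assumption made for illustration only.
PACKETS = [
    (0.1, "10.0.0.1", "10.0.0.9", "tcp", "S"),
    (0.2, "10.0.0.9", "10.0.0.1", "tcp", "SA"),
    (0.3, "10.0.0.1", "10.0.0.9", "tcp", "A"),
    (1.5, "10.0.0.2", "10.0.0.9", "udp", ""),
    (61.0, "10.0.0.3", "10.0.0.9", "tcp", "S"),
    (61.2, "10.0.0.4", "10.0.0.9", "tcp", "S"),
]


def window_features(packets, window=60.0):
    """Per time window: distinct sessions, protocol distribution, SYN/ACK counts."""
    windows = defaultdict(
        lambda: {"sessions": set(), "protocols": Counter(), "syn": 0, "ack": 0}
    )
    for ts, src, dst, proto, flags in packets:
        w = windows[int(ts // window)]
        # Simplification: both directions of a host pair count as one session.
        w["sessions"].add((frozenset((src, dst)), proto))
        w["protocols"][proto] += 1
        if "S" in flags:
            w["syn"] += 1
        if "A" in flags:
            w["ack"] += 1
    return {
        idx: {
            "distinct_sessions": len(w["sessions"]),
            "protocols": dict(w["protocols"]),
            "syn": w["syn"],
            "ack": w["ack"],
        }
        for idx, w in sorted(windows.items())
    }


if __name__ == "__main__":
    for idx, feats in window_features(PACKETS).items():
        print(idx, feats)
```

A window with many SYNs and no ACKs (like the second window in the sample) is the kind of thing that would stand out when comparing windows against each other.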

Friday, December 09, 2005

Google - Past, present and future

Google (1998):




Source: [last 2 images]
(edited 22 Jan, 2006)

Monday, November 21, 2005

Issues with pattern matching in Network Intrusion Detection Systems

Beyond the choice of pattern matching “algorithm”, there are a lot of other issues that also need to be considered before choosing one. Of course, fast matching is the natural requirement, but other issues must be kept in mind, like fighting false positives: for example, a payload may contain a pattern for a buffer overflow attack via the telnet application protocol, but what if there was no active telnet session between the two hosts? Another issue: what if the pattern is split over multiple packets? Some of the issues with respect to the choice of algorithm, and the limitations of signature matching, are listed below.

  1. Memory vs Speed
  2. Signature format
  3. Session-Based and Application Level Signature Matching
  4. State-holding issues when a pattern extends over multiple packets
  5. Packet Fragmentation Issues.
  6. Getting packet dumps or testing data set? (other than attack tools and DARPA set.)

One always needs to compromise between memory requirements and speed. As we can see in the existing algorithms themselves, Aho-Corasick matches in time linear in the input length (amortized O(1) work per input byte, regardless of the number of patterns) but requires quite a lot of memory to store the state machine, while other string matching algorithms such as Boyer-Moore can degrade to O(mn) time under algorithmic attacks. One must trade one off against the other depending on the constraints.
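
To see where the memory goes, here is a small, unoptimized Python sketch of an Aho-Corasick matcher. It is only an illustration (the function names are mine, and real IDS engines use much more compact table layouts): every trie state carries its own transition table, a failure link and an output list, and that is the price paid for scanning the payload in one pass no matter how many patterns are loaded.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto / failure / output tables for a list of byte patterns."""
    goto = [{}]   # goto[state][byte] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # patterns recognized on reaching each state
    for pat in patterns:
        state = 0
        for b in pat:
            if b not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][b] = len(goto) - 1
            state = goto[state][b]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states fail to the root, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for b, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and b not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(b, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search(payload, automaton):
    """Return (start_offset, pattern) for every match, in a single pass."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, b in enumerate(payload):
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


if __name__ == "__main__":
    automaton = build_automaton([b"he", b"she", b"his", b"hers"])
    print(search(b"ushers", automaton))
```

The number of states grows with the total length of all patterns, so a ruleset with thousands of signatures means thousands of per-state tables held in memory.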

Most IDSs, with a few exceptions, use byte- or character-based strings as the pattern representation format. This is also needed because the most commonly used algorithms are Boyer-Moore, KMP etc. But if state-machine matching is deployed, regular expressions can provide better patterns that are more informative and more specific to the attack they identify. Besides, most Snort rules contain multiple patterns with different offset and depth values, which can be expressed very well in a single regular expression using basic regex constructs like . and * etc.; [1] provides some examples. Also, Bro's patterns are themselves in regex (regular expression) format.
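
As a rough illustration of that last point, here is a hypothetical signature written with Python's `re` module (this is neither Snort nor Bro syntax, and the signature itself is made up): the payload must start with "USER " and contain "root" within the next 32 bytes. With plain string matching that would take two patterns plus position bookkeeping; as a regex it is one expression.

```python
import re

# Hypothetical signature, for illustration only:
#   - the payload starts with "USER "
#   - "root" appears entirely within the next 32 bytes
# "root" is 4 bytes long, so at most 28 arbitrary bytes may precede it.
SIGNATURE = re.compile(rb"\AUSER .{0,28}root", re.DOTALL)


def matches(payload: bytes) -> bool:
    """True if the payload satisfies both position-constrained patterns."""
    return SIGNATURE.search(payload) is not None


if __name__ == "__main__":
    print(matches(b"USER root\r\n"))
    print(matches(b"USER alice\r\n"))
```

Note that a backtracking regex engine like this one is not a state-machine matcher; the sketch only shows the expressiveness, not the matching speed.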

Then, [7] also discusses stateful packet matching, where the IDS stores information about the context of the traffic between two peers, giving more accurate pattern matching results; but the overhead is massive because of the information that must be stored, specific to the content of the traffic, for a large number of flows. On top of this, one can also do application-level pattern matching to get even better results.

One of the most important issues with IDS systems is state holding, i.e. the amount of information that needs to be stored for each flow passing through it. In the case of pattern matching over individual packets, this is not much of a concern, since it does not even come into the picture. But with the advent of attacks split over multiple packets, pattern matching has turned into packet stream matching: a pattern now needs to be matched across multiple packets, demanding more memory for storing information about session flows and their packets, the partially matched patterns, other flow-specific data structures etc. Although Snort has a preprocessor to counter this issue, namely Stream4, the same questions apply to this plugin too. For how long does the information need to be stored before it is dropped? (It should not happen that the IDS declares a timeout and drops the session information while the destination host still keeps waiting, or vice versa.) And what is the maximum number of sessions that can be stored, given that the information to be stored can vary from flow to flow?
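
The trade-off can be sketched in a few lines of Python. This is a toy of my own, not how Stream4 works: it assumes in-order delivery, keeps only a short tail of bytes per flow so that a pattern split across two packets is still found, and has the two knobs discussed above, an idle timeout and a cap on the number of tracked flows. The class name and parameters are hypothetical.

```python
class StreamMatcher:
    """Match patterns across packet boundaries by keeping a short tail per flow."""

    def __init__(self, patterns, idle_timeout=30.0, max_flows=1000):
        self.patterns = patterns
        # A split match needs at most (longest pattern - 1) earlier bytes.
        self.keep = max(len(p) for p in patterns) - 1
        self.idle_timeout = idle_timeout
        self.max_flows = max_flows
        self.flows = {}  # flow id -> (last_seen, tail bytes)

    def feed(self, flow_id, payload, now):
        """Process one in-order packet payload; return the patterns it completes."""
        # Drop state for flows that have been idle for too long.
        expired = [f for f, (seen, _) in self.flows.items()
                   if now - seen > self.idle_timeout]
        for f in expired:
            del self.flows[f]

        _, tail = self.flows.get(flow_id, (now, b""))
        data = tail + payload
        hits = []
        for p in self.patterns:
            # Only count matches that end inside the new payload.
            start = max(0, len(tail) - len(p) + 1)
            if data.find(p, start) != -1:
                hits.append(p)

        # Store state only if the flow is already tracked or there is room left.
        if flow_id in self.flows or len(self.flows) < self.max_flows:
            self.flows[flow_id] = (now, data[-self.keep:] if self.keep else b"")
        return hits


if __name__ == "__main__":
    m = StreamMatcher([b"/bin/sh"])
    print(m.feed("flow-1", b"GET /bi", now=0.0))
    print(m.feed("flow-1", b"n/sh HTTP", now=1.0))
```

If the second half of the pattern arrives after the idle timeout, the tail is gone and the match is missed, which is exactly the timeout mismatch between IDS and destination host described above.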

Continuing the above discussion, fragmented packets [2], [3], [4], [5] complicate the situation even more, since some new issues come into the picture, like:

  1. Out-of-order arrival of TCP segments
  2. Re-transmitted segments
  3. Overlapping TCP packets hence issues with reassembly
  4. Fragments missing in between, or losing the state of the connection while the connection is still alive
  5. How much data should be buffered (TCP window)
  6. Varying TTL of the fragments for evasion of NIDS. If the NIDS believes a packet was received when in fact it did not reach the end-system, then its model of the end-system's protocol state will be incorrect. If the attacker can find ways to systematically ensure that some packets will be received and some not, the attacker may be able to evade the NIDS.

The authors of [6] examined the character and effects of fragmented IP traffic as monitored on highly aggregated Internet links. They show the amount of fragmented packets in normal Internet traffic and characterize and classify it by statistics, protocol and application layer. They show that “fragmented packet” traffic on Internet links is less than 1%, but there are two caveats: first, they are talking about Internet-level links with good connection speeds, and second, what if the traffic is fragmented specifically for an attack? This raises some new questions beyond the existing ones. For instance, because different operating systems have unique methods of fragment reassembly, if an intrusion detection system uses a single “one size fits all” reassembly method, it may not reassemble and process the packets the same way the destination host does. An attack that successfully exploits these differences in fragment reassembly can cause the IDS to miss the malicious traffic and fail to alert. Many of these issues have been solved heuristically in existing tools, and the papers mentioned above discuss a few of them. Snort even has a preprocessor plugin, Frag2, for most of these issues, with some assumptions: for example, if the next few fragments do not arrive within 30 seconds, the packet is dropped, and one can (or needs to) specify the end host's OS so that OS-specific reassembly is done for that session. Some tools even use bifurcating analysis [5]: if the NIDS does not know which of two possible interpretations the end-system may apply to incoming packets, it splits its analysis context for that connection into multiple threads, one for each possible interpretation, and analyzes each context separately from then onwards. Some other methodologies are also discussed in the same paper.
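
A tiny Python sketch shows why the reassembly method matters. The two policies here ("first" keeps the bytes that arrived first, "last" lets later fragments overwrite) are deliberately simplified stand-ins of my own; real operating systems differ in more detailed ways, as described in [4]. The `reassemble` function and the sample fragments are hypothetical.

```python
def reassemble(fragments, policy="first"):
    """Rebuild a payload from (offset, data) fragments given in arrival order.

    policy="first": bytes already received are kept when a later fragment overlaps.
    policy="last":  later fragments overwrite earlier bytes.
    """
    size = max(off + len(data) for off, data in fragments)
    buf = bytearray(size)
    filled = [False] * size
    for off, data in fragments:
        for i, byte in enumerate(data):
            pos = off + i
            if policy == "last" or not filled[pos]:
                buf[pos] = byte
                filled[pos] = True
    return bytes(buf)


if __name__ == "__main__":
    # Two overlapping fragments: bytes 2..3 are sent twice with different content.
    frags = [(0, b"AAAA"), (2, b"BBBB")]
    print(reassemble(frags, "first"))
    print(reassemble(frags, "last"))
```

If the IDS uses one policy and the destination host the other, the two see different payloads from the same fragments, so a signature can match on one side and not on the other.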

Then, one of the major issues we have come across is testing the existing approaches. The MIT DARPA datasets exist, but there are two issues with them: first, they contain very few attacks, and second, they date from 1998-99, and attack technology has advanced a lot since then. Even the attack tools are too specific, producing individual attacks rather than generic traffic with attack packets in between. Recently, [7] designed a new tool for IDS testing, namely AGENT, which besides producing “pattern strings” also generates other types of traffic like those described in [5]; but then again, it is also synthetic.

[1] Sommer, R., Paxson, V.: Enhancing Byte-Level Network Intrusion Detection Signatures with Context. In: Proceedings of the 10th ACM conference on Computer and Communication Security, Washington, DC (2003) 262-271.
[2] C. A. Kent and J. C. Mogul. Fragmentation Considered Harmful. Computer Communications Review — Proceedings of SIGCOMM’87, 17(5):390–401, August 1987.
[3] Thomas H. Ptacek and Timothy N. Newsham. Insertion, evasion, and denial of service: Eluding network intrusion detection, January 1998.
[4] Judy Novak, Target-Based Fragmentation Reassembly, WhitePaper from Sourcefire Inc., April 2005.
[5] Mark Handley, Vern Paxson, and Christian Kreibich. Network Intrusion Detection: Evasion, Traffic Normalization, and End-to-End Protocol Semantics. In Proceedings of USENIX Security Symposium, August 2001.
[6] Colleen Shannon, David Moore, K. Claffy. Characteristics of fragmented IP traffic on internet links. Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement, 2001.
[7] Shai Rubin, Somesh Jha, Barton P. Miller: Automatic Generation and Analysis of NIDS Attacks. ACSAC 2004: 28-38

Tuesday, October 11, 2005

search engines

uff.. not again!!
Search engines: one of the most discussed topics when it comes to the internet :p. And yes, I am going to discuss them again, along with the state of the art.

Google was not the pioneer here, but it is still a pioneer in organizing the web so that one can search and get what (s)he wants. Yahoo, MSN and Ask Jeeves are other big names in the same field. With the ever-increasing growth of the internet, every one of them is trying to cope with the large number of web pages built every day. Everyone is building more tools and incorporating more features, like toolbars etc., into their search facility to attract more users.

In this race, Yahoo said some time back that it searches around 19 billion pages, i.e. around 2.5 times the number Google shows on its homepage. In reply, Google removed the number of pages it queries from its main page and said that what matters is the quality of results, not the quantity (and even if one considers quantity, one can always compare the number of results returned by Yahoo vs Google).
MSN, on the other hand, is trying to beat Google in the way we discussed earlier, and talks are even going on. Here is a link to an earlier article (I lost a newer one :( )

Meanwhile, the big players in this market are fighting to attract more users in these _useless_ ways rather than trying to improve the quality of search results and give users what they want, not what the engines want to give. Apart from some context-based optimizations (or SEO), I have not read any article in a long time saying that any of them is working on the "invisible web", or even trying to find ways to access or crawl it, or that anyone has some cutting-edge technology for context-based search. There are some new startups in the search field providing context-based search results, but the result list is so small that one cannot rely on them fully.
Vivisimo, which has been around for a long time, is an automatic categorization tool that organizes search results into meaningful categories without requiring any preprocessing of documents. It did not get much success because of the small number of results returned to the user. The categories formed here more or less depend on the various meanings of the search word and/or the various attributes the search phrase may have.
Then there is KartOO, a Flash-based online search tool that provides a new and interactive kind of experience, dividing the search results into various categories. The interesting part is that the categories are formed based on the categorization of the results rather than on predefined taxonomies or meanings.
Grokker is another engine in the market; it uses Yahoo's search power with Java technology to categorize the search results into small circles inside a big circle (representing the whole set). The small circles represent subsets of results belonging to a category. The results here are a combination of the above two approaches, and the process is sort of recursive when you enter one of the categories you choose.

The problem with the last two is that they provide very few results, so they can be used only when one wants specific information about some topic or field.

But there is a new search engine in the market, namely Wink, which is a folksonomy-based search engine. Folksonomy (tagging, in layman's language) is a current approach to categorization that is seeing high acceptance in the web 2.0 era. Besides using Google's search APIs, Wink allows users to tag and rank the search results as they choose, and then, when someone else searches for a "similar" keyword, these user-driven results are shown alongside the Google results.
You can even rank (0-5) and tag the results. I think these options used to be there in early search engines too. Anyway, nice work and a cool interface.