|Proposed by Chow Zheng Wei (profile, biography) Don't forget to submit this proposal to official Google Melange site too!
This project aims to implement a fast and intelligent search engine in Smalltalk to answer relevant categories, classes, methods, source code, comments and global variables to queries made by the programmer. The search engine consists of two main components: indexing and searching. Ranking is done using the Google Page Rank method but specialized for the Smalltalk programming language. A single browser to unify all searching needs within Smalltalk will be made.
Demo on Search Indexing of Smalltalk image videos:
Search Indexing of Smalltalk Image Part 1
Search Indexing of Smalltalk Image Part 2
Search Indexing of Smalltalk Image Part 3
How will I do that project?
- I will contact the project mentors everyday for at least 15 minutes to ensure everything goes according to the timeline or better. We will use Skype to talk and share screens. My aim is: It all comes down to “doing whatever it takes” to implement the search engine. I feel very motivated for this project because of my past work with the mentors.
- Each time I have a new idea or concept, I will start it with a small area. This is to ensure that the idea can work smoothly when it goes big. Also, this will save a lot of time when we are debugging the code. Smalltalk is very good for debugging of live code.
- I will spend sufficient time on research phase and write down clear specifications of a feature before jumping right into code development because I believe this will greatly reduce the time spent on re-writing the specs and again re-developing the code since the specifications have changed.
- Tackling stuck - it is normal to get stuck while programming. The following are my little strategies to tackle with stuck. If it is a small stuck I will motivate myself to solve it, bigger stuck, and I will seek help from forums. If it is a conceptually level stuck, which means I-don't-even-know-what-to-ask type of stuck, I will consult my mentor. Besides, I will actively participate in forums to widen my understandings, and to help out fellow Smalltalk programmers.
What methodologies will I use?
An information retrieval system (IR) or search engine consists of two main components:
1) The Smalltalk categories, classes, methods, source codes, comments and global variables will be indexed. Given a search term, the relevant categories, classes, etc. can be found quickly.
2) To present the results more intelligently, the Google Page Rank Algorithm will be applied to the method index and source code index to define the popularity of the code. The formula is
PR(A) = (1-d) + d( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )
PR() = Page Rank of
d = damping factor
A = an implementor of a selector
T1 = a sender of selector A. There are n of them.
C = number of method sends in the sender method T.
By applying this formula, it actually describes the popularity of the method. As the number of senders increases, the importance of the method increases as well. The PageRank will be saved inside the method index and source code index.
1) The implementation of the search engine will be started on single term query. Words that used frequently such as self, nil, super and true will be filtered out to improve the speed of the search.
2) An Information Retrieval System Score which is also known as IR-Score, is a numeric score computed by the search engine. The IR-score is computed based on how well each objects in the database (index) match the query. The IR-score for now is computed based on the frequency weight of the query inside the method or source code.
3) The results will be listed in order of the Final Rank. The Final Rank will be the combination of IR-score and PageRank of the results.
4) For multiple term queries, by comparing the position vector between the queries inside the method or source code, a position vector weight will be computed. As the position of the queries are closer, the position vector weight increases. This position vector weight will then be added into the IR-score.
Suggested timeline and milestones
- May 21 – June17: Begin to develop the indexing, and PageRank of Smalltalk Image with the help from mentors and ESUG forum. The indexing and PageRank will be started with smaller size of the class to ensure that the concept is correct. As the size of the Smalltalk Image is large, this requires around 3~4 weeks to research and develop the indexing and PageRank code on the Smalltalk image.
- June11 – July 15: Begin to code for the IR-score computation on single term query. Then, the IR-score will be combined with the PageRank.
- July 15 – July 22: Implement multiple term queries into the search engine.
- July 23 – August 13: Fixing Bugs and writing documentation.
This timeline can be changed according to progress and mentors suggestions.
Where I see the risks
a) It is possible that the indexing duration might take very long time.
b) If every time a new code is added into the image, the updating duration might be time consuming.
c) The number of iterative computation of PageRank for each method in order to obtain accurate PageRank is still an unknown for Smalltalk Image.
How the results will look like
Three videos on the draft of the search engine have been made and below are the links:
The video in Part 1 and 2 explain the brief algorithm on ranking the results. The algorithm will be improved later on. Part 3 of the video explains the functionality of the buttons in the search engine.
Here are some brief descriptions on the search engine including a screenshot:
a) System categories, classes, selectors, codes, comments and global variables will be indexed and ranked according to the relevancy.
b) A GUI of the search engine will be developed to display the results above. Figure 1 shows the draft GUI of the search engine.
Figure 1: Draft GUI of the search engine
For more information, please visits the 3 links provided earlier.