Toomre Capital Markets LLC

Real-Time Capital Markets -- Analytics, Visualization, Event Processing, and Intelligence

Initial Discussions about Sample Summary of Google Image Search Results

This is being cross-posted to both the Toomre Capital Markets website and to Lars Toomre's personal website.

===========================================================================

Earlier this week, Lars Toomre broke up an LT post about Adding Images to Lars Toomre Website. Previously, buried at the bottom of that post, there was a hand-crafted HTML table summarizing what information currently was in Google Image Search for various combinations of site domain restrictions, safe search parameters and some typical search terms. That table is now presented in the LT post Sample Summary of Google Image Search Results and repeated here for easier reference.

Google Image Search (Row Title plus Safe Search Type plus Column Title) — Nov. 7, 2010 about 2 PM
Website Domain              Safe Search  (none)  Toomre  Lars  Kyra  Mary  Blonde  Blonde Mary
site:toomre.com             Strict          157     162   182    36   185     115          142
                            Moderate        188     178   182    21   170     124          142
                            Off             188     162   182    36   185     115          142
site:lars.toomre.com        Strict          112     117   111    30   134     143          116
                            Moderate        145     117   132    30   134     143          136
                            Off             145     136   132    30   134     127          136
site:larstoomre.toomre.com  Strict           89      83    89     7    79      68           61
                            Moderate         89      89    89     7    70      68           54
                            Off              89      89    83     8    70      68           54
site:www.toomre.com         Strict           10      10    10     0     0       1            0
                            Moderate         10      10    10     0     0       1            0
                            Off              10      10    10     0     0       1            0

The public presentation of these search results has already resulted in several stimulating conversations heading off in numerous different directions. Let me try to summarize some of the issues that have been raised thus far and the broad categories that they might be grouped into. Each of these categories probably will receive additional attention in the coming days. They include:

  1. Presentation of the Original Table

    Several people commented on the poor default rendering of the HTML table as seen in their web browsers. (Much of this is attributable to minimal CSS definitions for rendering a table in the current Drupal theme.) Comments included:

    • The table title should have different formatting and be centered.
    • One person suggested that the table itself should be indented to better draw focus to its details.
    • Another person argued that the columns with data beneath them should be of uniform width.
    • "The vertical spacing (between the start of the table and the paragraph above it) sucked!!", exclaimed another.
    • The lack of cell lines (at least indicating the domain breaks) made row comparisons difficult.
    • The main column header lacked a heavy bottom line.
    • The left side of the first column of results data [the column headed (none)] lacked any line whatsoever, reducing focus on the data values.
    • Perhaps the cells of results data should have a light outline around each cell?
    • At minimum, the results data should include a border around it to highlight those values.
  2. Usability of the Original Table
    • One person suggested that, to facilitate understanding, the Safe Search rows should be collapsible, so that one could look at, say, only the Moderate Safe Search values for each of the domains. Of course, they also wanted the option to fully expand the table to see all of the data if desired.
    • Another suggestion was to wrap an HTML link around each of the result values so that the user could jump to the Google Image Search page from which the particular result value was obtained.
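
    The linking suggestion could be sketched roughly as follows. This is only an illustration: the endpoint in SEARCH_URL and the values in SAFE_VALUES are assumptions that do not come from the original table and would need to be checked against how Google Image Search actually behaves; the function names are hypothetical as well.

```python
from html import escape
from urllib.parse import urlencode

# ASSUMPTIONS (not from the post): the endpoint and the values of the
# "safe" parameter are illustrative and must be verified before use.
SEARCH_URL = "https://www.google.com/images"
SAFE_VALUES = {"Strict": "active", "Moderate": "images", "Off": "off"}


def search_url(domain, mode, term=""):
    """Build the query URL for one cell: row title + Safe Search type + column title."""
    # The "(none)" column corresponds to an empty term.
    query = f"{domain} {term}".strip()
    return SEARCH_URL + "?" + urlencode({"q": query, "safe": SAFE_VALUES[mode]})


def linked_cell(domain, mode, term, count):
    """Return a table cell whose value links to the matching search page."""
    # Escape the URL so the ampersand is valid inside an XHTML attribute.
    href = escape(search_url(domain, mode, term), quote=True)
    return f'<td><a href="{href}">{count}</a></td>'


if __name__ == "__main__":
    print(linked_cell("site:toomre.com", "Moderate", "Lars", 182))
```

    Keeping the URL construction in one function means the same code could later drive both the links and any automated collection of the totals.
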
  3. Discussion about the Original Data Values

    There was quite a bit of discussion about the validity of individual cell values in comparison to others in the table. For instance, reading across the row of values for the Moderate option for the toomre.com domain, one would expect that having no additional search term would produce more results than any search that was further constrained. That indeed was the case. Similarly, within each domain's group of Safe Search options, one would expect the most results in the 'Off' category; the other two categories further constrain the search and so should return as many or fewer results. For each of the domains with no search terms, this indeed was the case.

    However, how does one explain that adding a search term like 'Toomre' or 'Lars' sometimes yields more results than specifying no term at all? Or why, for a particular domain and search term, does the Strict mode sometimes give more results than the suggested Moderate mode? It was quite difficult to come to definitive conclusions about the internal consistency of data that had been gathered by hand.
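
    The two expectations discussed above can be checked mechanically. The following is a minimal sketch, not a definitive analysis: the counts are copied from the table in this post, while the function names and data layout are made up for illustration.

```python
# Result counts copied from the Nov. 7, 2010 table in this post.
TERMS = ["(none)", "Toomre", "Lars", "Kyra", "Mary", "Blonde", "Blonde Mary"]
MODES = ["Strict", "Moderate", "Off"]  # most to least restrictive

COUNTS = {
    "site:toomre.com": {
        "Strict":   [157, 162, 182, 36, 185, 115, 142],
        "Moderate": [188, 178, 182, 21, 170, 124, 142],
        "Off":      [188, 162, 182, 36, 185, 115, 142],
    },
    "site:lars.toomre.com": {
        "Strict":   [112, 117, 111, 30, 134, 143, 116],
        "Moderate": [145, 117, 132, 30, 134, 143, 136],
        "Off":      [145, 136, 132, 30, 134, 127, 136],
    },
    "site:larstoomre.toomre.com": {
        "Strict":   [89, 83, 89, 7, 79, 68, 61],
        "Moderate": [89, 89, 89, 7, 70, 68, 54],
        "Off":      [89, 89, 83, 8, 70, 68, 54],
    },
    "site:www.toomre.com": {
        "Strict":   [10, 10, 10, 0, 0, 1, 0],
        "Moderate": [10, 10, 10, 0, 0, 1, 0],
        "Off":      [10, 10, 10, 0, 0, 1, 0],
    },
}


def term_violations(counts):
    """Cells where a search term returns more results than no term at all."""
    found = []
    for domain, by_mode in counts.items():
        for mode, row in by_mode.items():
            for term, value in zip(TERMS[1:], row[1:]):
                if value > row[0]:
                    found.append((domain, mode, term, value, row[0]))
    return found


def mode_violations(counts):
    """Cells where a stricter Safe Search mode returns more than a looser one."""
    found = []
    for domain, by_mode in counts.items():
        for stricter, looser in zip(MODES, MODES[1:]):
            for i, term in enumerate(TERMS):
                if by_mode[stricter][i] > by_mode[looser][i]:
                    found.append((domain, term, stricter, looser))
    return found


if __name__ == "__main__":
    for violation in term_violations(COUNTS):
        print("term:", violation)
    for violation in mode_violations(COUNTS):
        print("mode:", violation)
```

    Run against this snapshot, the first check flags seven cells, all in the Strict rows of toomre.com and lars.toomre.com, and the second flags eight cells spread across three of the four domains.
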

  4. Other Desired Search Terms

    Lars rather randomly selected what additional search terms to use to further constrain the Image Search results. Google Analytics data suggests that three frequent search terms for the Lars Toomre website are 'Steven Covey', 'Elizabeth Gilbert' and 'Louise Glover'. The question was posed about what results would occur if search variations on these names were used instead. In short, how easy would it be to change the search refinement terms?

    Since all of the original work was done by hand, the answer was that it would take considerable effort. The suggestion then was made to consider how the data collection and table generation tasks could be automated. Both were good suggestions but, like any programming task, automation will take some time before it produces reliable results. Is it worth the time now to figure out how to automate such tasks? And is the automated gathering of Google Image Search totals even permissible?

  5. Missing Reference Data for Better Analysis

    Another person suggested that other reference data was missing if one wanted to do a more complete analysis. For example, how many images in total are available for viewing by anonymous users? Each image is associated with one or more 'nodes', to use Drupal terminology. Most of the posts with images have been posted by Lars Toomre. Hence, one would expect the result counts for searches restricted with either 'Lars' or 'Toomre' to be fairly close.

    It was suggested that we drill down on what is selected with the 'Kyra' or 'Mary' restrictions. How many nodes in a particular domain have both an image and say the term 'Kyra' in them? Are the Google Image Search algorithms mainly focusing on what is produced in that particular node or are they also attaching tags for taxonomy terms and what relatively static information might appear elsewhere on the page (like side bars)?

    It was also pointed out that Google Image Search returns the top results for a particular search. Could one collect what those top search results are and then compare them to what is available on the Toomre website servers? For instance, how many times (and for what reason if known) has the first ranked image been served up as part of a page display? Was it ranked first because of a high number of searches looking for a particular term? Or was it first because it was placed as part of the display of most page presentations (like a logo or user photograph)? In short, for the relatively small data sets being returned from Google Image Search, what impressions can be derived about how one image from a search is ranked for importance vis-à-vis another from the same domain for the same search query? And how does this algorithmically calculated order compare with the order an 'expert' user of the same website would choose?

  6. Update of Snapshot Data Values

    Perhaps the explanation for the table's internal inconsistency lies in the fact that Google's various image search indices might be updated at different times. Hence, the variations might be due to when this particular snapshot of data was gathered. This is a typical stale data problem that often occurs in dynamic real-time systems. The question was asked how difficult it would be to take another snapshot of results data and prepare a second (or third or fourth …) results table.

    The short answer is that preparing the original data table was quite labor intensive. To produce more snapshot tables, some effort certainly will need to be devoted to automating the transformation of "raw" data into HTML table code. This is one of the great possibilities of using a flexible Content Management System like Drupal. However, time needs to be devoted to writing the custom PHP code that takes arrays like column headers, row headers and results data and transforms them into valid XHTML code that then can be styled with CSS.
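
    To give a sense of the transformation involved, here is a minimal sketch. It is written in Python purely for illustration (on the website this would be custom PHP inside Drupal), and the function name, the argument layout and the CSS class name are all hypothetical.

```python
from html import escape


def render_table(caption, col_headers, row_groups):
    """Render grouped rows as an XHTML table string.

    row_groups maps a domain label to a list of
    (safe_search_mode, values) pairs, one pair per table row.
    """
    # The class name is a made-up hook for CSS styling.
    lines = ['<table class="image-search-summary">']
    lines.append(f"  <caption>{escape(caption)}</caption>")
    head = "".join(f"<th>{escape(h)}</th>" for h in col_headers)
    lines.append(f"  <thead><tr>{head}</tr></thead>")
    lines.append("  <tbody>")
    for domain, rows in row_groups.items():
        for i, (mode, values) in enumerate(rows):
            cells = ""
            if i == 0:
                # The domain label spans all Safe Search rows of its group.
                cells += f'<th rowspan="{len(rows)}">{escape(domain)}</th>'
            cells += f"<th>{escape(mode)}</th>"
            cells += "".join(f"<td>{v}</td>" for v in values)
            lines.append(f"    <tr>{cells}</tr>")
    lines.append("  </tbody>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    # One domain group taken from the table in this post.
    headers = ["Website Domain", "Safe Search", "(none)", "Toomre", "Lars",
               "Kyra", "Mary", "Blonde", "Blonde Mary"]
    groups = {
        "site:www.toomre.com": [
            ("Strict", [10, 10, 10, 0, 0, 1, 0]),
            ("Moderate", [10, 10, 10, 0, 0, 1, 0]),
            ("Off", [10, 10, 10, 0, 0, 1, 0]),
        ],
    }
    print(render_table("Google Image Search", headers, groups))
```

    Separating the header cells (th) from the data cells (td) and grouping rows by domain gives the style sheet enough structure to address most of the presentation comments listed earlier.
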

  7. Automated Comparison of Table Snapshots

    Another tangent of one conversation got into analytics through time. The question was asked about how difficult it might be to compare the snapshot taken this week with another that might be taken next week. That way one could get a sense of how the Google Image Search totals change through time. The suggestion was made that, for a given set of search restriction terms, one could automatically take and record 'snapshots' of the total results and then generate various statistics from the resulting three-dimensional array of data (snapshot by row by column). Assuming that one could address the automatic collection and presentation generation issues, feeding a difference array to the formatting function should not be too difficult.
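
    A difference array of this kind could be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, and because only one snapshot exists so far, the usage example treats the Strict and Moderate rows for site:toomre.com from the table in this post as stand-ins for two snapshots.

```python
def diff_snapshots(earlier, later):
    """Element-wise change (later minus earlier) between two snapshots.

    Each snapshot is a list of rows; both must have the same shape.
    """
    if len(earlier) != len(later):
        raise ValueError("snapshots have different numbers of rows")
    result = []
    for old_row, new_row in zip(earlier, later):
        if len(old_row) != len(new_row):
            raise ValueError("rows have different numbers of columns")
        result.append([new - old for old, new in zip(old_row, new_row)])
    return result


def cell_ranges(snapshots):
    """Per-cell spread (max minus min) across a list of snapshots."""
    rows = len(snapshots[0])
    cols = len(snapshots[0][0])
    return [
        [
            max(s[r][c] for s in snapshots) - min(s[r][c] for s in snapshots)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


if __name__ == "__main__":
    # Stand-ins only: two rows of the existing table, NOT two real snapshots.
    strict = [[157, 162, 182, 36, 185, 115, 142]]
    moderate = [[188, 178, 182, 21, 170, 124, 142]]
    print(diff_snapshots(strict, moderate))
    print(cell_ranges([strict, moderate]))
```

    The output of diff_snapshots has the same shape as a single snapshot, so it could be passed to whatever function formats the results table.
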

  8. Use of Analytical Tools (like MATLAB) on Snapshot Data

    Toomre Capital Markets LLC ("TCM") has done considerable work in the past couple of years with the analytic package called MATLAB. The question was raised about how difficult it would be to incorporate a MATLAB analytic routine into what is presented on the website. For instance, MATLAB has strong built-in graphing capabilities. Could one run an analytic routine whose graphic output might be displayed on the same page as the current snapshot of results?

    Lars chuckled at this question. Recently for a hedge fund client, TCM completed an engagement where internal MATLAB analytics were made accessible through a webpage interface. Clients of that firm now can log into a secure website and enter certain data (like a date range); both graphical and tabular results are then returned to their web browser, based on other data stored in secure SQL databases and on various custom MATLAB algorithms. In short, those clients are running MATLAB analytics through a web browser.

    We certainly could engage in a similar effort to perform MATLAB analyses on this result data through time. However, first we would need to address a number of technical issues like data collection, data storage, creation of the MATLAB analytic functions and finally the webpage input/output error checking and display generation dynamics. For such a relatively trivial problem as total images found in Google Image search for a particular query, it hardly seems worthy of the effort involved.

    However, one can easily envision how relatively static criteria might be dynamically updated in real-time through such a webpage. For instance, risk analytics like net long, risk equivalents, profit & loss, percentage turn-over, sector quantification and/or aging distribution across a firm's collection of client portfolios are a good example of where such an effort is well worth consideration. Presumably, the list of client portfolios does not change much from one period to the next. Hence, each portfolio might be one row in a table like this one and each risk value might be represented by a column.

    If one has more questions on this topic, please contact Lars Toomre or Aldon Hynes for a more detailed discussion about your particular problem and the issues involved in it.

  9. Initial Conclusions

    Lars has been quite impressed by the detailed queries that resulted from the publishing of this original table. Certainly, more 'snapshots' need to be recorded by hand to better understand some of the inconsistencies in the original results data. Also, exactly what question we are trying to answer needs to be settled before one starts developing even preliminary data structures and analytic code. An argument can be made that some effort should be made now to help automate the creation of display tables. That would facilitate both ad hoc querying and the display of two data sets for at least visual comparison.

    As per usual, further comments and queries are welcome!