The awesome people over at Android Software were kind enough to help me out with a problem that’s been bugging PackRat for a while now: despite my best efforts, some barcodes you scan won’t be recognized.
Thanks to those guys’ help, I managed to get hold of some such barcodes, which meant I could test exactly where the barcode recognition fails, since there are several steps involved. Seeing where the failure occurs then led me to develop a workaround, which will hopefully mean better search results for you.
In this post, I’d like to explain the problem and solution in more detail — if that sort of thing bores you, you can skip the rest of this post, and just keep in mind that the upcoming version 1.2.0 of PackRat should be better at recognizing your stuff!
There are several steps involved in recognizing that a barcode belongs to, say, a film. Say that film is one of my own favourites, and apparently one that at least one user at Android Software liked, Shaun of the Dead. The following are the steps PackRat performs to recognize this connection:
- Take a picture of the barcode.
- Process that picture to enhance contrast and allow recognition of the barcode pattern.
- Recognize the barcode pattern. Product barcodes contain numbers only; in most barcode formats, those are the numbers commonly printed above or below the pattern.
- Fetch information from Google Base related to this number. Because these numbers can be different types of product codes (UPC, EAN, ISBN, etc.), several searches are performed.
- Display the results of those searches.
The first three steps are mostly out of my control. For barcode recognition, PackRat uses the Zebra Crossing (ZXing) library, which pretty much spits out product codes, and even tells you the type of product code. But the library’s use of terminology is a bit different from my own understanding of the standards involved, and slightly different again from what Google Base seems to think, so I can treat the product code type only as a guideline.
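To illustrate why one scanned number can map to several product code types, here is a minimal Python sketch. It is not PackRat’s code. It relies only on the published standards: a 12-digit UPC-A is the same number as an EAN-13 with a leading zero, and an EAN-13 starting with 978 is also an ISBN-13 that can be converted to an ISBN-10. The function names and the `"upc"`, `"ean"` and `"isbn"` labels are my own, picked for this example.

```python
def ean13_check_digit(first12: str) -> int:
    """Compute the EAN-13 check digit from the first 12 digits."""
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def isbn10_from_isbn13(isbn13: str) -> str:
    """Convert a 978-prefixed ISBN-13 to its ISBN-10 form."""
    body = isbn13[3:12]
    # ISBN-10 uses weights 10 down to 2 and a modulo-11 check digit.
    total = sum(int(d) * (10 - i) for i, d in enumerate(body))
    check = (11 - total % 11) % 11
    return body + ("X" if check == 10 else str(check))


def candidate_codes(digits: str) -> list[tuple[str, str]]:
    """List the plausible interpretations of a scanned number.

    The type labels are hypothetical names for this sketch, not the
    terms used by ZXing or by Google Base.
    """
    # A 12-digit UPC-A is an EAN-13 with a leading zero.
    ean = "0" + digits if len(digits) == 12 else digits
    if len(ean) != 13 or not (ean.isascii() and ean.isdigit()):
        return []
    if ean13_check_digit(ean[:12]) != int(ean[12]):
        return []  # checksum mismatch, probably a misread

    candidates = []
    if ean.startswith("0"):
        candidates.append(("upc", ean[1:]))
    candidates.append(("ean", ean))
    if ean.startswith("978"):
        candidates.append(("isbn", ean))
        candidates.append(("isbn", isbn10_from_isbn13(ean)))
    return candidates
```

A book barcode such as `9780306406157` yields three candidates here (the EAN, the ISBN-13 and the ISBN-10 `0306406152`), which is the kind of ambiguity that makes several searches necessary.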
As a result, in the next step, PackRat fires off several queries to Google Base, trying the most restrictive interpretation of the product code first, and allowing fuzzier and fuzzier interpretations in each successive search. That sounds more clever than it really is; it’s just a question of finding the right query parameters for Google Base.
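The strict-to-fuzzy idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the query shapes are invented placeholders and not real Google Base parameters, `run_query` stands in for whatever performs the actual search, and stopping at the first query that returns anything is my own simplification.

```python
def search_with_fallback(code, query_builders, run_query):
    """Try each query in order, strictest first.

    Returns the query that produced hits together with its results,
    or (None, []) if no query found anything.
    """
    for build_query in query_builders:
        query = build_query(code)
        results = run_query(query)
        if results:
            return query, results
    return None, []


# Hypothetical query shapes, ordered from most to least restrictive.
QUERY_BUILDERS = [
    lambda code: {"field": "upc", "value": code},
    lambda code: {"field": "any_product_code", "value": code},
    lambda code: {"field": "full_text", "value": code},
]


def fake_run_query(query):
    """Stand-in backend that only matches full-text queries."""
    if query["field"] == "full_text":
        return ["Some matching item"]
    return []


query, results = search_with_fallback("036000291452", QUERY_BUILDERS, fake_run_query)
```

With the fake backend above, the two stricter queries come back empty and the full-text query is the one that ends up supplying the results.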
But the upshot is that PackRat searches Google Base for product codes. And therein lies the problem, and it’s a problem with two sides:
- Google Base is not actually a product database as such. It’s a database anyone can enter stuff into and anyone can query stuff from, but it’s mostly fed by online shops. Consequently, if no shop offers the item you just scanned, the likelihood is that Google Base doesn’t find it.
- These different shops compete with each other; they will enter information into Google Base in a way that they think will lead to more hits on their website. This is not necessarily the best information for PackRat, and, amongst other things, product codes may not always be reproduced faithfully.
It turns out that for the guy complaining on Android Software’s forums that PackRat would not find results for him, that is a fatal combination: Google Base really does not contain any information about the product codes he searched for.
I’ve attached the barcodes for his items below. You can click on the thumbnails to expand them, and then scan them right off the screen. Note that the barcodes have been enhanced in contrast to make for easier scanning.
For your convenience, I’ve also posted Google products links with the barcode contents below. Google products is essentially the public-facing frontend for Google Base, so search results are identical. For comparison, I’ve also posted links to Google web searches with the same search term.
Google searches by search type and keywords

|Item|Google products, by product code|Google web, by product code|Google products, by title|
|---|---|---|---|
|Seven|products (product code)|web (product code)|products (title)|
|Shaun of the Dead|products (product code)|web (product code)|products (title)|
|Warcraft III|products (product code)|web (product code)|products (title)|
The thing that becomes immediately obvious is that for all three product codes, Google web search finds plenty of matching results. Clearly there are online shops that correctly list these product codes on their websites, but for one reason or another, that data does not make it into Google Base.
But what you can also see is that when searching for the items by title, Google Base does contain entries. The solution to making PackRat find stuff appeared to be the following:
- Search on Google web by product code.
- Search on Google products by the titles found in the previous step.
But that again raises the problem of having to interpret what web pages are about, or else any sequence of words could be interpreted as a title.
Luckily Google web search helps out with that problem, in that the results it returns already contain a title element. Granted, web search talks about a web page title, and PackRat is concerned with media titles, but chances are those are fairly similar.
The approach I’m taking in PackRat is to look at all the titles returned by the web search, and count the occurrences of words in those titles. My assumption is that a search for the product code of a movie should result in pages that all contain the movie title in their web page title, in addition to a few other words that are not necessarily present on all pages. As a result, the words with the highest frequency should be the ones to search for.
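Here is a minimal Python sketch of that word-counting idea. It is an illustration rather than PackRat’s actual code: the function name, the `min_share` threshold and the sample page titles are all made up for this example, and I count each word at most once per page title so that a single repetitive title cannot dominate.

```python
import re
from collections import Counter


def guess_title(page_titles, min_share=0.75):
    """Keep the words that appear in at least min_share of the page titles.

    Words are returned in the order they were first seen, so the result
    reads roughly like the original title.
    """
    counts = Counter()
    first_seen = []
    for title in page_titles:
        words = re.findall(r"[a-z0-9']+", title.lower())
        # dict.fromkeys removes duplicates within one title, keeping order.
        for word in dict.fromkeys(words):
            if word not in counts:
                first_seen.append(word)
            counts[word] += 1
    threshold = min_share * len(page_titles)
    return " ".join(w for w in first_seen if counts[w] >= threshold)


# Invented page titles, standing in for web search results.
titles = [
    "Shaun of the Dead (DVD) - Example Shop",
    "Buy Shaun of the Dead on DVD",
    "Shaun of the Dead | Movies at Some Store",
    "Cheap films: Shaun of the Dead",
]
print(guess_title(titles))  # prints "shaun of the dead"
```

Words like “dvd” or “shop” show up in only some of the titles and fall below the threshold, so only the words shared by most pages survive as the search term.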
In theory, that works very well. In practice, there are a few problems, and you can see one of them above: “Seven” is a search term that matches not only the movie of the same name, but many other products as well. The results for “Shaun of the Dead” and “Warcraft III” are vastly more promising. If it recognizes two out of three, that’s not too shabby, eh?
Sadly, though, there is another problem: The API Google provides for accessing web search results returns very, very different results from the actual Google search many of us use regularly. As a result, putting the above technique into action means that out of the three only “Warcraft III” is found.
I’m still quite happy with the result — I was just hoping that things might work even better.
But once 1.2.0 hits the Market, I’d love to hear if you noticed an improvement in practice!