Structure of database tables used in ASPSeek.

1. "wordurl"
This table keeps information about each word in main and "real-time database", 1 record per word.
Fields:
	word		Word itself
	word_id		Numeric ID of word
	urls		Information about sites and urls, where word is encountered.
				Empty if size of info is greater than 1000 bytes; in this case info is stored in a separate file.
	urlcount	Number of URLs where word is encountered.
	totalcount	Total count of this word in all URLs

Last 3 fields are updated after crawling finishes, or when indexer is run with "-D" switch.

2. "wordurl1"
This table keeps all information about each word in "real-time database", 1 record per word.
Fields:
	word		Word itself
	word_id		Numeric ID of word, refers to wordurl.word_id
	urls		Information about sites and urls, where word is encountered. Always not empty regardless of size.
	urlcount	Number of URLs where word is encountered.
	totalcount	Total count of this word in all URLs

Last 3 fields are updated immediately after downloading of the URL if indexer is run with "-T" switch.

3. "urlword"
This table keeps information about all encountered URLs, both indexed and not yet indexed, which match
the conditions specified in configuration files.

Fields:
	url_id			ID of URL.
	site_id			ID of site, refers to sites.site_id.
	deleted			Set to 1 if server returned 404 error.
	url				URL itself
	next_index_time	Time of next indexing in seconds from 1970.
	status			HTTP status returned by server or 0 if document has not been indexed yet.
	crc				MD5 checksum of document
	last_modified	"Last-Modified" HTTP header returned by server.
	last_index_time	Time of last indexing in seconds from 1970.
	referrer		ID of URL which has link to this URL.
	tag				Arbitrary tag.
	hops			Depth of URL in hyperlink tree.
	redir			ID of URL to which current URL is redirected, or 0 if this URL is not redirected.
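
The time fields above (next_index_time, last_index_time) are second counts from 1970. A minimal Python sketch for reading them, assuming they are ordinary Unix timestamps counted in UTC (this document does not state a time zone); the helper name "to_utc" is illustrative only:

```python
from datetime import datetime, timezone


def to_utc(seconds_from_1970: int) -> datetime:
    """Convert a seconds-from-1970 value to a timezone-aware UTC datetime.

    Assumption: the stored value is a plain Unix timestamp.
    """
    return datetime.fromtimestamp(seconds_from_1970, tz=timezone.utc)


if __name__ == "__main__":
    # Sample value chosen for illustration, not taken from a real database.
    print(to_utc(1000000000).isoformat())
```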

4. "urlwordsNN" where NN is 2-digit number from 00 to 15
These tables contain additional info about existing indexed URLs. NN of table is URL ID mod 16.

Fields:
	deleted			Set to 1 if server returned 404 error.
	wordcount		Count of unique words in the indexed part of URL.
	totalcount		Total count of words in the indexed part of URL.
	content_type	Content-Type HTTP header returned by server.
	title			First 128 characters from page title.
	txt				First 255 characters from page body.
	docsize			Total size of URL.
	keywords		First 255 characters from page keywords.
	description		First 100 characters from page description.
	lang			Not used now.
	words			Zipped content of URL.
	hrefs			Sorted array of outgoing href IDs from this URL.
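
The table-selection rule for "urlwordsNN" (NN is URL ID mod 16, written as 2 digits) can be sketched in Python as follows. The function name is hypothetical and is not part of ASPSeek:

```python
def urlwords_table(url_id: int) -> str:
    """Return the name of the "urlwordsNN" table holding data for url_id.

    NN is url_id mod 16, zero-padded to 2 digits (00 to 15).
    """
    if url_id < 0:
        raise ValueError("URL ID must be non-negative")
    return f"urlwords{url_id % 16:02d}"


if __name__ == "__main__":
    for sample_id in (5, 16, 31):
        print(sample_id, urlwords_table(sample_id))
```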

5. "robots"
This table contains information parsed from robots.txt file for each site.

Fields:
	hostinfo		Host name
	path			Path to exclude from indexing.


6. "sites"
This table contains IDs for all indexed sites.

Fields:
	site_id			ID of site.
	site			Site name with protocol, like "http://www.aspstreet.com/"

7. "stat"
This table contains statistics for each completed query.

Fields:
	addr			IP address of computer, from which query was requested.
	proxy			IP address of proxy server, through which query was requested.
	query			Query string.
	ul				URL limit used to restrict the query.
	sp				Web spaces used to restrict the query.
	site			Site ID used to restrict the query.
	np				Results page number requested.
	ps				Results per page.
	sites			Number of found sites matching query.
	urls			Number of found URLs matching query.
	start			Query processing start in seconds from 1970
	finish			Query processing finish in seconds from 1970
	referer			URL of web page from which query was requested.

8. "subsets"
Table describing all subsets, which can be used to restrict the search.
Populated manually with URL masks. Subset is the set of URLs from the particular directory of site.
Putting masks describing whole site is not necessary.

Fields:
	subset_id		ID of subset.
	mask			URL mask. Example "http://www.aspstreet.com/directory/%"
					Examples of wrong use: "http://www.aspstreet.com/%", "http://www.aspstreet/%"


9. "spaces"
Table describing web spaces. Web space is the set of sites.
Each site belonging to particular space must be put into separate record.
Populated manually.

Fields:
	space_id		ID of web space.
	site_id			ID of site belonging to the space, refers to sites.site_id.


10. "tmpurl"
Table describing URLs indexed since start of last indexing. Used for debugging.

Fields:
	url_id			URL ID
	thread			Ordinal thread number, which indexed URL.

11. "wordsite"
Auxiliary table used when search is restricted to site pattern. Built at the end of indexing from "sites" table.

Fields:
	word			Word used in site name between dots.
	sites			Array of site IDs, where this word is encountered.

12. "citation"
This table contains reverse index of hyperlinks.

Fields:
	url_id			URL ID
	referrers		Array of URL IDs, which have hyperlink to this URL.
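
The idea of a reverse index of hyperlinks can be illustrated by inverting a map of outgoing links. The Python sketch below assumes that the outgoing links of each URL (as in "urlwordsNN.hrefs") are already available as lists of URL IDs; the function name and the in-memory dictionaries are illustrative and do not describe how ASPSeek builds the table:

```python
from collections import defaultdict


def build_referrers(hrefs: dict[int, list[int]]) -> dict[int, list[int]]:
    """Invert {url_id: [outgoing URL IDs]} into {url_id: [referrer URL IDs]}."""
    referrers: dict[int, set[int]] = defaultdict(set)
    for source_id, targets in hrefs.items():
        for target_id in targets:
            referrers[target_id].add(source_id)
    # Sorted lists keep the output deterministic.
    return {url_id: sorted(sources) for url_id, sources in referrers.items()}


if __name__ == "__main__":
    # URL 1 links to 2 and 3; URL 2 links to 3.
    print(build_referrers({1: [2, 3], 2: [3]}))
```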



Structure of BLOBs.

1. "wordurl.urls", "wordurl1.urls"

	Sites information
		Offset		Length	Description
		0			4		Offset of URL info for 1st site.
		4			4		ID of 1st site where word is encountered.
		8			4		Offset of URL info for 2nd site.
		12			4		ID of 2nd site where word is encountered.
		...
		(N-1)*8		4		Offset of URL info for Nth site, where N is the total number of sites where word is encountered.
		(N-1)*8+4	4		ID of Nth site where word is encountered.
		(N-1)*8+8	4		Offset of URL info end for Nth site. Must point to the end of blob or file.

	URLs information. Follows sites information immediately. Offsets are counted from 0.
		Offset		Length	Description
		0			4		ID of first URL for first site in sites info section.
		4			2		Word count in this URL
		6			2		First position
		8			2		Second position
		...
		6+(N-1)*2	2		Nth position, where N is the word count stored at offset 4.
		---Repeated with info for URLs with ID greater than previous---
		....
		---Repeated with info for URLs for next sites from sites info section---
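
The layout above can be walked with a short parser. This is a sketch in Python under several assumptions that this document does not state: all integers are unsigned and little-endian, and the offsets stored in the sites section are counted from the beginning of the blob, so that the first offset also marks the end of the sites section. The function name is hypothetical.

```python
import struct


def parse_word_blob(blob: bytes) -> dict[int, dict[int, list[int]]]:
    """Parse a "urls" blob into {site_id: {url_id: [positions]}}.

    Assumptions: little-endian unsigned integers; offsets are absolute
    within the blob.
    """
    if not blob:
        return {}
    (first_offset,) = struct.unpack_from("<I", blob, 0)
    # Sites section is N pairs of (offset, site ID) plus one end offset.
    if first_offset < 12 or (first_offset - 4) % 8 != 0:
        raise ValueError("unexpected size of sites section")
    site_count = (first_offset - 4) // 8

    result: dict[int, dict[int, list[int]]] = {}
    for i in range(site_count):
        start, site_id = struct.unpack_from("<II", blob, i * 8)
        # End of this site's URL info is the next stored offset.
        (end,) = struct.unpack_from("<I", blob, (i + 1) * 8)
        urls: dict[int, list[int]] = {}
        pos = start
        while pos < end:
            url_id, word_count = struct.unpack_from("<IH", blob, pos)
            positions = struct.unpack_from(f"<{word_count}H", blob, pos + 6)
            urls[url_id] = list(positions)
            pos += 6 + 2 * word_count
        result[site_id] = urls
    return result


if __name__ == "__main__":
    # Hand-built sample: site 7 has URLs 100 and 101, site 9 has URL 200.
    sample = (
        struct.pack("<5I", 20, 7, 38, 9, 46)
        + struct.pack("<IH2H", 100, 2, 3, 9)
        + struct.pack("<IHH", 101, 1, 1)
        + struct.pack("<IHH", 200, 1, 5)
    )
    print(parse_word_blob(sample))
```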


2. "urlwordsNN.words"

	Offset		Length		Description
	0			4			Size of URL content before zipping or 0xFFFFFFFF if content is not zipped.
	4			Zipped size	Zipped or original URL content.
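
Reading this field could look like the Python sketch below. It assumes that "zipped" means a zlib stream and that the 4-byte size is little-endian; neither detail is stated in this document. The function name is hypothetical.

```python
import struct
import zlib

NOT_ZIPPED = 0xFFFFFFFF  # marker value from the layout above


def decode_words(blob: bytes) -> bytes:
    """Return the original URL content stored in a "words" blob.

    Assumptions: little-endian size field, zlib-compressed payload.
    """
    (original_size,) = struct.unpack_from("<I", blob, 0)
    payload = blob[4:]
    if original_size == NOT_ZIPPED:
        return payload
    content = zlib.decompress(payload)
    if len(content) != original_size:
        raise ValueError("unzipped size does not match stored size")
    return content


if __name__ == "__main__":
    text = b"sample page content " * 10
    zipped = struct.pack("<I", len(text)) + zlib.compress(text)
    print(decode_words(zipped) == text)
```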


3. "wordsite.sites"
This field contains array of sites/positions for word. Sorted by site IDs.

	Structure of array element:
	Bits		Description
	24-31		Bitmap of positions, highest bit is set to 1 if word is first-level domain.
	0-23		Site ID.
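
Splitting one array element into its parts is a matter of masking and shifting. A minimal Python sketch, assuming each element has already been read as a 32-bit unsigned integer (byte order on disk is not specified here); the function name is illustrative:

```python
def decode_site_entry(value: int) -> tuple[int, int, bool]:
    """Split a 32-bit element into (site_id, position_bitmap, is_first_level).

    Bits 0-23 hold the site ID, bits 24-31 hold the bitmap of positions,
    and the highest bit marks a word that is the first-level domain.
    """
    site_id = value & 0xFFFFFF
    position_bitmap = (value >> 24) & 0xFF
    is_first_level = bool(position_bitmap & 0x80)
    return site_id, position_bitmap, is_first_level


if __name__ == "__main__":
    # Illustrative value: site ID 5, bitmap 0x81.
    print(decode_site_entry(0x81000005))
```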

