Labels
How to load tiger data
(This version is the basic idea for 1 county. Its extendable to more.
Special care has to be given for topology (and other) error.
You should be also checking # of entities loaded as well)
1. get ogr2ogr and gdal (for tigerpoly.pl)
http://fwtools.maptools.org/ is probably the easiest place
2. Unzip and convert to PostGIS
NOTE: compressed (.zip) tiger is approximately 5.5Gb. Uncompressed its about 45Gb.
If all the tiger .zip files are in one directory (bash):
for f in *.zip do echo $f mkdir ${f%.*} mv $f ${f%.*} cd ${f%.*} unzip -q $f cd .. done mv */*.zip ~/original_tiger2005fe
At the end of this you will have about 3000 directories, named like "tgr01033". Inside this directory will be the uncompressed tiger files for that one county.
Next you need to setup the database - we do this with any of the tiger counties.
First make sure you have a tiger2005fe postgis database
mv tgr01001 TIGERFIRSTMODULE cd TIGERFIRSTMODULE ogr2ogr -f "PostgreSQL" PG:"dbname=tiger2005fe user=postgres host=localhost" .
Next, load in each county:
for f in tgr* do echo $f cd $f ogr2ogr -update -append -f "PostgreSQL" PG:"dbname=tiger2005fe user=postgres host=localhost" . cd .. done
NOTE: this will not re-do tgr01001 ("TIGERFIRSTMODULE").
HINT: you can put this is a script and use "nohup" if you're running the program from a remote terminal (ignore this if you dont understand).
3. Create indexes in the database
create index indx_mod_completechain on completechain (module); create index indx_mod_landmarks on landmarks (module); create index indx_mod_altname on altname (module); create index indx_mod_arealandmarks on arealandmarks (module); create index indx_mod_entitynames on entitynames (module); create index indx_mod_featureids on featureids (module); create index indx_mod_idhistory on idhistory (module); create index indx_mod_polychainlink on polychainlink (module); create index indx_mod_pip on pip(module); create index indx_mod_polygoncorrections on polygoncorrections (module); create index indx_mod_tlidrange on tlidrange (module); create index indx_mod_zerocellid on zerocellid (module); create index indx_mod_zipcodes on zipcodes (module); create index indx_mod_zipplus4 on zipplus4(module); create index indx_mod_spatialmetadata on spatialmetadata (module); vacuum analyse;
4. QA/QC check:
a) process the counts.txt files shipped with TIGER:
Use the counts_to_sql.pl script to generate SQL:
cat ../www2.census.gov/geo/tiger/tiger2005fe/*/counts*.txt > all_counts.txt
perl all_counts.pl <all_counts.txt >out.sql
psql -U postgres tiger2005fe
\i out.sql
-- there should be 3232
select count(distinct county) from counts_txt;
This gives a list of count of each feature type for each country. We're going to compare this to the actual number we loaded. If they are the same, our load probably worked!
b) count the actual number of features loaded.
create table qa_qc (module text, filetype char, count int); insert into qa_qc select module,'1',count(*) from completechain group by module; insert into qa_qc select module,'7',count(*) from landmarks group by module; insert into qa_qc select module,'4',count(*) from altname group by module; insert into qa_qc select module,'8',count(*) from arealandmarks group by module; insert into qa_qc select module'c',count(*) from entitynames group by module; insert into qa_qc select module,'5',count(*) from featureids group by module; insert into qa_qc select module,'h',count(*) from idhistory group by module; insert into qa_qc select module,'p',count(*) from pip group by module; insert into qa_qc select module,'i',count(*) from polychainlink group by module; insert into qa_qc select module,'b',count(*) from polygoncorrections group by module; insert into qa_qc select module,'r',count(*) from tlidrange group by module; insert into qa_qc select module,'t',count(*) from zerocellid group by module; insert into qa_qc select module,'6',count(*) from zipcodes group by module; insert into qa_qc select module,'z',count(*) from zipplus4 group by module; insert into qa_qc select module,'m',count(*) from spatialmetadata group by module;
c) compare the actual counts to the expected counts;
-------------------- check -------------- -- we didnt count up featuretypes '2','a','b','c','s' -- and there will not be records in qa_qc where nfeats ==0 --- THIS SHOULD RETURN no rows SELECT county,type,nfeats FROM counts_txt WHERE not(type in ('2','a','b','c','s') ) AND nfeats=!0 EXCEPT SELECT module,filetype,count FROM qa_qc ; -- should return none! (if it does return something, then there was a load error) -- need to do the "opposite" too SELECT module,filetype,count FROM qa_qc EXCEPT SELECT county,type,nfeats FROM counts_txt WHERE not(type in ('2','a','c','s') ); - should return none! (if it does return something, then there was a load error) -- should be 3232 (see above) SELECT count(distinct(module)) from qa_qc; SELECT count(distinct(county)) from counts_txt;
-------------------- check -------------- -- should be 0 (checks for null geometry and geometries that cross each other) -- this takes forever! select module,tlid from completechain where (wkb_geometry isnull) or not(issimple(wkb_geometry)); --should be 0 (no null pips) select count(*) from pip where (wkb_geometry isnull);
| Problem In the above, you may find a problem with module='TGR66010' and tlid = 204575992. |
Here is a list of completechain feature codes (cfcc) and the number of features in each class: cfcc | count ------+---------- A11 | 86549 A12 | 117 A13 | 6399 A14 | 67 A15 | 191106 A16 | 168 A17 | 8222 A18 | 438 A19 | 707 A21 | 780816 A22 | 185 A23 | 1959 A24 | 36 A25 | 173281 A26 | 39 A27 | 769 A28 | 103 A29 | 1255 A31 | 1592487 A32 | 168 A33 | 2188 A34 | 29 A35 | 97673 A36 | 10 A37 | 291 A38 | 175 A39 | 2087 A41 | 29571526 A42 | 1180 A43 | 13096 A44 | 1051 A45 | 83586 A46 | 37 A47 | 306 A48 | 427 A49 | 7704 A51 | 520682 A52 | 130 A53 | 59 A63 | 206622 A64 | 42798 A65 | 489 A70 | 1087 A71 | 31441 A72 | 311 A73 | 30260 A74 | 2087931 B11 | 655592 B12 | 1255 B13 | 10456 B15 | 34 B16 | 21 B19 | 64 B21 | 104685 B22 | 37 B23 | 1409 B31 | 1958 B32 | 4 B33 | 11 B40 | 3 B50 | 12564 B51 | 329 B52 | 65 C00 | 32 C10 | 26579 C20 | 498637 C30 | 4265 C31 | 1342 C32 | 597 D51 | 25938 D54 | 23 E00 | 92 E10 | 6805 E20 | 3544 E21 | 14630 E23 | 6 E24 | 621 E27 | 7 F10 | 2089554 F11 | 70372 F12 | 38049 F13 | 107965 F14 | 221 F15 | 301 F16 | 56576 F17 | 29752 F18 | 11692 F19 | 281 F20 | 963 F21 | 275890 F22 | 80232 F23 | 455150 F24 | 977 F25 | 3 F30 | 11702 F40 | 242448 F52 | 1 F70 | 18 F71 | 504 F72 | 240 F73 | 2077 F74 | 14 F80 | 3 F81 | 456602 F83 | 833757 F84 | 129215 F85 | 259570 F86 | 1537 F87 | 4235 F88 | 160 H01 | 2480354 H02 | 79731 H10 | 3 H11 | 4151361 H12 | 3276916 H13 | 10452 H20 | 1 H21 | 274493 H22 | 167706 H70 | 90150 H71 | 18140 H72 | 23 H74 | 12026 H75 | 3409 H76 | 314006 H77 | 15467 H80 | 5 P11 | 327 P12 | 2 P13 | 10 P15 | 1728 P16 | 14 P17 | 7 P19 | 19 P21 | 1379 P22 | 1 P23 | 10 P25 | 1292 P27 | 18 P28 | 3 P29 | 10 P31 | 1957 P35 | 1344 P37 | 1 P39 | 15 P41 | 239049 P42 | 9 P43 | 23 P44 | 2 P45 | 544 P49 | 53 P51 | 1061 P63 | 2258 P64 | 277 P65 | 10 P70 | 1 P71 | 331 P73 | 284 P74 | 9024 (159 rows)
From the TIGER manual:

2005 First Edition TIGER/Line File Record Types
Record Type 1---Complete Chain Basic Data Record
Record Type 1 provides a single record for each unique complete chain
in the TIGER/Line files. The basic data record contains the end nodes
for the complete chain. This record also contains address ranges and
ZIP Codes (for most areas of the country where a street name/house
numbering system existed at the time of data extraction from the Census
TIGER database) and the Census 2000 census geographic entity codes for
each side of the complete chain. Additional feature identifier, address
range, and ZIP Code data related to Record Type 1 are found on Record
Types 4, 5, 6, and Z.
Record Type 2---Complete Chain Shape Coordinates
Record Type 2 provides an additional series of latitude and longitude
coordinate values describing the shape of each complete chain in Record
Type 1 that is not a straight line segment. That is, not all complete chains
in Record Type 1 have shape points and therefore not all have an associated
Record Type 2. Where a complete chain in Record Type 1 is not a
straight line, Record Type 2 may have a many-to-one relationship with
Record Type 1.
Record Type 4---Index to Alternate Feature Identifiers
Record Type 4 provides an index to alternate feature names associated
with the complete chain (Record Type 1). A Record Type 4 will not exist
for a Record Type 1 that has only one name. A complete chain can have
more than one alternate name. Record Type 4 has a many-to-one relationship
with Record Type 1 and a many-to-one relationship with Record
Type 5.
Record Type 5---Complete Chain Feature Identifiers
Record Type 5 contains a list of all unique feature names for complete
chains in the TIGER/Line files. Each name (or feature identifier) has an
identification code number (FEAT). Record Type 5 has a one-to-many
relationship with Record Type 4.
Record Type 6---Additional Address Range and ZIP Code Data
Record Type 6 provides additional address range information for a street
complete chain when the information cannot be presented as a single
address range (for example, the house/building numbers are not uniformly
arranged to form an address range). Record Type 6 appears only
for those counties that have address ranges and ZIP Code information in
the Census TIGER database. There is no assurance that the address
ranges provided on Record Type 6 will cover fewer addresses than the
address ranges appearing on Record Type 1. Data users must use Record
Type 6 to obtain the entire picture of the potential address ranges along a
complete chain. The address ranges used for geocoding along corporate
corridors and corporate offset limits appear only in Record Type 6.
Record Type 6 can have a one-to-one or a many-to-one relationship with
Record Types 1 and with Record Type Z.
Record Type 7---Landmark Features
Record Type 7 contains the area and point landmarks from the Census
TIGER database. If Record Type 7 represents an area landmark rather
than a point landmark, then a one-to-one relationship exists with Record
Type 8. Some of the information that was formerly in Record Type 9 now is
included in Record Type 7. If a county file has no landmarks Record Types 7
and 8 will not exist for that county.
Record Type 8---Polygons Linked to Area Landmarks
Record Type 8 links the polygon identification codes with the area landmark
identification codes. If a county file does not have any area landmarks then
there will not be a Record Type 7 or a Record Type 8 for that county. Record
Type 8 can have a one-to-one, one-to many, many-to-one, or many-to-many
relationship with Record Type P.
Record Type A---Polygon Geographic Entity Codes: Current Geography
Record Type A contains a record for each GT-polygon represented by
Record Type P in the TIGER/Line files. The U.S. Census Bureau provides
current (post-Census 2000) geographic entity codes---state, county, county
subdivision, place, American Indian/Alaska Native Area/Hawaiian Home
Land---on this record type.
Record Type B---Polygon Geographic Entity Codes: Corrections
Record Type B contains records for only those GT-polygons with
corrections. The U.S. Census Bureau provides corrected geographic entity
codes from the Count Question Resolution (CQR) Program for state,
county, county subdivision, place, and American Indian/Alaska Native
Area/Hawaiian Home Land areas on this record type. A Record Type B
will not exist in counties that do not contain corrected geography. The
CENID and POLYID fields may be used to link Record Type B to Record
Type S to determine which geographic entity code(s) were corrected.
Record Type C---Geographic Entity Names
Record Type C provides a unique list of all geographic codes, their associated
name, and some entity attributes in a flat (nonhierarchical) file. It
contains a Data Year field that may have five values: 2000 for geographic
names and codes valid for Census 2000, 200n (where 200n is the year of
extraction from the Census TIGER database) for geographic names and
codes valid for the current year, CQR0 for geographic names and codes
valid for corrected geography, EC02 for geographic names and codes
valid for the 2002 Economic Census, or blank when the geographic names
and codes for Census 2000, 200n, and CQR0 are the same. Multiple
records for the same geographic entity show its change or correction over
time. Record Type C is linked to other record types (1, A, B, E, and S)
through geographic entity codes.
Record Type E---Polygon Geographic Entity Codes: Economic Census
Record Type E provides the basic geographic entity codes---state, county, and
place---used in the Economic Census. Record Type E is linked to Record Type
C through geographic entity codes.
Record Type H---TIGER/Line ID History
Record Type H provides the history of each TIGER/Line ID when complete
chains (Record Type 1) are split or merged. Record Type H shows the TLIDs
of the complete chains in existence after the split or prior to the merge.
Record Type I---Link Between Complete Chains and Polygons
Record Type I links Record Type 1, the complete chain basic data, to
Record Type P, the GT-polygon internal point. The Record Type I to
Record Type 1 link (TLID) may be used to link complete chain attributes
and other data record types (2, 4, 6, H, U, and Z) to each other. The Record
Type I to Record Type P link (CENID and POLYID) may be used to link
polygon attributes and other data record types (8, A, B, E, and S) to each
other. Record Type I also serves as a link between the complete chain
attributes on Record Type 1 and the TIGER Zero-Cell ID (TZID) information
on Record Types T and U. Record Type I has a one-to-one relationship
with Record Types 1, T, and U but a many-to-one relationship with
Record Type P. When Record Type I is linked to a single-sided Record
Type 1 (county boundary), it will provide only the left- or the right-
GT-polygon identifier.
Record Type M---Feature Spatial Metadata Record
Record Type M provides spatial metadata for each feature in a TIGER/Line
file, identifying the source for the spatial coordinates. Record Type M can
have a one-to-one or many-to-one relationship with Record Type 1.
1-14 2005 First Edition TIGER/Line® Files
Record Type P---Polygon Internal Point
There is a Record Type P for every GT-polygon in the TIGER/Line files.
Record Type P has a one-to-one relationship with Record types A and S
and a one-to-many relationship with Record Type I and identifies the
internal point coordinates for each GT-polygon. See the Internal Points
section in Chapter 3.
The TIGER/Line files include all complete chains and GT-polygons in the
Census TIGER database. The topology of the Census TIGER database
ensures that a one-to-one relationship exists between the GT-polygons constructed
from Record Types 1 and 2 and Record Type P.
Record Type R---TIGER/Line ID Record Number Range
Record Type R contains the range of unique complete chain record
numbers (TLIDs) and TIGER Zero-Cell IDs (TZIDs) assigned to a census
file in a nationwide scheme. Record Type R has the lowest and the highest
record numbers for the range. Numbers are assigned to complete chains
or zero-cells beginning at the lowest value. The current number is the
highest record number for the census file used.
Each TIGER/Line file consists of an entire county or statistically equivalent
entity. In the Census TIGER database, the county or statistically
equivalent entity may be split into many partitions. The U.S. Census
Bureau assigns permanent record numbers to each of these partitions.
These record numbers are found in Record Type R. Record Type R is not
directly linked to any other record type.
Record Type S---Polygon Geographic Entity Codes: Census 2000
Record Type S contains a record for each GT-polygon represented by
Record Type P in the TIGER/Line files. The U.S. Census Bureau provides
the Census 2000 geographic entity codes that identify GT-polygons on this
record type.
Record Type T---TIGER Zero-Cell ID
There is a Record Type T for every TIGER Zero-Cell ID (TZID) in the
TIGER/Line files. Record Type T has a one-to-many relationship with
Record Type I.
Record Type U---TIGER/Line ID Overpass/Underpass Identification
Record Type U provides information identifying those complete chains
that are overpasses or underpasses and the order in which they cross
other complete chains. Record Type U has a one-to-one relationship with
Record Type T, a many-to-one relationship with Record Type 1, and a
one-to-many, many-to-one, or many-to-many relationship with Record
Type I.
Record Type Z---ZIP+4® Codes
Record Type Z provides Postal +4 Add-On codes that make ZIP+4 codes
out of the ZIP Codes on Type 1 and Type 6 records. Record Type Z has a
one-to-one or many-to-one relationship with Record Type 1 and with
Record Type 6.

You probably have the encoding set to UNICODE, set it to SQL_ASCII. I think on unix you get SQL_ASCII as the default, but the postgis installer on Windows sets it as UNICODE by default.
I think the answer is to manually install postgis on windows:
a. create a database – there's options to set the encoding
b. add the plpgsql language to your database
c. find the lwpostgis.sql file in your postgresql install (should be in share/contrib/ or share/contrib/postgis) and execute it.
d. (optional) add the spatial_ref_sys.sql to your database
You should be good to go from there. See the postgis install page (postgis.refractions.net) for more details.
Ah, but I'm running SuSE 10.0 and postgresql defaults to SQL_ASCII on Linux, as you mentinoed above. Note that I was using the 2005fe dataset, which appears to have possibly been updated, as when i made the post above puerto rico failed, but i have successfully loaded it now.
a note on puerto rico, i had to create my db with a different encoding (WIN1252) to get your java programs to work and maintain the spanish characters. a UTF8 encoding threw an "invalid UTF-8 byte sequence" error and SQL_ASCII just mangled the characters during loading since they were out of the ASCII range. also, the java programs such as MajorRoads.java do not recognize the spanish equivalents to their U.S. counterparts (i.e. Expreso). just a head's up!
There are a couple of typos above to be aware of in case you're cutting and pasting (as i did).
First, the SQL stmt to count the number of entitynames per module is missing a comma after module. Should be...
insert into qa_qc
from entitynames group by module;
select module,'c',count
Second, the first line of SQL in section 4c has incorrect syntax (remove the !). The correct statement is...
SELECT county,type,nfeats FROM counts_txt WHERE not(type in ('2','a','b','c','s') ) AND nfeats=0
Also, I personally had trouble getting the data from puerto rico and guam loaded at all on linux. ogr2ogr failed with an invalid UTF-8 byte error. If anyone has gotten this to work correctly please post what you did.
There were also several counties for which the data from the gov's website contained no zip code data. I re-downloaded the zip files again tonight to double-check file size then loaded one into a fresh PostGIS install and still no zip code data. There must be something wrong with the files available for download. Here are the modules I'm referring to...
tiger=# SELECT county,type,nfeats FROM counts_txt WHERE not(type in ('2','a','b','c','s') ) AND nfeats=0
tiger-# EXCEPT
tiger-# SELECT module,filetype,count FROM qa_qc ;
county | type | nfeats
---------
-----------TGR13307 | 6 | 0
TGR15005 | 4 | 0
TGR15005 | 6 | 0
TGR17069 | 6 | 0
TGR17151 | 6 | 0
TGR21039 | 6 | 0
TGR30069 | 6 | 0
TGR31009 | 6 | 0
TGR31103 | 6 | 0
TGR31117 | 6 | 0
TGR31183 | 6 | 0
TGR35021 | 6 | 0
TGR46017 | 6 | 0
TGR46095 | 6 | 0
TGR46113 | 6 | 0
TGR46137 | 6 | 0
TGR47175 | 6 | 0
TGR48261 | 4 | 0
TGR48301 | 6 | 0
TGR51021 | 6 | 0
TGR51029 | 6 | 0
TGR51091 | 6 | 0
TGR51115 | 6 | 0
TGR51540 | 8 | 0
TGR51685 | 8 | 0
(25 rows)