| users:indexing_a_corpus [2011/03/06 14:03]  – notes on data compression and using cwb-make from CWB/Perl stefan | users:indexing_a_corpus [2014/09/02 11:23] (current)  – [Indexing your first corpus]  eros | 
|---|
| Now run the encode tool: | Now run the encode tool: | 
 |  | 
| <code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /corpora/c1/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code> | <code bash>cwb-encode -d ~/mycorpus -f filename.xml -R /usr/local/share/cwb/registry/mycorpus -c latin1 -P pos -P lemma -S text+id -S s -0 corpus</code> | 
 |  | 
| In the above example: | In the above example: | 
|   * **-d ~/mycorpus** designates the directory where corpus data will be stored (here relative to your home directory) |   * **-d ~/mycorpus** designates the directory where corpus data will be stored (here relative to your home directory) | 
|   * **-f filename.xml** is the filename of the original text file containing the corpus (it may also be gzip-compressed, with a filename ending in ''.gz'') |   * **-f filename.xml** is the filename of the original text file containing the corpus (it may also be gzip-compressed, with a filename ending in ''.gz'') | 
|   * **-R /corpora/c1/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details |   * **-R /usr/local/share/cwb/registry/mycorpus** is the full path to the registry file containing info about the corpus (i.e. its location, structures, attributes etc.); this file is automatically generated by //cwb-encode//, so you don't have to worry about the format details | 
|   * **-c latin1** indicates that the input file is encoded in the ISO-8859-1 ("Latin-1") character set; other character sets are available in [[http://cwb.sourceforge.net/beta.php|CWB release 3.2]] and newer (currently in beta-testing) |   * **-c latin1** indicates that the input file is encoded in the ISO-8859-1 ("Latin-1") character set; other character sets are available in [[http://cwb.sourceforge.net/beta.php|CWB release 3.2]] and newer (currently in beta-testing) | 
|   * **-P pos** tells //cwb-encode// that the corpus has the positional attribute ''pos'' (part-of-speech tags) in the second column |   * **-P pos** tells //cwb-encode// that the corpus has the positional attribute ''pos'' (part-of-speech tags) in the second column | 
 |  | 
| Congratulations, you've just indexed your first corpus! | Congratulations, you've just indexed your first corpus! | 
 |   | 
 |  | 
| ===== Test it ===== | ===== Test it ===== |