Documentation read from 07/15/2019 12:12:44 version of /vol/public-pseed/FIGdisk/dist/releases/cvs.1555556707/common/lib/FigKernelPackages/Sapling.pm.

Sapling Package

Sapling Database Access Methods

Introduction

The Sapling database is a new Entity-Relationship Database that attempts to encapsulate our data in a portable form for distribution. It is loaded directly from the genomes and subsystems of the SEED. This object has minimal capabilities: most of its power comes from the ERDB base class.

The fields in this object are as follows.

loadDirectory

Name of the directory containing the files used by the loaders.

loaderSource

Source object for the loaders (a FIG in our case).

genomeHash

Reference to a hash of the genomes to include when loading.

subHash

Reference to a hash of the subsystems to include when loading.

tuning

Reference to a hash of tuning parameters.

otuHash

Reference to a hash that maps genome IDs to genome set names.

Configuration and Construction

The default loading profile for the Sapling database is to include all complete genomes and all usable subsystems. This can be overridden by specifying a list of genomes and subsystems in an XML configuration file. The file must be named SaplingConfig.xml and must reside in the specified data directory. The document element should be Sapling, and it has two sub-elements for this purpose. The Genomes element should contain as its text a whitespace-delimited list of genome IDs. The Subsystems element should contain a list of subsystem names, one per line. If a particular section is missing, the default list will be used.

Example

The following configuration file specifies 10 genomes and 6 subsystems.

    <Sapling>
      <Genomes>
        100226.1 31033.3 31964.1 36873.1 126740.4
        155864.1 349307.7 350058.5 351348.5 412694.5
      </Genomes>
      <Subsystems>
        Sugar_utilization_in_Thermotogales
        Coenzyme_F420_hydrogenase
        Ribosome_activity_modulation
        prophage_tails
        CBSS-393130.3.peg.794
        Apigenin_derivatives
      </Subsystems>
    </Sapling>
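As a rough illustration of how such a file could be read, the sketch below pulls the two lists out of the XML text with regular expressions. The helper name parse_sapling_config is hypothetical, and the real module may well use an XML parser instead; this only shows the layout rules described above (whitespace-delimited genome IDs, one subsystem name per line, missing sections left unset so that a default can apply).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Sapling.pm): extract genome IDs and
# subsystem names from the text of a SaplingConfig.xml file.
sub parse_sapling_config {
    my ($xml) = @_;
    my %config;
    # Genome IDs are separated by any whitespace, including line breaks.
    if ($xml =~ m{<Genomes>(.*?)</Genomes>}s) {
        my $body = $1;
        $config{genomes} = [ split ' ', $body ];
    }
    # Subsystem names are listed one per line; trim them and skip blanks.
    if ($xml =~ m{<Subsystems>(.*?)</Subsystems>}s) {
        my $body = $1;
        $config{subsystems} = [
            grep { length }
            map  { my $line = $_; $line =~ s/^\s+|\s+$//g; $line }
            split /\n/, $body
        ];
    }
    # A missing section leaves its key unset, so the caller can fall
    # back to the default list.
    return \%config;
}

my $config = parse_sapling_config(<<'XML');
<Sapling>
  <Genomes>
    100226.1 31033.3
    31964.1
  </Genomes>
  <Subsystems>
    Coenzyme_F420_hydrogenase
    prophage_tails
  </Subsystems>
</Sapling>
XML

print scalar(@{ $config->{genomes} }), " genomes, ",
      scalar(@{ $config->{subsystems} }), " subsystems\n";
```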

The XML file also contains tuning parameters that affect the way the data is loaded. These are specified as attributes in the TuningParameters element, as follows.

maxLocationLength

The maximum number of base pairs allowed in a single location. IsLocatedIn records are split into sections based on this length. As a result, when you are looking for all the features in a particular neighborhood, you only need to look for locations within the maximum location length of the neighborhood. Even a huge operon that contains tens of thousands of base pairs will still be found.
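The splitting idea can be sketched as follows. The helper name split_location, the forward-strand-only handling, and the 4000 bp limit are all assumptions made for illustration; the actual loader code is not shown in this document.

```perl
use strict;
use warnings;

# Hypothetical helper: split a forward-strand span into sections of at
# most $maxLen base pairs. Each section is a [begin, length] pair.
sub split_location {
    my ($begin, $length, $maxLen) = @_;
    my @sections;
    while ($length > 0) {
        my $chunk = $length < $maxLen ? $length : $maxLen;
        push @sections, [$begin, $chunk];
        $begin  += $chunk;
        $length -= $chunk;
    }
    return @sections;
}

# A 9500 bp feature with an assumed 4000 bp limit becomes three sections.
my @sections = split_location(1000, 9500, 4000);
printf "%d+%d\n", @$_ for @sections;
```

Because no section is longer than the limit, a section that reaches into a neighborhood has to begin no more than one limit's length before it, which is what keeps the neighborhood search bounded.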

maxSequenceLength

The maximum number of base pairs allowed in a single DNA sequence. DNA sequences are broken into segments to prevent excessively large genomes from clogging memory during sequence resolution.

Global Section Constant

Each section of the database used by the loader corresponds to a single genome. The global section is loaded after all the others, and is concerned with data not related to a particular genome.

Tuning Parameter Defaults

Each tuning parameter must have a default value, in case it is not present in the XML configuration file. The defaults are specified in a constant hash reference called TUNING_DEFAULTS.
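A minimal sketch of the default mechanism, assuming the configured values simply override the defaults key by key. The numeric values and the helper name merge_tuning are placeholders, not the ones used in Sapling.pm.

```perl
use strict;
use warnings;

# Placeholder defaults; the real values in Sapling.pm may differ.
use constant TUNING_DEFAULTS => {
    maxLocationLength => 4000,
    maxSequenceLength => 1000000,
};

# Hypothetical helper: overlay the configured values on the defaults.
sub merge_tuning {
    my ($configured) = @_;
    return { %{ TUNING_DEFAULTS() }, %{ $configured || {} } };
}

# Only maxLocationLength is configured; the other value is defaulted.
my $tuning = merge_tuning({ maxLocationLength => 2000 });
print "$_ = $tuning->{$_}\n" for sort keys %$tuning;
```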

new

    my $sap = Sapling->new(%options);

Construct a new Sapling object. The following options are supported.

loadDirectory

Data directory to be used by the loaders.

DBD

XML database definition file.

dbName

Name of the database to use.

sock

Socket for accessing the database.

userData

Name and password used to log on to the database, separated by a slash.

dbhost

Database host name.

port

Port number to use (MySQL only).

dbms

Database management system to use (e.g. SQLite or postgres, default mysql).

Public Methods

OTU

    my $otu = $sap->OTU($genomeID);

Return the name of the Organism Taxonomic Unit (GenomeSet) for the specified genome ID. OTU information is cached in memory, so that once it is known, it does not need to be re-fetched from the database.

genomeID

ID of a genome or feature. If a feature ID is specified, the genome ID will be extracted from it.

RETURN

Returns the name of the genome set for the specified genome, or undef if the genome is not in the

ProteinID

    my $key = $sap->ProteinID($sequence);

Return the protein sequence ID that would be associated with a specific protein sequence.

sequence

String containing the protein sequence in question.

RETURN

Returns the ID value for the specified protein sequence. If the sequence exists in the database, it will have this ID in the ProteinSequence table.
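The IsProteinID documentation refers to the protein sequence ID as an MD5 code, which suggests the ID is a digest of the sequence. The sketch below assumes it is the hexadecimal MD5 digest of the sequence string exactly as passed in; that detail, and the name protein_id_sketch, are assumptions rather than confirmed behavior.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Assumption: the ID is the hexadecimal MD5 digest of the sequence text.
sub protein_id_sketch {
    my ($sequence) = @_;
    return md5_hex($sequence);
}

my $key = protein_id_sketch('MKLAVLGAGSWGTALA');   # made-up sequence
print "$key\n";
```

Since the ID depends only on the sequence, it can be computed without a database lookup, and identical sequences from different genomes share one ID.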

IsProteinID

    my $md5 = $sap->IsProteinID($identifier);

Check for a protein identifier. If a protein identifier is found, the corresponding protein sequence ID will be returned; otherwise, an undefined value will be returned. A protein identifier is either a raw protein sequence ID, an ID preceded by md5|, or an ID preceded by gnl|md5|.

identifier

Identifier to test.

RETURN

Returns the MD5 code from the protein identifier, or undef if the incoming string is not a protein identifier.
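The three accepted forms can be recognized with a single pattern. This sketch assumes a raw protein sequence ID is 32 hexadecimal digits; the function name is hypothetical.

```perl
use strict;
use warnings;

# Assumption: a raw protein sequence ID is 32 hexadecimal digits.
sub is_protein_id_sketch {
    my ($identifier) = @_;
    # Accept the bare ID, "md5|ID", or "gnl|md5|ID".
    if ($identifier =~ /^(?:(?:gnl\|)?md5\|)?([0-9a-f]{32})$/i) {
        return $1;
    }
    return;    # undef in scalar context
}

my $raw = 'a' x 32;    # made-up ID
for my $candidate ($raw, "md5|$raw", "gnl|md5|$raw", 'fig|83333.1.peg.4') {
    my $md5 = is_protein_id_sketch($candidate);
    print "$candidate => ", (defined $md5 ? $md5 : 'undef'), "\n";
}
```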

Assignment

    my $assignment = $sapling->Assignment($fid);

Return the functional assignment for the specified feature.

fid

FIG ID of the desired feature.

RETURN

Returns the functional assignment of the specified feature, or undef if the feature does not exist.

IdsForProtein

    my @ids = $sap->IdsForProtein($protID);

Return a list of all the identifiers associated with the specified protein.

protID

ID of the protein of interest.

RETURN

Returns a list of the Identifiers for the specific protein or for genes that produce the specific protein.

ComputeDNA

    my $dna = $sap->ComputeDNA($location);

Return the DNA sequence for the specified location.

location

A BasicLocation object indicating the contig, start location, direction, and length of the desired DNA segment.

RETURN

Returns a string containing the desired DNA. The DNA comes back in pure lower-case.

FilterByGenome

    my @filteredFids = $sapling->FilterByGenome(\@fids, $genomeFilter);

Filter the features using the specified genome-based criterion. The criterion can be either a comma-separated list of genome IDs, or a partial organism name.

fids

Reference to a list of feature IDs.

genomeFilter

A string specifying the filtering criterion. If undefined or blank, then no filter is applied. If a name, then only features from genomes with a matching name will be returned. A name is a match if the filter is an exact match for some prefix of the organism name. Thus, Listeria would get all Listerias, while Listeria monocytogenes EGD-e would match only the specific EGD-e strain. For a more precise match, you can specify instead a comma-delimited list of genome IDs. In this latter case, only features for the listed genomes will be included in the results.

RETURN

Returns the features from the incoming list that match the filter condition.
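The two filter modes can be sketched over an in-memory table. The genome names below are illustrative stand-ins for what the real method reads from the database, and the sketch assumes case-sensitive prefix matching and FIG IDs of the form fig|genome.type.number.

```perl
use strict;
use warnings;

# Illustrative genome names; the real method queries the database.
my %GENOME_NAMES = (
    '169963.1' => 'Listeria monocytogenes EGD-e',
    '272626.1' => 'Listeria innocua Clip11262',
    '83333.1'  => 'Escherichia coli K12',
);

sub filter_by_genome_sketch {
    my ($fids, $genomeFilter) = @_;
    # Undefined or blank filter: no filtering at all.
    return @$fids unless defined $genomeFilter && $genomeFilter =~ /\S/;
    my %wanted;
    if ($genomeFilter =~ /^\d+\.\d+(?:\s*,\s*\d+\.\d+)*$/) {
        # Comma-delimited list of genome IDs.
        %wanted = map { $_ => 1 } split /\s*,\s*/, $genomeFilter;
    } else {
        # Name filter: keep genomes whose name starts with the filter.
        %wanted = map { $_ => 1 }
                  grep { index($GENOME_NAMES{$_}, $genomeFilter) == 0 }
                  keys %GENOME_NAMES;
    }
    # Extract the genome ID from each feature ID and test it.
    return grep { /^fig\|(\d+\.\d+)\./ && $wanted{$1} } @$fids;
}

my @fids = ('fig|169963.1.peg.10', 'fig|272626.1.peg.20', 'fig|83333.1.peg.4');
print join(', ', filter_by_genome_sketch(\@fids, 'Listeria')), "\n";
```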

GetLocations

    my @locs = $sapling->GetLocations($fid);

Return the locations of the DNA for the specified feature.

fid

ID of the feature whose location is desired.

RETURN

Returns a list of BasicLocation objects for the locations containing the feature's DNA.

IdentifiedProtein

    my $proteinID = $sap->IdentifiedProtein($id);

Compute the protein for a specified identifier. If the identifier does not exist or does not identify a protein, this method will return undef.

id

Identifier whose protein is desired.

RETURN

Returns the protein ID corresponding to the incoming identifier, or undef if the identifier does not exist or is not for a protein.

FeaturesByID

    my @fids = $sapling->FeaturesByID($id);

Return all the features corresponding to the specified identifier. Only features that represent the same locus will be returned.

id

Identifier of interest.

RETURN

Returns a list of all the features in the database that match the given identifier.

ProteinsByID

    my @fids = $sapling->ProteinsByID($id);

Return all the features that have the same protein sequence as the identified feature. The returned features may or may not have the same locus. If the identifier is not for a protein-encoding gene, no result will be returned.

id

Identifier of interest. This can be any alias identifier from the Identifier table (which includes the FIG ID).

RETURN

Returns a list of FIG IDs for features having the same protein sequence. If the identifier does not specify a protein-encoding gene, the list will be empty.

GetSubsystem

    my $ssData = $sapling->GetSubsystem($ssName);

Return a SaplingSubsys object for the named subsystem.

ssName

Name of the desired subsystem.

RETURN

Returns an object that defines multiple useful methods for manipulating the named subsystem.

GenesInRegion

    my @pegs = $sap->GenesInRegion($location);

Return a list of the IDs for the features that overlap the specified region on a contig.

location

Location of interest, either in the form of a location string (e.g. 360108.3:NZ_AANK01000002_264528_264007) or a BasicLocation object.

RETURN

Returns a list of feature IDs. The features in the list will be all those that overlap or occur inside the location of interest.
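The overlap test behind this kind of query can be sketched as follows. The sketch assumes a location string has the form contig_begin_end, with begin greater than end indicating the backward strand, as the example string above suggests. The helper names and the second location are made up for illustration.

```perl
use strict;
use warnings;

# Assumption: a location string is <contig>_<begin>_<end>, and begin > end
# means the backward strand. The contig ID may itself contain underscores.
sub parse_location {
    my ($string) = @_;
    my ($contig, $begin, $end) = $string =~ /^(.+)_(\d+)_(\d+)$/
        or die "Invalid location string: $string\n";
    my ($left, $right) = $begin <= $end ? ($begin, $end) : ($end, $begin);
    return { contig => $contig, left => $left, right => $right,
             dir => ($begin <= $end ? '+' : '-') };
}

# Two locations overlap if they share a contig and neither one ends
# before the other begins.
sub overlaps {
    my ($loc1, $loc2) = @_;
    return $loc1->{contig} eq $loc2->{contig}
        && $loc1->{left} <= $loc2->{right}
        && $loc2->{left} <= $loc1->{right};
}

my $region = parse_location('360108.3:NZ_AANK01000002_264528_264007');
my $gene   = parse_location('360108.3:NZ_AANK01000002_264400_264900');  # made up
print "overlap\n" if overlaps($region, $gene);
```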

GetFasta

    my $fasta = $sapling->GetFasta($proteinID, $id, $comment);

Return a FASTA sequence for the specified protein. An optional identifier can be provided to be used as the identification string.

proteinID

Protein sequence identifier.

id (optional)

The identifier to be used in the FASTA output. If omitted, the protein ID is used.

comment (optional)

The comment string to be used in the identification line of the FASTA output. If omitted, no comment will be present.

RETURN

Returns a FASTA string for the protein. This includes the identification line and the protein letters themselves.
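For reference, the shape of the output can be sketched as below. Unlike the real method, this hypothetical helper takes the sequence directly instead of looking it up by protein ID, and the 60-letter line width is an assumption.

```perl
use strict;
use warnings;

# Hypothetical formatter: build a FASTA record from an ID, a sequence,
# and an optional comment.
sub fasta_sketch {
    my ($id, $sequence, $comment) = @_;
    my $header = ">$id";
    $header .= " $comment" if defined $comment && length $comment;
    # Break the sequence into lines of at most 60 letters (assumed width).
    my @lines = $sequence =~ /(.{1,60})/g;
    return join("\n", $header, @lines) . "\n";
}

print fasta_sketch('example_protein', 'MKV' x 30, 'made-up sequence');
```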

Taxonomy

    my @taxonomy = $sap->Taxonomy($genomeID, $format);

Return the full taxonomy of the specified genome, starting from the domain downward.

genomeID

ID of the genome whose taxonomy is desired. The genome does not need to exist in the database: the version number will be lopped off and the result used as an entry point into the taxonomy tree.

format (optional)

Format of the taxonomy. names will return primary names, numbers will return taxonomy numbers, and both will return taxonomy number followed by primary name. The default is names.

RETURN

Returns a list of taxonomy names, starting from the domain and moving down to the node where the genome is attached.
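Two details above lend themselves to a short sketch: stripping the version number from the genome ID, and the three output formats. The lineage below is an abbreviated, illustrative stand-in for what the database would supply, the helper names are hypothetical, and the single-space separator used for the both format is an assumption.

```perl
use strict;
use warnings;

# Strip the version suffix from a genome ID (e.g. 83333.1 becomes 83333).
sub taxon_of_genome {
    my ($genomeID) = @_;
    (my $taxon = $genomeID) =~ s/\.\d+$//;
    return $taxon;
}

# Render [number, name] pairs in one of the three documented formats.
sub format_lineage {
    my ($lineage, $format) = @_;
    $format ||= 'names';
    return map {
        my ($number, $name) = @$_;
        $format eq 'numbers' ? $number
      : $format eq 'both'    ? "$number $name"
      :                        $name;
    } @$lineage;
}

# Abbreviated, illustrative lineage ordered from the domain downward.
my @lineage = ([2, 'Bacteria'], [1224, 'Proteobacteria'],
               [83333, 'Escherichia coli K-12']);
print taxon_of_genome('83333.1'), "\n";
print join('; ', format_lineage(\@lineage, 'both')), "\n";
```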

IsDeletedFid

    my $flag = $sapling->IsDeletedFid($fid);

Return TRUE if the specified feature is not in the database, else FALSE.

fid

FIG ID of the relevant feature.

RETURN

Returns TRUE if the specified feature is not in the database, else FALSE.

GenomeHash

    my $genomeHash = $sap->GenomeHash();

Return a hash of the genomes configured to be in this database. The list is either taken from the active SEED database or from a configuration file in the data directory. The hash maps genome IDs to TRUE.

SubsystemID

    my $subID = $sap->SubsystemID($subName);

Return the ID of the subsystem with the specified name.

subName

Name of the relevant subsystem. A subsystem name with underscores for spaces will return the same ID as a subsystem name with the spaces still in it.

RETURN

Returns a normalized subsystem name.

Alias

    my $translatedID = $sap->Alias($fid, $source);

Return an alternate ID of the specified type for the specified feature. If no alternate ID of that type exists, the incoming value will be returned unchanged.

fid

FIG ID of the feature whose alias identifier is desired.

source

Database type for the alternate ID (e.g. LocusTag, NCBI, RefSeq). If SEED is specified, the ID will be returned unchanged and no database lookup will occur.

RETURN

Returns an equivalent ID for the specified feature that belongs to the specified database (that is, has the specified source). If no such ID exists, returns the incoming ID.

ContigLength

    my $contigLen = $sap->ContigLength($contigID);

Return the number of base pairs in the specified contig.

contigID

ID of the contig of interest.

RETURN

Returns the number of base pairs in the specified contig, or 0 if the contig does not exist.

ReactionRoles

    my @roles = $sap->ReactionRoles($rxnID);

Return a list of all the roles for a single reaction. The reactions are connected to roles through the complexes, so an extra step is required to sort out duplicates from the results.

rxnID

ID of the reaction whose roles are desired.

RETURN

Returns a list of the roles associated with the reaction.

RoleReactions

    my @rxns = $sap->RoleReactions($roleID);

Return a list of all the reactions for a single role. The reactions are connected to roles through the complexes, so an extra step is required to sort out duplicates from the results.

roleID

ID of the role whose reactions are desired.

RETURN

Returns a list of the IDs for the reactions associated with the role.
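The duplicate problem mentioned for ReactionRoles and RoleReactions arises because two complexes can lead to the same item on the other side. A common Perl idiom for the extra step is a seen-hash filter, sketched here over made-up data. The complex and role names are hypothetical, and the real methods may remove duplicates differently.

```perl
use strict;
use warnings;

# Made-up complex-to-role links for a single reaction.
my %ROLES_OF_COMPLEX = (
    cpx1 => ['Role A', 'Role B'],
    cpx2 => ['Role B', 'Role C'],
);

# Collect the roles for a list of complexes, keeping first occurrences.
sub unique_roles {
    my (@complexes) = @_;
    my %seen;
    return grep { !$seen{$_}++ }
           map  { @{ $ROLES_OF_COMPLEX{$_} || [] } } @complexes;
}

my @roles = unique_roles('cpx1', 'cpx2');
print join(', ', @roles), "\n";
```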

Configuration-Related Methods

SubsystemHash

    my $subHash = $sap->SubsystemHash();

Return a hash of the subsystems configured to be in this database. The list is either taken from the active SEED database or from a configuration file in the data directory. The hash maps subsystem names to TRUE.

TuningParameter

    my $parm = $erdb->TuningParameter($parmName);

Return the value of the specified tuning parameter. Tuning parameters are read from the XML configuration file.

parmName

Name of the parameter whose value is desired.

RETURN

Returns the parameter value.

ReadConfigFile

    my $xmlObject = $sap->ReadConfigFile();

Return the hash structure created from reading the configuration file, or an undefined value if the file is not found.

GlobalSection

    my $flag = $sap->GlobalSection($name);

Return TRUE if the specified section name is the global section, FALSE otherwise.

name

Section name to test.

RETURN

Returns TRUE if the parameter matches the GLOBAL constant, else FALSE.

LoadGenome

    my $stats = $sap->LoadGenome($genome, $directory);

Load the specified genome directory into the database. The genome's DNA, features, protein sequences, and other supporting information will be inserted. If the genome already exists, numerous errors will occur; therefore, it is recommended that the genome be deleted first using the ERDB "Delete" method.

genome

The ID of the genome being loaded.

directory

Name of the genome directory.

RETURN

Returns a statistics object describing the load activity.

Special-Purpose Methods

ComputeFeatureFilter

    my ($objects, $filter, @parms) = $sap->ComputeFeatureFilter($source, $genome);

Compute the initial object name list, filter string, and parameter list for a query by feature ID. The object name list will always end with the Feature entity, and the combination of the filter string and parameter list will translate the incoming ID from the specified format to a real FIG feature ID. If the specified format is FIG feature IDs, then the query will start on the Feature entity; otherwise, it will start with the Identifier entity. This is a special-purpose method that performs the task of intelligently modifying queries to allow for external ID types.

source (optional)

Database source of the IDs specified: SEED for FIG IDs, GENE for standard gene identifiers, or LocusTag for locus tags. In addition, you may specify RefSeq, CMR, NCBI, Trembl, or UniProt for IDs from those databases. Use mixed to allow mixed ID types (though this may cause problems when the same ID has different meanings in different databases). Use prefixed to allow IDs with a prefix indicating the ID type (e.g. uni|P00934 for a UniProt ID, gi|135813 for an NCBI identifier, and so forth). The default is SEED.

genome (optional)

ID of a genome. If specified, only features from the specified genome will be accepted by the filter. This is important for IDs that are ambiguous between genomes (like Locus Tags). If omitted, no genome filtering will take place.

RETURN

Returns a list containing parameters to the desired query call. The first element is the prefix for the object name list, the second is the prefix for the filter string, and the subsequent elements form the prefix for the parameter value list.

FindGapLeft

    my @operonData = $sap->FindGapLeft($loc, $maxGap, $interval, \%redundancyHash, \$redundancyFlag);

This method performs a rather arcane task: searching for a gap to the left of a location in the contig. The search will proceed from the starting point to the left, and will stop when a gap between occupied locations is found that is larger than the specified maximum. The caller has the option of specifying a hash of feature IDs that are redundant. If any feature in the hash is found, the search will stop early and the provided redundancy flag will be set. In addition, an interval size can be specified to tune the process of retrieving data from the database.

loc

BasicLocation object for the location from which the search is to start. This gives us the contig ID, the strand of interest (forward or backward), and the starting point of the search.

maxGap

The maximum allowable gap. The search will stop at the left end of the contig or the first gap larger than this amount.

interval (optional)

Interval to use for retrieving data from the database. This is the size of the contig segments being retrieved. The default is 10000.

redundancyHash (optional)

A hash of feature IDs. If any feature present in this hash is found during the search, the search will stop and no data will be returned. The default is an empty hash (no check).

redundancyFlag (optional)

A reference to a scalar flag. If present, the entire method will be bypassed if the flag is TRUE. If a redundancy hash is specified and a redundant feature is found, this flag will be set to TRUE by the method.

RETURN

Returns a list of 4-tuples. Each tuple will contain a feature ID, a begin offset, a direction (+ or -), and a length, representing an occupied location on the contig and the feature to which it belongs. The complete list of locations will be to the left of the starting location and relatively close together, with no gap larger than the caller-specified maximum.
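The core of the search can be sketched over an in-memory list of occupied locations. This is a simplified illustration, not the method's actual implementation: it ignores strand selection, the retrieval interval, and the redundancy flag, and the function names and feature IDs are hypothetical. Tuples follow the documented layout of feature ID, begin offset, direction, and length.

```perl
use strict;
use warnings;

# Leftmost and rightmost base pair of a [fid, begin, dir, len] tuple.
# For a backward location, the begin offset is the rightmost position.
sub bounds {
    my ($tuple) = @_;
    my (undef, $begin, $dir, $len) = @$tuple;
    return $dir eq '+' ? ($begin, $begin + $len - 1)
                       : ($begin - $len + 1, $begin);
}

sub find_gap_left_sketch {
    my ($start, $maxGap, $occupied, $redundant) = @_;
    $redundant ||= {};
    # Consider locations that begin left of the starting point,
    # visiting them right-to-left by their right edge.
    my @candidates =
        sort { (bounds($b))[1] <=> (bounds($a))[1] }
        grep { (bounds($_))[0] < $start } @$occupied;
    my $edge = $start;
    my @found;
    for my $tuple (@candidates) {
        my ($left, $right) = bounds($tuple);
        # Stop at the first gap wider than the maximum.
        last if $edge - $right - 1 > $maxGap;
        # A redundant feature ends the search with no data returned.
        return () if $redundant->{ $tuple->[0] };
        push @found, $tuple;
        $edge = $left if $left < $edge;
    }
    return @found;
}

my @occupied = (
    ['fig|1.1.peg.1', 5000, '+', 1001],   # occupies 5000..6000
    ['fig|1.1.peg.2', 9400, '-',  401],   # occupies 9000..9400
    ['fig|1.1.peg.3', 9500, '+',  401],   # occupies 9500..9900
);
my @operonData = find_gap_left_sketch(10000, 200, \@occupied);
print join(', ', map { $_->[0] } @operonData), "\n";
```

With a maximum gap of 200, the search keeps the two features nearest the starting point and stops at the 2999 bp gap before the third.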

FindGapRight

    my @operonData = $sap->FindGapRight($loc, $maxGap, $interval, \%redundancyHash, \$redundancyFlag);

This method is the dual of "FindGapLeft": it searches for a gap to the right of a location in the contig. The search will proceed from the starting point to the right, and will stop when a gap between occupied locations is found that is larger than the specified maximum. The caller has the option of specifying a hash of feature IDs that are redundant. If any feature in the hash is found, the search will stop early and the provided redundancy flag will be set. In addition, an interval size can be specified to tune the process of retrieving data from the database.

loc

BasicLocation object for the location from which the search is to start. This gives us the contig ID, the strand of interest (forward or backward), and the starting point of the search.

maxGap

The maximum allowable gap. The search will stop at the right end of the contig or the first gap larger than this amount.

interval (optional)

Interval to use for retrieving data from the database. This is the size of the contig segments being retrieved. The default is 10000.

redundancyHash (optional)

A hash of feature IDs. If any feature present in this hash is found during the search, the search will stop and no data will be returned. The default is an empty hash (no check).

redundancyFlag (optional)

A reference to a scalar flag. If present, the entire method will be bypassed if the flag is TRUE. If a redundancy hash is specified and a redundant feature is found, this flag will be set to TRUE by the method.

RETURN

Returns a list of 4-tuples. Each tuple will contain a feature ID, a begin offset, a direction (+ or -), and a length, representing an occupied location on the contig and the feature to which it belongs. The complete list of locations will be to the right of the starting location and relatively close together, with no gap larger than the caller-specified maximum.

GenomesInPairSet

    my @genomes = $sap->GenomesInPairSet($pairSetID);

Return a list of the IDs for all of the genomes represented in the specified pair set. This is useful when analyzing what data is missing from the coupling tables.

pairSetID

ID of the pair set to examine.

RETURN

Returns a list of the IDs for the genomes represented in the specified pair set.

Virtual Methods

PreferredName

    my $name = $erdb->PreferredName();

Return the variable name to use for this database when generating code.

GetSourceObject

    my $source = $erdb->GetSourceObject();

Return the object to be used in creating load files for this database. This is only the default source object. Loaders have the option of overriding the chosen source object when constructing the ERDBLoadGroup objects.

SectionList

    my @sections = $erdb->SectionList();

Return a list of the names for the different data sections used when loading this database. The default is a single string, in which case there is only one section representing the entire database.

Loader

    my $groupLoader = $erdb->Loader($groupName, $source, $options);

Return an ERDBLoadGroup object for the specified load group. This method is used by ERDBGenerator.pl to create the load group objects. If you are not using ERDBGenerator.pl, you don't need to override this method.

groupName

Name of the load group whose object is to be returned. The group name is guaranteed to be a single word with only the first letter capitalized.

source

The source object used to access the data from which the load file is derived. This is the same object returned by "GetSourceObject"; however, we allow the caller to pass it in as a parameter so that we don't end up creating multiple copies of a potentially expensive data structure. It is permissible for this value to be undefined, in which case the source will be retrieved the first time the client asks for it.

options

Reference to a hash of command-line options.

RETURN

Returns an ERDBLoadGroup object that can be used to process the specified load group for this database.

LoadGroupList

    my @groups = $erdb->LoadGroupList();

Returns a list of the names for this database's load groups. This method is used by ERDBGenerator.pl when the user wishes to load all table groups. The default is a single group called 'All' that loads everything.

LoadDirectory

    my $dirName = $erdb->LoadDirectory();

Return the name of the directory in which load files are kept. The default is the FIG temporary directory, which is a really bad choice, but it's always there.

UseInternalDBD

    my $flag = $erdb->UseInternalDBD();

Return TRUE if this database should be allowed to use an internal DBD. The internal DBD is stored in the _metadata table, which is created when the database is loaded. The Sapling uses an internal DBD.