<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE Annotation PUBLIC "-//BIOSCOPE//DTD  SCOPE ANNOTATION 1.0//EN" "BioScope.dtd">
<Annotation created="23/6/2008"  creator="XMLconverter">
	<DocumentSet>
		<Document type="Biological_full_article">
			<DocID type="BMC_ID">1471-2105-8-225</DocID>
				<DocumentPart type="Title">
					<sentence id="S1.1">Mining prokaryotic genomes for unknown amino acids: a stop-codon-based approach</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S1.2">Abstract</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S1.3">Background</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.4">Selenocysteine and pyrrolysine are the 21st and 22nd amino acids, which are genetically encoded by stop codons.</sentence>
					<sentence id="S1.5">Since a number of microbial genomes have been completely sequenced to date, it is tempting to ask <xcope id="X1.5.1"><cue type="speculation" ref="X1.5.1">whether</cue> the 23rd amino acid is left undiscovered in these genomes</xcope>.</sentence>
					<sentence id="S1.6">Recently, a computational study addressed this question and reported that <xcope id="X1.6.1"><cue type="negation" ref="X1.6.1">no</cue> tRNA gene for unknown amino acid was found in genome sequences available</xcope>.</sentence>
					<sentence id="S1.7">However, <xcope id="X1.7.2">performance of the tRNA prediction program on an unknown tRNA family, which <xcope id="X1.7.1"><cue type="speculation" ref="X1.7.1">may</cue> have atypical sequence and structure</xcope>, is <cue type="speculation" ref="X1.7.2">unclear</cue></xcope>, thereby rendering their result inconclusive.</sentence>
					<sentence id="S1.8">A protein-level study will provide independent insight into the novel amino acid.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S1.9">Results</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.10"><xcope id="X1.10.1"><cue type="speculation" ref="X1.10.1">Assuming</cue> that the 23rd amino acid is also encoded by a stop codon</xcope>, we systematically predicted proteins that contain stop-codon-encoded amino acids from 191 prokaryotic genomes.</sentence>
					<sentence id="S1.11">Since our prediction method relies only on the conservation patterns of primary sequences, it also provides an opportunity to search novel selenoproteins and other readthrough proteins.</sentence>
					<sentence id="S1.12">It successfully recovered many of currently known selenoproteins and pyrrolysine proteins.</sentence>
					<sentence id="S1.13">However, <xcope id="X1.13.1"><cue type="negation" ref="X1.13.1">no</cue> promising candidate for the 23rd amino acid was detected</xcope>, and only one novel selenoprotein was predicted.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S1.14">Conclusion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.15">Our result <xcope id="X1.15.3"><cue type="speculation" ref="X1.15.3">suggests</cue> that <xcope id="X1.15.2">the unknown amino acid encoded by stop codons does <xcope id="X1.15.1"><cue type="negation" ref="X1.15.1">not</cue> exist</xcope>, <cue type="speculation" ref="X1.15.2">or</cue> its phylogenetic distribution is rather limited</xcope></xcope>, which is in agreement with the previous study on tRNA.</sentence>
					<sentence id="S1.16">The method described here can be used in future studies to explore novel readthrough events from complete genomes, which are rapidly growing.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S1.17">Background</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.18">Stop codon readthrough is a phenomenon in which the translation process does <xcope id="X1.18.1"><cue type="negation" ref="X1.18.1">not</cue> terminate at a stop codon</xcope>, and an amino acid is inserted there instead 12.</sentence>
					<sentence id="S1.19">In some cases, the inserted amino acid is <xcope id="X1.19.1"><cue type="negation" ref="X1.19.1">not</cue> one of the 20 amino acids</xcope> but a noncanonical one.</sentence>
					<sentence id="S1.20">Two such amino acids have been discovered to date: selenocysteine 34 and pyrrolysine 56.</sentence>
					<sentence id="S1.21">Because each of them have specialized tRNA genes for decoding and can be considered extensions of the standard genetic code, they are called the 21st and 22nd amino acids, respectively.</sentence>
					<sentence id="S1.22">Selenocysteine, the 21st amino acid, is encoded by stop codon UGA, and organisms that use selenocysteine have been found from all three domains of life.</sentence>
					<sentence id="S1.23">Its insertion into UGA is directed by SECIS (selenocysteine insertion sequence) elements, a stem-loop structure on the selenoprotein mRNA.</sentence>
					<sentence id="S1.24">Along with the progress of genome sequencing projects, computational prediction methods of selenocysteine-containing proteins (selenoproteins) have been developed by several research groups 78910, and the repertoire of selenoproteins has been greatly expanded 1112.</sentence>
					<sentence id="S1.25">Pyrrolysine, the 22nd amino acid encoded by stop codon UAG, was recently discovered from a methanogenic archaea 56.</sentence>
					<sentence id="S1.26">Currently, only methanogenic archaea of the order Methanosarcinales and one bacterium are considered to utilize pyrrolysine 13.</sentence>
					<sentence id="S1.27">The limited phylogenetic distribution of pyrrolysine <xcope id="X1.27.2"><cue type="speculation" ref="X1.27.2">suggests</cue> that its incorporation into the genetic code of methanogen is relatively recent, and the insertion mechanism of a novel amino acid <xcope id="X1.27.1"><cue type="speculation" ref="X1.27.1">can</cue> evolve in a shorter period of time than anticipated</xcope></xcope>.</sentence>
					<sentence id="S1.28">This <xcope id="X1.28.1"><cue type="speculation" ref="X1.28.1">raises an interesting question</cue>:: ""Is there a 23rd amino acid"</xcope>?".</sentence>
					<sentence id="S1.29">If such an amino acid is discovered, it will deepen our understanding of the evolution and diversity of the genetic code.</sentence>
					<sentence id="S1.30">Because genome sequences of various prokaryotes are available today, there will be a chance to discover the novel amino acid via analysis of these genomes.</sentence>
					<sentence id="S1.31">Since both the 21st and 22nd amino acids are encoded by stop codons, the prime suspect is other stop codons (e.g. stop codon UAA), although the possibility of sense codons certainly remains.</sentence>
					<sentence id="S1.32">Using this clue, computational screening methods of the 23rd amino acid can be designed.</sentence>
					<sentence id="S1.33">Recently, Lobanov et al. addressed this problem by searching tRNAs with anticodons corresponding to stop codons 14.</sentence>
					<sentence id="S1.34">They analyzed 146 prokaryotic genomes, but <xcope id="X1.34.1"><cue type="negation" ref="X1.34.1">no</cue> likely tRNA of the novel amino acid was detected</xcope>.</sentence>
					<sentence id="S1.35">They concluded that the 23rd amino acid <xcope id="X1.35.1"><cue type="speculation" ref="X1.35.1">would</cue> have a limited phylogenetic distribution, if it exists</xcope>.</sentence>
					<sentence id="S1.36">However, programs for tRNA identification are based on the features of known tRNAs and do not necessarily perform well on unknown ones.</sentence>
					<sentence id="S1.37">Actually, tRNASec and tRNAPyl have unusual secondary structures 515 and often escape detection by programs <xcope id="X1.37.1"><cue type="negation" ref="X1.37.1">without</cue> special consideration</xcope>.</sentence>
					<sentence id="S1.38">Lobanov et al. thus developed a sensitive search method to deal with this problem, but they also admitted that it <xcope id="X1.38.2"><cue type="speculation" ref="X1.38.2">would</cue> <xcope id="X1.38.1"><cue type="negation" ref="X1.38.1">fail</cue> to identify highly unusual tRNAs</xcope></xcope>.</sentence>
					<sentence id="S1.39">There is another approach to searching for the 23rd amino acid.</sentence>
					<sentence id="S1.40">By enumerating ORFs that have an inframe stop codon from genomes and examining their evolutionary conservation, candidate proteins can be predicted.</sentence>
					<sentence id="S1.41">Because such an ORF-based study is independent from the tRNA analysis, it can <xcope id="X1.41.1"><cue type="speculation" ref="X1.41.1">either</cue> identify candidate organisms missed by the previous study <cue type="speculation" ref="X1.41.1">or</cue> strengthen its negative conclusion</xcope>.</sentence>
					<sentence id="S1.42">Here we report a comprehensive analysis of prokaryotic ORFs that contain an inframe stop codon.</sentence>
					<sentence id="S1.43">Through enumeration of theoretical ORFs and inspection of their evolutionary conservation, candidates of readthrough proteins were predicted.</sentence>
					<sentence id="S1.44">They contained many of the known proteins with stop-codon-encoded amino acids, but almost no novel candidates were identified.</sentence>
					<sentence id="S1.45">Therefore, <xcope id="X1.45.1">the unknown amino acid, if it is encoded by a stop codon, is <cue type="speculation" ref="X1.45.1">unlikely</cue> to exist in the current databases of microbial genomes</xcope>.</sentence>
					<sentence id="S1.46">The consequences for selenoproteins and other readthrough genes are also discussed.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S1.47">Results</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S1.48">Basic ideas</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.49">In this study, we focus on theoretical ORFs with one inframe stop codon, termed "interrupted ORFs" (iORFs) (Figure 1a).</sentence>
					<sentence id="S1.50">If we enumerate all iORFs from microbial genomes, most of the readthrough genes will be included in them.</sentence>
					<sentence id="S1.51">However, the vast majority of the enumerated iORFs will be biologically meaningless.</sentence>
					<sentence id="S1.52">To filter out such meaningless iORFs, we required the iORFs to have at least one homolog in other genomes, because evolutionary conservation of primary sequence is a strong indicator of functional importance.</sentence>
					<sentence id="S1.53">However, this condition is <xcope id="X1.53.1"><cue type="negation" ref="X1.53.1">not</cue> sufficient</xcope>, since two major problems remain:: pseudogenes and two adjacent genes.</sentence>
					<sentence id="S1.54">The first problem is that even if an iORF has homologs in other species, it <xcope id="X1.54.2"><cue type="speculation" ref="X1.54.2">could</cue> be <xcope id="X1.54.1">a pseudogene <cue type="speculation" ref="X1.54.1">or</cue> a product of sequencing error</xcope></xcope>.</sentence>
					<sentence id="S1.55">The second problem is that adjacent genes on the same reading frame <xcope id="X1.55.1"><cue type="speculation" ref="X1.55.1">may</cue> satisfy the condition of conserved iORFs</xcope>.</sentence>
					<sentence id="S1.56">In particular, gene pairs within an operon are problematic because their gene arrangement is often conserved.</sentence>
					<sentence id="S1.57">If <xcope id="X1.57.2">the intergenic distance between two genes in an operon <cue type="speculation" ref="X1.57.2">happens</cue> to be a multiple of three</xcope>, they <xcope id="X1.57.1"><cue type="speculation" ref="X1.57.1">look like</cue> a conserved readthrough gene</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S1.58">Basic ideas of the prediction method</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.59">Basic ideas of the prediction method.</sentence>
					<sentence id="S1.60">(a) Schematic illustration of an interrupted ORF (iORF).</sentence>
					<sentence id="S1.61">(b) Readthrough genes can be distinguished from two adjacent genes based on the results of BLAST searches.</sentence>
					<sentence id="S1.62">Boxes denote iORFs, and &#215; indicates the inframe stop codon.</sentence>
					<sentence id="S1.63">Shaded regions represent actual protein-coding regions.</sentence>
					<sentence id="S1.64">If an iORF codes a readthrough protein, BLAST hits from other organisms will cover the inframe stop codon.</sentence>
					<sentence id="S1.65">In contrast, if the iORF consists of two adjacent genes, many hits that do <xcope id="X1.65.1"><cue type="negation" ref="X1.65.1">not</cue> cover the inframe stop codon</xcope> will be found.</sentence>
					<sentence id="S1.66">To discriminate them from true readthrough genes, evolutionary information was exploited.</sentence>
					<sentence id="S1.67">In order to eliminate pseudogenes and sequencing errors, conservation of iORFs and their inframe stop codons was examined.</sentence>
					<sentence id="S1.68">Since pseudogenes are less conserved, and sequencing errors are relatively rare events, they will <xcope id="X1.68.1"><cue type="negation" ref="X1.68.1">not</cue> have homologous iORFs in other species</xcope>.</sentence>
					<sentence id="S1.69">Even if they do, the position or type (UAA, UAG or UGA) of their inframe stop codons will <xcope id="X1.69.1"><cue type="negation" ref="X1.69.1">not</cue> coincide</xcope>.</sentence>
					<sentence id="S1.70">In this way, they can be eliminated as candidates.</sentence>
					<sentence id="S1.71">A drawback of this criterion is that it limits the target of our study to readthrough genes conserved across two or more species.</sentence>
					<sentence id="S1.72">In other words, species-specific readthrough genes are <xcope id="X1.72.1"><cue type="negation" ref="X1.72.1">not</cue> in the scope of this study</xcope>.</sentence>
					<sentence id="S1.73">To address the second problem, adjacent gene pairs were filtered out by examining boundaries of sequence alignments between iORFs and its homologs (Figure 1b).</sentence>
					<sentence id="S1.74">The stop-codon-encoded amino acids of prokaryotes are usually located inside domains, the units of evolutionary sequence conservation.</sentence>
					<sentence id="S1.75">Therefore, the aligned regions of readthrough proteins contain their inframe stop codon.</sentence>
					<sentence id="S1.76">Based on this observation, each iORF was required to have:: (i) at least one homolog from other organisms that covers the inframe stop codon and (ii) <xcope id="X1.76.2"><cue type="negation" ref="X1.76.2">no</cue> homolog that does <xcope id="X1.76.1"><cue type="negation" ref="X1.76.1">not</cue> cover the stop codon</xcope></xcope>.</sentence>
					<sentence id="S1.77">Note that, however, if the whole length of an iORF was used as a query sequence, this procedure will erroneously discard multidomain readthrough proteins.</sentence>
					<sentence id="S1.78">To avoid this problem, a partial sequence around the inframe stop codon was used as a query.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S1.79">Prediction procedure</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.80">The prediction schema is shown in Figure 2.</sentence>
					<sentence id="S1.81">A total of 191 prokaryotes were analyzed in this study, of which 166 are bacteria and 25 are archaea.</sentence>
					<sentence id="S1.82">They were selected from 328 prokaryotes with completely sequenced genomes by excluding closely related species.</sentence>
					<sentence id="S1.83">From the genome sequences of the 191 organisms, all possible iORFs were enumerated.</sentence>
					<sentence id="S1.84">Two conditions were imposed on the geometry of the iORFs (Figure 1a).</sentence>
					<sentence id="S1.85">First, only iORFs longer than 80 codons were extracted.</sentence>
					<sentence id="S1.86">Secondly, margins between the inframe stop codon and both termini of the iORF must be longer than 10 codons.</sentence>
					<sentence id="S1.87">The total number of iORFs extracted under these conditions was 2,969,958.</sentence>
					<sentence id="S1.88">Next, iORFs that overlap RNA genes or protein-coding genes in different reading frames were discarded.</sentence>
					<sentence id="S1.89">This test significantly reduced the number of iORFs to 390,926.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S1.90">A flowchart of the prediction procedure</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.91">A flowchart of the prediction procedure.</sentence>
					<sentence id="S1.92">Several steps are omitted for simplicity.</sentence>
					<sentence id="S1.93">Detailed explanation is given in the text.</sentence>
					<sentence id="S1.94">As noted above, the target of this study is evolutionarily conserved iORFs.</sentence>
					<sentence id="S1.95">Thus, it was examined <xcope id="X1.95.1"><cue type="speculation" ref="X1.95.1">whether</cue> the iORFs have homologous regions in other genomes</xcope>.</sentence>
					<sentence id="S1.96">The 390,926 iORFs were translated into amino acid sequences and subjected to TBLASTN 16 against the 191 genome sequences.</sentence>
					<sentence id="S1.97"><xcope id="X1.97.1"><cue type="negation" ref="X1.97.1">Instead of</cue> the whole length of the amino acid sequence</xcope>, a window of 101 residues centered at the inframe stop codon was used as a BLAST query.</sentence>
					<sentence id="S1.98">After the BLAST searches, iORFs that have at least one interspecific hit that contains the inframe stop codon were collected.</sentence>
					<sentence id="S1.99"><xcope id="X1.99.3"><cue type="speculation" ref="X1.99.3">Whether</cue> the codon aligned to the inframe stop codon is <xcope id="X1.99.2">a nonsense codon <cue type="speculation" ref="X1.99.2">or</cue> <xcope id="X1.99.1"><cue type="negation" ref="X1.99.1">not</cue></xcope></xcope></xcope> was neglected at this stage.</sentence>
					<sentence id="S1.100">There were 94,690 iORFs that have interspecific hits.</sentence>
					<sentence id="S1.101">The result of the above homology searches was also used for the boundary analysis (Figure 1b).</sentence>
					<sentence id="S1.102">An iORF was discarded if there were any BLAST hits that do <xcope id="X1.102.1"><cue type="negation" ref="X1.102.1">not</cue> cover the inframe stop codon</xcope>.</sentence>
					<sentence id="S1.103">A total of 26,003 iORF satisfied the above criteria.</sentence>
					<sentence id="S1.104">To examine intrafamily conservation of the inframe stop codons, these iORFs were clustered into protein families based on sequence similarity.</sentence>
					<sentence id="S1.105">After removal of singletons, 679 clusters with two or more members were obtained.</sentence>
					<sentence id="S1.106">A cluster was discarded unless all members of the cluster had the same type of inframe stop codons (UAA, UAG or UGA).</sentence>
					<sentence id="S1.107">The locations of the inframe stop codons were also required to be identical in the multiple sequence alignment of the cluster members.</sentence>
					<sentence id="S1.108">These conditions reduced the number of clusters to 273.</sentence>
					<sentence id="S1.109">Manual inspection of these 273 clusters revealed that they still contain many false positives that are unrelated to stop-codon-encoded amino acids.</sentence>
					<sentence id="S1.110">Hence, three-step filtering procedures were applied to remove the false positives.</sentence>
					<sentence id="S1.111">Briefly, the first filter assesses protein-likeliness based on the signal of purifying selection, while the second and third filters try to remove adjacent gene pairs using the pattern of BLAST alignments (for details, see Materials and Methods).</sentence>
					<sentence id="S1.112">As a result of the filtering, the number of candidate clusters was reduced to 32.</sentence>
					<sentence id="S1.113">Through manual inspection of the BLAST alignments, 11 clusters were discarded because <xcope id="X1.113.1">they are highly <cue type="speculation" ref="X1.113.1">unlikely</cue> to code readthrough proteins</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S1.114">Known proteins in the predicted clusters</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.115">The clusters predicted by our method are summarized in Table 1.</sentence>
					<sentence id="S1.116">Of the 21 clusters, 15 were known selenoproteins, and four were known pyrrolysine proteins.</sentence>
					<sentence id="S1.117">To assess the sensitivity of our method, the result was compared with a list of prokaryotic selenoproteins reported by Kryukov and Gladyshev 12.</sentence>
					<sentence id="S1.118">Since our target is readthrough genes conserved across two or more species, such selenoprotein families were selected from their list.</sentence>
					<sentence id="S1.119">There were 15 families satisfying this criterion, but one family, proline reductase, was excluded because it was found in only one organism in our dataset.</sentence>
					<sentence id="S1.120">Of the 14 families, 11 were found in our prediction result.</sentence>
					<sentence id="S1.121">The three families we <xcope id="X1.121.1"><cue type="negation" ref="X1.121.1">failed</cue> to find</xcope> were SelW-like protein, peroxiredoxin and thiol:protein disulphide oxidoreductase.</sentence>
					<sentence id="S1.122">SelW-like protein was below the threshold of detection, because its stop codon is near the N-terminus and the amino acid sequences of its members are too divergent.</sentence>
					<sentence id="S1.123">The reason why <xcope id="X1.123.1">the two other families were <cue type="negation" ref="X1.123.1">not</cue> detected</xcope> is more complex.</sentence>
					<sentence id="S1.124">Since these two families are homologous, they were grouped into an identical cluster at the clustering stage of our method.</sentence>
					<sentence id="S1.125">However, the positions of selenocysteine were different between the two families (Figure 3).</sentence>
					<sentence id="S1.126">The cluster was thus discarded because of an <xcope id="X1.126.2"><cue type="speculation" ref="X1.126.2">apparent</cue> <xcope id="X1.126.1"><cue type="negation" ref="X1.126.1">lack</cue> of stop codon conservation</xcope></xcope>.</sentence>
					<sentence id="S1.127">To deal with a situation like this, a reexamination of the clustering threshold and subdivision of clusters will be required.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S1.128">Predicted clusters of readthrough proteins</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.129">A plus sign in a locus <xcope id="X1.129.2"><cue type="speculation" ref="X1.129.2">indicates that</cue> the genomic coordinates of the iORF can be described by a concatenation of two <xcope id="X1.129.1">genes <cue type="speculation" ref="X1.129.1">or</cue> regions</xcope></xcope>.</sentence>
					<sentence id="S1.130">For example, "GSU2293 + downstream" means that the iORF consists of the gene GSU2293 and its downstream sequence.</sentence>
					<sentence id="S1.131"><xcope id="X1.131.1">HesB family was <cue type="negation" ref="X1.131.1">not</cue> clustered into one family</xcope>, because their sequences were too short and diverged.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S1.132"><xcope id="X1.132.1">Selenoprotein families we <cue type="negation" ref="X1.132.1">failed</cue> to detect</xcope> because of nonconserved location of stop codons</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.133"><xcope id="X1.133.1">Selenoprotein families we <cue type="negation" ref="X1.133.1">failed</cue> to detect</xcope> because of nonconserved location of stop codons.</sentence>
					<sentence id="S1.134">Selenocysteine residues of Peroxiredoxin-like protein families constitute homologous redox motifs (TXXU and UXXC), but their positions are different between two families.</sentence>
					<sentence id="S1.135">Columns are colored according to sequence conservation.</sentence>
					<sentence id="S1.136">Selenocysteine residues are shown in red, and the other residues in the redox motifs are shown in yellow.</sentence>
					<sentence id="S1.137">Prx; Peroxiredoxin, TPO; thiol:protein disulphide oxidereductase, Adeh; Anaeromyxobacter dehalogenans, Gmet; Geobacter metallireducens, Gsul; G. sulfurreducens, Dpsy; Desulfotalea psychrophila.</sentence>
					<sentence id="S1.138">The alignments were computed using ClustalW, and the figures were generated using Jalview.</sentence>
					<sentence id="S1.139">Of the four pyrrolysine proteins detected, three methylamine methyltransferases have been experimentally confirmed to contain pyrrolysine 617.</sentence>
					<sentence id="S1.140">The rest is a cluster of TetR-like transcriptional regulators from Methanosarcina acetivorans and M. barkeri.</sentence>
					<sentence id="S1.141">Since the genome annotation of M. acetivorans describes this protein as a gene containing an inframe amber codon, we classified it as a ''known'' candidate, although it is still <xcope id="X1.141.2"><cue type="speculation" ref="X1.141.2">unclear</cue> <xcope id="X1.141.1"><cue type="speculation" ref="X1.141.1">whether</cue> it really contains pyrrolysine</xcope></xcope>.</sentence>
					<sentence id="S1.142">The genome annotation of M. acetivorans also includes several amber-containing genes <xcope id="X1.142.1">that were <cue type="negation" ref="X1.142.1">absent</cue> from our prediction result</xcope>.</sentence>
					<sentence id="S1.143">They are a methlycobamide:CoM methylase and four transposases 18.</sentence>
					<sentence id="S1.144">The reason why <xcope id="X1.144.1">they were <cue type="negation" ref="X1.144.1">not</cue> detected</xcope> is that only one species in our dataset had an amber-containing form of these proteins.</sentence>
					<sentence id="S1.145">This is unavoidable because of the inability of our method to detect species-specific readthrough events.</sentence>
					<sentence id="S1.146">It is the price for reliably excluding pseudogenes and sequencing errors.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S1.147">Unknown candidates in the predicted clusters</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.148">The successful detection of many known proteins is encouraging, because our method relies only on general properties of proteins that contain stop-codon-encoded amino acids, but <xcope id="X1.148.1"><cue type="negation" ref="X1.148.1">not</cue> on specific features of selenocysteine or pyrrolysine</xcope>.</sentence>
					<sentence id="S1.149">Therefore, unknown clusters in our candidates have possibilities for the 23rd amino acid or novel readthrough proteins.</sentence>
					<sentence id="S1.150">There were two such clusters (Table 1).</sentence>
					<sentence id="S1.151">The first cluster is comprised of c-type cytochromes from &#948;-proteobacteria Geobacter sulfurreducens and G. metallireducens.</sentence>
					<sentence id="S1.152">The N-terminal part of the sequence contains five CXXCH heme-biding motifs, while the C-terminal part has <xcope id="X1.152.1"><cue type="negation" ref="X1.152.1">no</cue> similarity with any characterized proteins</xcope>.</sentence>
					<sentence id="S1.153">Homology search against unfinished microbial genomes identified seven homologous proteins from four other &#948;-proteobacteria species.</sentence>
					<sentence id="S1.154">Multiple sequence alignment of these sequences is shown in Figure 4a.</sentence>
					<sentence id="S1.155">Multiple sequence alignments of novel candidate proteins</sentence>
					<sentence id="S1.156">Multiple sequence alignments of novel candidate proteins.</sentence>
					<sentence id="S1.157">(a) A selenoprotein candidate from Geobacter sulfurreducens and its homologs.</sentence>
					<sentence id="S1.158">The <xcope id="X1.158.2"><cue type="speculation" ref="X1.158.2">possible</cue> selenocysteine residues</xcope> are shown in red, and <xcope id="X1.158.1"><cue type="speculation" ref="X1.158.1">putative</cue> heme-binding motifs</xcope> are underlined.</sentence>
					<sentence id="S1.159">Note that sequence conservation near the selenocysteine is comparable to that of the N-terminal cytochrome domain.</sentence>
					<sentence id="S1.160">A protein Dpro_2 contains yet another inframe stop codon (TAG) at the column 189.</sentence>
					<sentence id="S1.161">It will be <xcope id="X1.161.1"><cue type="speculation" ref="X1.161.1">either</cue> a sequencing error <cue type="speculation" ref="X1.161.1">or</cue> a pseudogene</xcope>.</sentence>
					<sentence id="S1.162">Gsul; G. sulfurreducens, Gmet; G. metallireducens, Gura; G. uraniumreducens, Gfrc; Geobacter sp. FRC-32, Dace; Desulfuromonas acetoxidans, Dpro; Delta proteobacterium MLMS-1.</sentence>
					<sentence id="S1.163">(b) Hypothetical proteins from Geobacter species.</sentence>
					<sentence id="S1.164">The inframe stop codons (TAG) are shown in red.</sentence>
					<sentence id="S1.165">This cluster is <xcope id="X1.165.1"><cue type="speculation" ref="X1.165.1">probably</cue> an artifact of close phylogenetic relationship</xcope>.</sentence>
					<sentence id="S1.166">We <xcope id="X1.166.2"><cue type="speculation" ref="X1.166.2">expect</cue> that this cluster <xcope id="X1.166.1"><cue type="speculation" ref="X1.166.1">may</cue> represent a novel selenoprotein family</xcope></xcope>.</sentence>
					<sentence id="S1.167">This is because the inframe stop codons of these proteins are exclusively TGA, and all of the above organisms possess selenocysteine insertion machinery <xcope id="X1.167.1">(data <cue type="negation" ref="X1.167.1">not</cue> shown)</xcope>.</sentence>
					<sentence id="S1.168">High conservation of residues near the inframe stop codon also <xcope id="X1.168.1"><cue type="speculation" ref="X1.168.1">suggests</cue> the importance of this region</xcope>.</sentence>
					<sentence id="S1.169">If they are true selenoproteins, this protein family becomes a rare instance of selenoprotein that <xcope id="X1.169.1"><cue type="negation" ref="X1.169.1">lacks</cue> non-selenocysteine homologs</xcope>.</sentence>
					<sentence id="S1.170">However, computational analysis of sequences immediately downstream of the inframe stop codons <xcope id="X1.170.1"><cue type="negation" ref="X1.170.1">failed</cue> to identify SECIS elements</xcope>, which is a hallmark of selenocysteine-containing genes.</sentence>
					<sentence id="S1.171">Therefore, yet another <xcope id="X1.171.1"><cue type="speculation" ref="X1.171.1">possibility</cue> is that they are a highly conserved operon</xcope>.</sentence>
					<sentence id="S1.172">An experimental verification is necessary to distinguish these two possibilities.</sentence>
					<sentence id="S1.173">The second cluster consists of two hypothetical proteins, again from G. sulfurreducens and G. metallireducens (Figure 4b).</sentence>
					<sentence id="S1.174">In contrast to the first cluster, <xcope id="X1.174.1"><cue type="negation" ref="X1.174.1">no</cue> homolog was identified from other species</xcope>.</sentence>
					<sentence id="S1.175">This cluster is <xcope id="X1.175.2"><cue type="speculation" ref="X1.175.2">probably</cue> a false positive and <xcope id="X1.175.1"><cue type="negation" ref="X1.175.1">not</cue> readthrough proteins</xcope></xcope>.</sentence>
					<sentence id="S1.176">This is because the residues near the inframe stop codons are poorly conserved.</sentence>
					<sentence id="S1.177">Moreover, the C-terminal extensions are quite short (about 20 aa).</sentence>
					<sentence id="S1.178">The sequence conservation in this region can be easily explained by the close phylogenetic relationship between the two species.</sentence>
					<sentence id="S1.179">In summary, although a possible selenoprotein was newly identified, there was <xcope id="X1.179.1"><cue type="negation" ref="X1.179.1">no</cue> promising candidate for an unknown amino acid encoded by a stop codon</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S1.180">Stop codon usage in the pre-filtering clusters</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.181"><xcope id="X1.181.1">The above negative result <cue type="speculation" ref="X1.181.1">could</cue> be explained if the filtering process, which is the final step of the prediction method (Figure 2), was too strict</xcope>.</sentence>
					<sentence id="S1.182">Although the raw output of the search for evolutionarily conserved iORFs was 273 clusters, most of them were discarded at the subsequent filtering stage.</sentence>
					<sentence id="S1.183">Because we have <xcope id="X1.183.1"><cue type="negation" ref="X1.183.1">no</cue> a priori knowledge about the 23rd amino acid</xcope>, cutoff thresholds for the filtering procedures were determined based on the known readthrough proteins.</sentence>
					<sentence id="S1.184">This is practically indispensable for objective classification of candidates, but there is <xcope id="X1.184.1"><cue type="speculation" ref="X1.184.1">no guarantee</cue> that unknown proteins with the 23rd amino acid will score higher than the thresholds</xcope>.</sentence>
					<sentence id="S1.185">To explore <xcope id="X1.185.1"><cue type="speculation" ref="X1.185.1">whether</cue> a number of good candidates lie below the thresholds</xcope>, the 273 clusters were analyzed in a way independent from filtering.</sentence>
					<sentence id="S1.186">If an organism has many readthrough proteins, proteins from the organism will frequently appear in the 273 clusters.</sentence>
					<sentence id="S1.187">Moreover, relative usage of the inframe stop codons will deviate from that of usual termination signals in the proteome.</sentence>
					<sentence id="S1.188">Figure 5 shows the discrepancies between relative usage of the inframe and C-terminal stop codons of 127 organisms in the pre-filtering clusters.</sentence>
					<sentence id="S1.189">Only seven organisms had statistically significant discrepancies (P &lt; 0.05), and all of them are known to utilize selenocysteine or pyrrolysine.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S1.190">Discrepancies of stop codon usages between the inframe and C-terminal stop codons</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.191">Discrepancies of stop codon usages between the inframe and C-terminal stop codons.</sentence>
					<sentence id="S1.192">The inframe stop codon usage is taken from the pre-filtering clusters, and the C-terminal usage is computed based on the annotated proteins of the organism.</sentence>
					<sentence id="S1.193">Red circle: an organism with pyrrolysine, blue; selenocysteine, yellow; both pyrrolysine and selenocysteine, white; <xcope id="X1.193.1"><cue type="negation" ref="X1.193.1">neither</cue> pyrrolysine <cue type="negation" ref="X1.193.1">nor</cue> selenocysteine</xcope>.</sentence>
					<sentence id="S1.194">The organisms are ordered by their discrepancy scores.</sentence>
					<sentence id="S1.195">The discrepancy score is the negative logarithm of a p-value of Fisher's exact test.</sentence>
					<sentence id="S1.196">The dotted line indicates significance level 0.05 after a correction for multiple testing.</sentence>
					<sentence id="S1.197">When top ten organisms were examined, only Gluconobacter oxydans was an organism <xcope id="X1.197.1"><cue type="negation" ref="X1.197.1">not</cue> known to have stop-codon-encoded amino acids</xcope>.</sentence>
					<sentence id="S1.198">An inspection of the G. oxydans iORFs in the 273 clusters revealed that their inframe stop codons are dominated by TAA, but all of them belong to a single protein cluster associated with transposable elements.</sentence>
					<sentence id="S1.199">Because it <xcope id="X1.199.4"><cue type="speculation" ref="X1.199.4">seems</cue> <xcope id="X1.199.3"><cue type="speculation" ref="X1.199.3">unlikely</cue> that an insertion system of novel amino acid evolves solely for transposable elements</xcope></xcope>, <xcope id="X1.199.1"><xcope id="X1.199.2">this organism <cue type="negation" ref="X1.199.1">cannot</cue> be <cue type="speculation" ref="X1.199.2">considered</cue> as a good candidate of the 23rd amino acid</xcope></xcope>.</sentence>
					<sentence id="S1.200">Sensitivity of this test is not high because many organisms that utilize selenocysteine were below the defined threshold.</sentence>
					<sentence id="S1.201">However, the result agrees with the filtering-dependent analysis that <xcope id="X1.201.1"><cue type="negation" ref="X1.201.1">no</cue> candidate of the novel stop-codon-encoded amino acid is detectable in the current dataset</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S1.202">Discussion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.203">As the number of completely sequenced genomes increases, several research groups started to predict proteins that contain stop-codon-encoded amino acids through computational analyses.</sentence>
					<sentence id="S1.204">Most of them are aimed at identification of selenoproteins, reflecting concerns from the scientific community and accumulated knowledge on selenocysteine.</sentence>
					<sentence id="S1.205">In order to improve prediction specificity, they have fully exploited the known features of selenocysteine, such as the SECIS elements or cysteine homologs, which have cysteine in place of selenocysteine.</sentence>
					<sentence id="S1.206">However, since the target of this study is the 23rd amino acid, and there is <xcope id="X1.206.1"><cue type="negation" ref="X1.206.1">no</cue> a priori knowledge</xcope>, only general properties of stop-codon-encoded amino acids can be used for prediction.</sentence>
					<sentence id="S1.207">Such general-purpose algorithms have also been developed to date.</sentence>
					<sentence id="S1.208">The method of Chaudhuri and Yeates 10 extracts iORFs from microbial genomes and analyzes sequence conservation around the inframe stop codon.</sentence>
					<sentence id="S1.209">Their method is thus similar to ours and applicable to both selenocysteine and pyrrolysine.</sentence>
					<sentence id="S1.210">Perrodou et al. 19 constructed a database of predicted recoding events in microbes.</sentence>
					<sentence id="S1.211">Their method is applicable not only to stop codon readthrough but also to frameshift.</sentence>
					<sentence id="S1.212">However, both of them did <xcope id="X1.212.1"><cue type="negation" ref="X1.212.1">not</cue> apply their methods to search for novel amino acids</xcope>.</sentence>
					<sentence id="S1.213">Therefore, <xcope id="X1.213.1">the question of the 23rd amino acid has <cue type="negation" ref="X1.213.1">not</cue> been investigated from the viewpoint of coding sequences</xcope>.</sentence>
					<sentence id="S1.214">Additionally, the previous methods <xcope id="X1.214.1"><cue type="negation" ref="X1.214.1">cannot</cue> effectively discriminate pseudogenes from readthrough genes</xcope>.</sentence>
					<sentence id="S1.215">For instance, Chaudhuri and Yeates reported a homolog of cobalamin biosynthesis protein CobN as a novel candidate of pyrrolysine protein.</sentence>
					<sentence id="S1.216">However, the gene is <xcope id="X1.216.2"><cue type="speculation" ref="X1.216.2">probably</cue> a pseudogene</xcope> because it contains an inframe TAA codon in addition to the TAG codon, and only <xcope id="X1.216.1">one species <cue type="speculation" ref="X1.216.1">seems</cue> to have the amber-containing form of the gene</xcope>.</sentence>
					<sentence id="S1.217">The previous methods also <xcope id="X1.217.2"><cue type="speculation" ref="X1.217.2">assume</cue> that proteins with stop-codon-encoded amino acids will have non-readthrough homologs (i.e., homologous proteins that do <xcope id="X1.217.1"><cue type="negation" ref="X1.217.1">not</cue> have inframe stop codons)</xcope></xcope>.</sentence>
					<sentence id="S1.218">However, that is not necessarily true.</sentence>
					<sentence id="S1.219">For example, pyrrolysine-containing monomethylamine methyltransferases adopt TIM barrel fold 6, but their primary sequences do <xcope id="X1.219.1"><cue type="negation" ref="X1.219.1">not</cue> exhibit detectable similarity to other TIM barrel proteins</xcope> because of evolutionary divergence.</sentence>
					<sentence id="S1.220">Dimethylamine methyltransferases also <xcope id="X1.220.1"><cue type="negation" ref="X1.220.1">lack</cue> non-readthrough homologs</xcope>.</sentence>
					<sentence id="S1.221">Yet another example is glycine reductase selenoprotein A.</sentence>
					<sentence id="S1.222">Only the selenocysteine-containing form of the enzyme is currently known 20.</sentence>
					<sentence id="S1.223">Therefore, it is important <xcope id="X1.223.1"><cue type="negation" ref="X1.223.1">not</cue> to assume non-readthrough homologs for exploring novel candidates</xcope>.</sentence>
					<sentence id="S1.224">If any non-readthrough homologs are registered in public sequence databases, a careful annotation process of a newly sequenced genome will be able to detect readthrough genes, even though <xcope id="X1.224.1">they <cue type="speculation" ref="X1.224.1">may</cue> be annotated as pseudogenes</xcope>.</sentence>
					<sentence id="S1.225">However, if all members of a gene family have stop codon readthrough, correct annotation of their gene structure will be extremely difficult, and all of them will be split into two distinct genes.</sentence>
					<sentence id="S1.226">The method reported here is unique in that it does <xcope id="X1.226.1"><cue type="negation" ref="X1.226.1">not</cue> assume non-readthrough homologs</xcope>.</sentence>
					<sentence id="S1.227">Using this method, a systematic screening of the 23rd amino acid and other readthrough genes was carried out.</sentence>
					<sentence id="S1.228">Many of the currently known selenoproteins and pyrrolysine proteins were recovered, indicating the effectiveness of this approach.</sentence>
					<sentence id="S1.229">In particular, successful detection of pyrrolysine-containing methyltransferases and selenoprotein A should be noted.</sentence>
					<sentence id="S1.230">However, almost no novel candidates for readthrough genes were predicted.</sentence>
					<sentence id="S1.231">What can be concluded from this result?</sentence>
					<sentence id="S1.232">The most <xcope id="X1.232.3"><cue type="speculation" ref="X1.232.3">likely</cue> explanation</xcope> is that <xcope id="X1.232.2">the 23rd amino acid does <xcope id="X1.232.1"><cue type="negation" ref="X1.232.1">not</cue> exist</xcope>, <cue type="speculation" ref="X1.232.2">or</cue> its distribution on the tree of life is rather limited</xcope>.</sentence>
					<sentence id="S1.233">Although a broad spectrum of taxonomic groups has been subjected to genome sequencing, the genomes of most microbial species on the earth have yet to be determined.</sentence>
					<sentence id="S1.234"><xcope id="X1.234.1">The unknown amino acid <cue type="speculation" ref="X1.234.1">may</cue> be used by these species</xcope>.</sentence>
					<sentence id="S1.235">Alternatively, only one organism in our dataset <xcope id="X1.235.1"><cue type="speculation" ref="X1.235.1">may</cue> have the 23rd amino acid</xcope>.</sentence>
					<sentence id="S1.236">This is because our method is limited to readthrough genes conserved across two or more species.</sentence>
					<sentence id="S1.237">If the novel amino acid appears in younger, non-conserved sequences, our technique will miss them.</sentence>
					<sentence id="S1.238">In either case, the distribution of the 23rd amino acid will be significantly narrower than that of selenocysteine, which has scattered but wide distribution 21.</sentence>
					<sentence id="S1.239">This conclusion coincides with and strengthens that of the previous research on tRNA 14.</sentence>
					<sentence id="S1.240">Yet another <xcope id="X1.240.2"><cue type="speculation" ref="X1.240.2">possibility</cue> is that the 23rd amino acid exists but is <xcope id="X1.240.1"><cue type="negation" ref="X1.240.1">not</cue> encoded by stop codons</xcope></xcope>.</sentence>
					<sentence id="S1.241">It is well known that the genetic code varies in several organisms 22.</sentence>
					<sentence id="S1.242">Thus, certain organisms <xcope id="X1.242.1"><cue type="speculation" ref="X1.242.1">may</cue> use one of the sense codons for the novel amino acid</xcope>.</sentence>
					<sentence id="S1.243">Because codons for most amino acids are degenerate, redefinition of <xcope id="X1.243.1">one of them is <cue type="speculation" ref="X1.243.1">feasible</cue></xcope>.</sentence>
					<sentence id="S1.244">However, that possibility is beyond the scope of this study and is left as an open problem.</sentence>
					<sentence id="S1.245">Bioinformatics analysis of unusual tRNA genes and codon usage <xcope id="X1.245.1"><cue type="speculation" ref="X1.245.1">may</cue> provide insights into this problem</xcope>.</sentence>
					<sentence id="S1.246">In addition to the 23rd amino acid, our method can simultaneously explore selenoproteins and other readthrough proteins.</sentence>
					<sentence id="S1.247">A common <xcope id="X1.247.1"><cue type="speculation" ref="X1.247.1">assumption</cue> in microbial selenoprotein predictions is that selenoproteins will have cysteine homologs</xcope>.</sentence>
					<sentence id="S1.248">Zhang et al. 20 examined the validity of this assumption using a SECIS-based method and concluded that selenoproteins <xcope id="X1.248.1"><cue type="negation" ref="X1.248.1">without</cue> cysteine homologs</xcope> will be extremely rare.</sentence>
					<sentence id="S1.249">Our method can reassess this assumption in a SECIS-independent way.</sentence>
					<sentence id="S1.250">Such selenoproteins identified through our screening of nearly 200 microbial genomes were selenoprotein A and only one uncertain candidate.</sentence>
					<sentence id="S1.251">Therefore, selenoproteins that <xcope id="X1.251.1"><cue type="negation" ref="X1.251.1">lack</cue> cysteine homologs</xcope> will be scarce, as previously reported.</sentence>
					<sentence id="S1.252">Other readthrough proteins with canonical amino acids (i.e., proteins that have canonical amino acids at their inframe stop codons) are quite rare in prokaryotes 1.</sentence>
					<sentence id="S1.253">The result reported here is in agreement, but it is <xcope id="X1.253.1"><cue type="negation" ref="X1.253.1">not</cue> conclusive</xcope>.</sentence>
					<sentence id="S1.254">This is because our method <xcope id="X1.254.3"><cue type="speculation" ref="X1.254.3">assumes</cue> that stop-codon-encoded amino acid is located inside a domain</xcope>, but it is <xcope id="X1.254.2"><cue type="speculation" ref="X1.254.2">unclear</cue> <xcope id="X1.254.1"><cue type="speculation" ref="X1.254.1">whether</cue> it holds true in prokaryotic readthrough with canonical amino acids</xcope></xcope>.</sentence>
					<sentence id="S1.255">At least, only one experimentally-confirmed example from a pathogenic strain of Escherichia coli 23, <xcope id="X1.255.2">whose genome is <cue type="negation" ref="X1.255.2">not</cue> yet determined</xcope>, does <xcope id="X1.255.1"><cue type="negation" ref="X1.255.1">not</cue> obey this rule</xcope>.</sentence>
					<sentence id="S1.256">What can be concluded from our result is that this type of readthrough will be located outside of domains, such as a linker between two domains.</sentence>
					<sentence id="S1.257">Such a stop codon <xcope id="X1.257.1"><cue type="speculation" ref="X1.257.1">may</cue> behave as a switch that regulates production of short and long isoforms from a single mRNA, as in readthrough genes from viruses</xcope> 24.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S1.258">Conclusion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.259">To explore the possibility of a 23rd amino acid, ORFs in prokaryotic genomes were investigated in a comprehensive way.</sentence>
					<sentence id="S1.260">Although many of the currently known selenoproteins and pyrrolysine proteins were successfully detected, <xcope id="X1.260.1"><cue type="negation" ref="X1.260.1">no</cue> candidate for the 23rd amino acid was discovered</xcope>.</sentence>
					<sentence id="S1.261">Therefore, if such an amino acid exists, it will have limited distribution in the tree of life.</sentence>
					<sentence id="S1.262">Alternatively, <xcope id="X1.262.1">it <cue type="speculation" ref="X1.262.1">may</cue> be encoded by one of the sense codons</xcope>.</sentence>
					<sentence id="S1.263">From the viewpoint of selenoprotein prediction, the sensitivity of our method was lower than an existing method.</sentence>
					<sentence id="S1.264">However, our method has several unique features.</sentence>
					<sentence id="S1.265">It is applicable to general readthrough genes and rigorously excludes pseudogenes and sequencing errors.</sentence>
					<sentence id="S1.266">Moreover, it does <xcope id="X1.266.1"><cue type="negation" ref="X1.266.1">not</cue> assume the occurrence of non-readthrough homologs in the public databases</xcope>.</sentence>
					<sentence id="S1.267">It will help in identification of novel readthrough genes from the rapidly expanding collection of complete microbial genomes.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S1.268">Methods</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S1.269">Enumeration of iORFs from prokaryotic genomes</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.270">A total of 328 complete genome sequences of prokaryotes were downloaded from the KEGG FTP site 25 in April 2006.</sentence>
					<sentence id="S1.271">From them, 191 representative organisms were selected by excluding close relatives.</sentence>
					<sentence id="S1.272">The threshold was set to average sequence identity 90% of two house-keeping genes, DNA polymerase III &#945; subunit and alanyl-tRNA synthetase.</sentence>
					<sentence id="S1.273">From these 191 genomes, iORFs longer than 80 codons were enumerated using inhouse software, which is available from the author's web site 26.</sentence>
					<sentence id="S1.274">Both upstream and downstream regions of its inframe stop codon were required to be longer than 10 codons.</sentence>
					<sentence id="S1.275">Two stop codons of an iORF (i.e. the inframe and C-terminal stops) can be any combination of canonical stop codons (TAA, TAG, TGA).</sentence>
					<sentence id="S1.276">However, for Mycoplasma, only TAA and TAG were used.</sentence>
					<sentence id="S1.277">Three codons ATG, TTG and GTG were allowed to be start signals.</sentence>
					<sentence id="S1.278">The iORFs of each organism were compared with protein-coding genes of the organism using BLASTX.</sentence>
					<sentence id="S1.279">If an iORF matched any protein-coding genes (E-value &lt; 10-3) and their reading frames did <xcope id="X1.279.1"><cue type="negation" ref="X1.279.1">not</cue> coincide</xcope>, the iORF was discarded.</sentence>
					<sentence id="S1.280">Similarly, iORFs were compared with RNA genes using BLASTN, and those matched with the RNAs were removed.</sentence>
					<sentence id="S1.281">Remaining iORFs were translated into amino acid sequences.</sentence>
					<sentence id="S1.282">We translated all three types of nonsense codons into the one-letter code U, so as to simplify visual inspection of sequence alignments.</sentence>
					<sentence id="S1.283">Although the code U is usually for selenocysteine, it will be harmless because U is automatically converted into &#215; inside the BLAST programs.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S1.284">Construction of clusters of conserved iORFs</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.285">To examine evolutionary conservation of the iORFs, a window of 101 residues around the inframe stop codon was extracted and subjected to TBLASTN searches against the above 191 genome sequences.</sentence>
					<sentence id="S1.286">If there were any hits (E-value &lt; 0.01) in other organisms, and if the hit includes 10 upstream and 10 downstream residues of the inframe stop codon, then the iORF was retained.</sentence>
					<sentence id="S1.287">However, if there were any hits (E-value &lt; 10-5) that did <xcope id="X1.287.1"><cue type="negation" ref="X1.287.1">not</cue> cover the inframe stop codon</xcope>, the iORF was discarded.</sentence>
					<sentence id="S1.288">Eligible iORFs were then clustered using BLASTCLUST with score density 0.5 and minimum length coverage 0.6.</sentence>
					<sentence id="S1.289">After removing singleton clusters, multiple sequence alignments of the remaining clusters were computed using MAFFT 27 with the L-INS-i option.</sentence>
					<sentence id="S1.290">Subsequently, conservation of the inframe stop codons in each cluster was examined.</sentence>
					<sentence id="S1.291">If the location or type of stop codons was <xcope id="X1.291.1"><cue type="negation" ref="X1.291.1">not</cue> identical</xcope>, the cluster was discarded.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S1.292">Three-step filtering of the candidate clusters</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.293">The first filter examines protein-likeliness of the iORFs.</sentence>
					<sentence id="S1.294">This filter is mainly designed to remove conserved non-coding sequences (CNS) immediately downstream of non-readthrough genes.</sentence>
					<sentence id="S1.295">If we measure purifying selection for amino acid sequences by the ratio of nonsynonymous to synonymous substitution rates (dN/dS), a protein with a stop-codon-encoded amino acid will indicate the sign of selection, while CNS will <xcope id="X1.295.1"><cue type="negation" ref="X1.295.1">not</cue></xcope>.</sentence>
					<sentence id="S1.296">The dN/dS was calculated for each of the two parts flanking the inframe stop codon in an iORF using codeml program in the PAML package 28.</sentence>
					<sentence id="S1.297">Statistical significance was estimated by likelihood ratio test 29.</sentence>
					<sentence id="S1.298">The observed alignment was fitted to two distinct substitution models, one of which estimates dN/dS from the data, and the other fixes it to 1.0.</sentence>
					<sentence id="S1.299">Let lfree and lfix denote log likelihood of these models.</sentence>
					<sentence id="S1.300">Then, 2&#916;l = 2(lfree &#8211; lfix) approximately follows the &#967;2 distribution with one degree of freedom.</sentence>
					<sentence id="S1.301">If dN/dS was less than 1.0, and the statistics 2&#916;l was larger than a threshold, we regard it as a sign of purifying selection.</sentence>
					<sentence id="S1.302">In this study, the threshold was set to 5.0 (corresponds to P &lt; 0.025) so that the known readthrough proteins score higher than the threshold.</sentence>
					<sentence id="S1.303">For each of the above clusters, an all-against-all comparison of cluster members was performed.</sentence>
					<sentence id="S1.304">If any pair exhibits such signals in both the N- and C-terminal parts, the cluster was retained.</sentence>
					<sentence id="S1.305">Even if both the upstream and downstream regions of the inframe stop codon code proteins, they <xcope id="X1.305.2"><cue type="speculation" ref="X1.305.2">may</cue> be two adjacent genes <xcope id="X1.305.1"><cue type="negation" ref="X1.305.1">instead of</cue> a readthrough protein</xcope></xcope>.</sentence>
					<sentence id="S1.306">The second and third filtering processes remove such genes based on BLAST alignment patterns.</sentence>
					<sentence id="S1.307">Although the boundary analysis applied previously has the same goal (Figure 1b), some gene pairs escaped elimination.</sentence>
					<sentence id="S1.308">To enhance sensitivity of the filters, the whole length of an iORF was used as a BLAST query <xcope id="X1.308.1"><cue type="negation" ref="X1.308.1">instead of</cue> the partial sequence</xcope>, and the size of the BLAST database was increased from the 191 nonredundant genomes to the 328 complete genomes in GenomeNet and 246 draft genome sequences downloaded from GenBank in May 2006.</sentence>
					<sentence id="S1.309">The second filter inspects synteny of iORFs.</sentence>
					<sentence id="S1.310">If the N- and C-terminal parts of an iORF have distinct but closely arranged BLAST hits in other genomes, it strongly <xcope id="X1.310.1"><cue type="speculation" ref="X1.310.1">suggests</cue> the iORF is actually two adjacent genes</xcope>.</sentence>
					<sentence id="S1.311">Translated sequences of iORFs in the pre-filtering clusters were subjected to TBLASTN searches against the genome database.</sentence>
					<sentence id="S1.312">If both the best hits of the N- and C-terminal parts are statistically significant (E-value &lt; 10-5), and distance between them is less than 1 kbp, we call these hits 'syntenic hits'.</sentence>
					<sentence id="S1.313">If any syntenic hits with non-coinciding reading frames were found, the cluster was removed.</sentence>
					<sentence id="S1.314">The third filter uses co-occurrence of residues around the inframe stop codon as another source of information for screening stop codon readthrough.</sentence>
					<sentence id="S1.315">Suppose a window of 21 residues centered at the inframe stop codon.</sentence>
					<sentence id="S1.316">In prokaryotes, most stop-codon-encoded amino acids are located inside a domain, the unit of evolutionary sequence conservation.</sentence>
					<sentence id="S1.317">Therefore, in an ideal situation the presence or <xcope id="X1.317.1"><cue type="negation" ref="X1.317.1">absence</cue> of the 21 residues in alignments</xcope> will be synchronized.</sentence>
					<sentence id="S1.318">In contrast, if the iORF is actually two adjacent genes, then upstream and downstream residues of the stop codon will appear separately in many alignments.</sentence>
					<sentence id="S1.319">We defined a co-occurrence matrix as a 21 &#215; 21 matrix whose (i,j)-th element represents how often residue i and j appeared simultaneously in N alignments.</sentence>
					<sentence id="S1.320">The matrix elements were subsequently normalized to the number of alignments N.</sentence>
					<sentence id="S1.321">By definition, the more often the upstream and downstream residues of the inframe stop codon co-occur in the alignments, the higher the density in the upper right quarter of the matrix.</sentence>
					<sentence id="S1.322">If average density in the quarter was lower than 0.85, the cluster was filtered out.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S1.323">Stop codon usage</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S1.324">For each organism, its iORFs were extracted from the pre-filtering clusters, and codon usage at the inframe stop positions was counted.</sentence>
					<sentence id="S1.325">Codon usage at the C-terminal stop codons in its proteome was also computed using data of coding sequences downloaded from KEGG GENES 25.</sentence>
					<sentence id="S1.326">These data were combined into a 3 &#215; 2 matrix, and Fisher's exact test was applied.</sentence>
					<sentence id="S1.327">The p-value was corrected for multiple testing using the Bonferroni correction because there were 127 organisms in the pre-filtering clusters.</sentence>
				</DocumentPart>
		</Document>
		<Document type="Biological_full_article">
			<DocID type="BMC_ID">1471-2105-8-239</DocID>
				<DocumentPart type="Title">
					<sentence id="S2.1">Probabilistic prediction and ranking of human protein-protein interactions</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.2">Abstract</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S2.3">Background</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.4">Although the prediction of protein-protein interactions has been extensively investigated for yeast, few such datasets exist for the far larger proteome in human.</sentence>
					<sentence id="S2.5">Furthermore, it has recently been <xcope id="X2.5.1"><cue type="speculation" ref="X2.5.1">estimated</cue> that the overall average false positive rate of available computational and high-throughput experimental interaction datasets is as high as 90%</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S2.6">Results</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.7">The prediction of human protein-protein interactions was investigated by combining orthogonal protein features within a probabilistic framework.</sentence>
					<sentence id="S2.8">The features include co-expression, orthology to known interacting proteins and the full-Bayesian combination of subcellular localization, co-occurrence of domains and post-translational modifications.</sentence>
					<sentence id="S2.9">A novel scoring function for local network topology was also investigated.</sentence>
					<sentence id="S2.10">This topology feature greatly enhanced the predictions and together with the full-Bayes combined features, made the largest contribution to the predictions.</sentence>
					<sentence id="S2.11">Using a conservative threshold, our most accurate predictor identifies 37606 human interactions, 32892 (80%) of which are <xcope id="X2.11.1"><cue type="negation" ref="X2.11.1">not</cue> present in other publicly available large human interaction datasets</xcope>, thus substantially increasing the coverage of the human interaction map.</sentence>
					<sentence id="S2.12">A subset of the 32892 novel predicted interactions have been independently validated.</sentence>
					<sentence id="S2.13">Comparison of the prediction dataset to other available human interaction datasets <xcope id="X2.13.1"><cue type="speculation" ref="X2.13.1">estimates</cue> the false positive rate of the new method to be below 80% which is competitive with other methods</xcope>.</sentence>
					<sentence id="S2.14">Since the new method scores and ranks all human protein pairs, smaller subsets of higher quality can be generated thus leading to even lower false positive prediction rates.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S2.15">Conclusion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.16">The set of interactions predicted in this work increases the coverage of the human interaction map and will help determine the highest confidence human interactions.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S2.17">Background</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.18">Protein-protein interactions perform and regulate fundamental cellular processes.</sentence>
					<sentence id="S2.19">The comprehensive study of such interactions on a genome-wide scale will lead to a clearer understanding of diverse cellular processes and of the molecular mechanisms of disease.</sentence>
					<sentence id="S2.20">Although the determination of interactions by small-scale laboratory techniques is impractical for a complete proteome on the grounds of cost and time, several experimental techniques now exist to determine protein-protein interactions in a high-throughput manner 1.</sentence>
					<sentence id="S2.21">High-throughput datasets have been generated for model organisms such as yeast 23456, worm 7 and fly 89 as well as Escherichia coli 10.</sentence>
					<sentence id="S2.22">In addition, the first broad-focus experimental datasets for the human interactome have recently been published 1112.</sentence>
					<sentence id="S2.23"><xcope id="X2.23.1">Interactions determined by high-throughput methods are generally <cue type="speculation" ref="X2.23.1">considered</cue> to be less reliable than those obtained by low-throughput studies</xcope> 1314 and as a consequence efforts are also underway to extract evidence for interactions from the literature 151617.</sentence>
					<sentence id="S2.24">Analysis of the high-throughput datasets has shown that they overlap very little with each other, <xcope id="X2.24.1"><cue type="speculation" ref="X2.24.1">suggesting</cue> that their coverage is low</xcope>.</sentence>
					<sentence id="S2.25">Indeed, it has been <xcope id="X2.25.1"><cue type="speculation" ref="X2.25.1">estimated</cue> recently that the current yeast and human protein interaction maps are only 50% and 10% complete, respectively</xcope> 18.</sentence>
					<sentence id="S2.26">The low coverage and variable quality of the experimental interaction datasets have prompted many groups to investigate computational methods to predict interactions or to determine the most likely interactions seen in the high-throughput datasets.</sentence>
					<sentence id="S2.27">The different approaches to predict interactions can be grouped into five main categories:</sentence>
					<sentence id="S2.28">1) Predictors based on sequence and structure exploit the observation that some pairs of sequence motifs, domains and structural families tend to interact preferentially.</sentence>
					<sentence id="S2.29">Some methods predict interaction from sequence-motifs found to be over-represented in interacting protein pairs 19, or by considering the physico-chemical properties and the location of groups of amino acids in the sequence 2021.</sentence>
					<sentence id="S2.30">Others investigate the co-occurrence in interacting proteins of specific protein domains or their structural family classification 2223.</sentence>
					<sentence id="S2.31">When three-dimensional structures are available for <xcope id="X2.31.2">both proteins <cue type="speculation" ref="X2.31.2">thought</cue> to interact</xcope>, <xcope id="X2.31.1">high quality predictions and additional information such as the residues involved in the interaction and their binding affinity <cue type="speculation" ref="X2.31.1">may</cue> be estimated</xcope> (reviewed in 24).</sentence>
					<sentence id="S2.32">Similarly, when two proteins show clear sequence similarity to proteins that exist in a complex for which the three-dimensional structure is known, <xcope id="X2.32.1">detailed predictions of the atomic-level interactions <cue type="speculation" ref="X2.32.1">may</cue> be made</xcope>.</sentence>
					<sentence id="S2.33">For example, the major complexes in yeast have been predicted by this strategy 25.</sentence>
					<sentence id="S2.34">2) Predictors based on comparative genomics have been exploited primarily in prokaryotes.</sentence>
					<sentence id="S2.35">They consider the physical location of genes, as well as their pattern of occurrence and evolutionary rate, to predict interactions or functional relationships between protein pairs.</sentence>
					<sentence id="S2.36">Some predictors make use of the observation that <xcope id="X2.36.1">neighboring genes whose relative location is conserved across several prokaryotic organisms are <cue type="speculation" ref="X2.36.1">likely</cue> to interact</xcope> 26.</sentence>
					<sentence id="S2.37">Other predictors exploit the observation that <xcope id="X2.37.1">gene pairs that co-occur in related species or that co-evolve also tend to be more <cue type="speculation" ref="X2.37.1">likely</cue> to interact</xcope> 27282930.</sentence>
					<sentence id="S2.38">In addition, domains that exist as separate proteins in some genomes but are also seen fused in a single protein in other genomes have been used to <xcope id="X2.38.2"><cue type="speculation" ref="X2.38.2">suggest</cue> the isolated domains <xcope id="X2.38.1"><cue type="speculation" ref="X2.38.1">may</cue> interact</xcope></xcope> 3132.</sentence>
					<sentence id="S2.39">3) Predictors based on orthology work on the <xcope id="X2.39.1"><cue type="speculation" ref="X2.39.1">assumption</cue> that the orthologs of a protein pair that are known to interact in one organism will also interact</xcope>.</sentence>
					<sentence id="S2.40">Such relationships are often referred to as interologs 33.</sentence>
					<sentence id="S2.41">For example, at BLAST e-values below 10-10, it has been shown that 16&#8211;30% of yeast interactions can be transferred to the worm 34 while further studies have <xcope id="X2.41.1"><cue type="speculation" ref="X2.41.1">estimated</cue> that a joint e-value below 10-70 is required to transfer interactions reliably between organisms</xcope> 35.</sentence>
					<sentence id="S2.42">Interologs have been used to predict protein-protein interactions in human 36.</sentence>
					<sentence id="S2.43">4) Predictors based on functional features exploit non-sequence information to infer interactions.</sentence>
					<sentence id="S2.44">Some predictors exploit the observation that there is a significant correlation in the expression levels of transcripts encoding proteins that interact 37.</sentence>
					<sentence id="S2.45">Since proteins must be co-localized in order to interact, protein subcellular localization has often been used to assess the quality of interaction datasets 3839.</sentence>
					<sentence id="S2.46">Similarly, interacting proteins are also often involved in similar cellular processes, so Gene Ontology "process" and "function" annotations have been exploited to predict interactions and validate high-throughput datasets 163638.</sentence>
					<sentence id="S2.47">5) Predictors have exploited similarities in the network topology of known interaction datasets to predict novel interactions.</sentence>
					<sentence id="S2.48">In one study, the local topology of small-world networks has been used to assess the quality of interaction datasets and predict novel interactions 40 while Gerstein and colleagues have investigated the prediction of interactions by the identification of missing edges in almost fully connected complexes 41.</sentence>
					<sentence id="S2.49">In addition to these diverse approaches, some groups have combined concepts from several of the above categories in integrative frameworks.</sentence>
					<sentence id="S2.50">The first such predictor integrated co-expression data, co-essentiality as well as biological function in a na&#239;ve Bayes network to provide proteome-wide de novo prediction of yeast protein interactions 37.</sentence>
					<sentence id="S2.51">Subsequently, the combination of many more diverse features was investigated using different frameworks to predict yeast protein-protein interactions, increasing the prediction accuracy and allowing an assessment of the limits of genomic integration 424344.</sentence>
					<sentence id="S2.52">The integration of diverse genomic features has also been useful in the investigation of the related but broader problem of predicting protein-protein associations as well as complex and pathway membership (see for example 45).</sentence>
					<sentence id="S2.53">Although, many computational methods have investigated the prediction of protein-protein interactions, few have so far been applied to the human proteome.</sentence>
					<sentence id="S2.54">The first large-scale prediction of the human interactome map involved transferring interactions from model organisms 36.</sentence>
					<sentence id="S2.55">This resulted in over 70000 predicted physical interactions involving approximately 6200 human proteins.</sentence>
					<sentence id="S2.56">A second method integrated expression data, orthology, protein domain data and functional annotations into a probabilistic framework and resulted in the <xcope id="X2.56.1"><cue type="speculation" ref="X2.56.1">prediction</cue> of nearly 40000 human protein interactions</xcope> 46.</sentence>
					<sentence id="S2.57">It has recently been estimated that the false-positive rates of these computational datasets as well as of available high-throughput human interaction datasets are, on average, as high as 90% and their coverage is only approximately 10%, <xcope id="X2.57.1"><cue type="speculation" ref="X2.57.1">indicating that</cue> more such efforts are needed to increase the coverage and confidence we have in current maps of the human interactome</xcope> 18.</sentence>
					<sentence id="S2.58">In this paper, the prediction of physical interactions between human proteins has been investigated by integrating in a Bayesian framework several different pieces of evidence including orthology, functional features and local network topology.</sentence>
					<sentence id="S2.59">In order to increase the accuracy and coverage of the predictions, different types of negative data (non-interacting protein pairs) were explored to train the predictor.</sentence>
					<sentence id="S2.60">The most accurate of the predictors was then used to assess the likelihood of pair-wise interaction for over 20000 human proteins from the IPI (International Protein Index) database.</sentence>
					<sentence id="S2.61">These predictions provide a <xcope id="X2.61.2"><cue type="speculation" ref="X2.61.2">likelihood</cue> of interaction for over 260 million human protein pairs</xcope> and lead to the <xcope id="X2.61.1"><cue type="speculation" ref="X2.61.1">prediction</cue> of over 37000 human interactions</xcope>.</sentence>
					<sentence id="S2.62">They <xcope id="X2.62.1"><cue type="speculation" ref="X2.62.1">should</cue> thus augment current knowledge of the human interactome as well as the understanding of the relationship between distinct cellular processes</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S2.63">Results and discussion</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.64">Architecture of the predictor and training of the modules</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.65">The prediction of human protein-protein interactions was investigated in a Bayesian framework by considering combinations of individual protein features known to be indicative of interaction.</sentence>
					<sentence id="S2.66">The seven individual features considered are summarized in Table 1 and detailed in the Methods section.</sentence>
					<sentence id="S2.67">As indicated in Table 1, the different features were grouped into five distinct modules: Expression (E), Orthology (O), Combined (C), Disorder (D) and Transitive (T).</sentence>
					<sentence id="S2.68">Figure 1 illustrates the training scheme and architecture of the method.</sentence>
					<sentence id="S2.69">The Expression, Orthology, Combined and Disorder modules can calculate likelihood ratios (LR) of interaction independently and are referred to as the Group A modules (Figure 1A).</sentence>
					<sentence id="S2.70">The product of their likelihood ratios is referred to as the Preliminary Score.</sentence>
					<sentence id="S2.71">The Transitive module considers the local topology of the network predicted by the group A modules and thus requires the completion of their analysis to calculate its own likelihood ratios of interaction (Figure 1B).</sentence>
					<sentence id="S2.72">As such, all combinations of the Group A modules can be used to predict interaction in the presence or <xcope id="X2.72.1"><cue type="negation" ref="X2.72.1">absence</cue> of the Transitive module</xcope>.</sentence>
					<sentence id="S2.73">In the <xcope id="X2.73.1"><cue type="negation" ref="X2.73.1">absence</cue> of the Transitive module</xcope>, the Preliminary Score is used as the final likelihood ratio output by the predictor.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S2.74">Features considered in the prediction of interactions for each module</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S2.75">Architecture of the predictor and likelihoods of the modules</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.76">Architecture of the predictor and likelihoods of the modules.</sentence>
					<sentence id="S2.77">The predictor consists of two different parts (A and B) which are trained consecutively.</sentence>
					<sentence id="S2.78">The Group A modules (shown in panel A) are trained in parallel.</sentence>
					<sentence id="S2.79">The likelihood ratios (LR) for most of their states are shown in panel A (their complete likelihood ratios are available in Additional File 4).</sentence>
					<sentence id="S2.80">The product of the likelihood ratios of all Group A modules considered in a given prediction is referred to as the preliminary score (PS) and can be calculated for all human protein pairs.</sentence>
					<sentence id="S2.81">If <xcope id="X2.81.1">the Transitive module is <cue type="negation" ref="X2.81.1">not</cue> considered</xcope>, the final likelihood ratios assigned to all protein pairs is the preliminary score (PS).</sentence>
					<sentence id="S2.82">If the Transitive module is considered, the local topology of the network determined by the assignment of preliminary scores to all protein pairs considered in the training set is used to calculate the likelihood ratios for the transitive module (shown in panel B) for every protein pair in the training set.</sentence>
					<sentence id="S2.83">The final likelihood ratio is then the product of the preliminary score calculated in panel A and the likelihood ratio output by the transitive module in panel B.</sentence>
					<sentence id="S2.84">For the Orthology module: YL, YM, YH: yeast low, medium and high scoring bins; FL, FM, FH: fly low, medium and high scoring bins; WL, WM, WH: worm low, medium and high scoring bins; HM and HL: medium and low scoring bins for human protein pairs that have human paralogs; &gt; 1 organism: bin for human protein pairs that have interologs in more than one organism.</sentence>
					<sentence id="S2.85">For the Combined module, &#8211;&#8212; refers to the lowest scoring bin (for the domain (Dom), post-translational modification (PTM) and subcellular localization (Loc) features), &#8211; refers to the second lowest scoring bin and +, ++, +++ refer respectively to the third highest, second highest and highest scoring bins.</sentence>
					<sentence id="S2.86">The likelihood ratios of interaction are evaluated for each module by considering the relative proportions of positive and negative training examples that have a specific state (i.e. that fall in a particular bin of a module).</sentence>
					<sentence id="S2.87">The datasets used to train the predictor consisted of 26896 known human protein interactions extracted from the Human Protein Reference Database (HPRD) 15 and approximately 100 times more randomly chosen protein pairs used as negative examples.</sentence>
					<sentence id="S2.88">The composition of the datasets and likelihood ratio calculations are explained in greater detail in the Methods section.</sentence>
					<sentence id="S2.89">Once the final likelihood ratio of interaction (LRfinal) is calculated for a given protein pair as shown in Figure 1B, it is <xcope id="X2.89.1"><cue type="speculation" ref="X2.89.1">possible</cue> to estimate the posterior odds ratio of interaction by multiplying the final likelihood ratio by the prior odds ratio of interaction</xcope>.</sentence>
					<sentence id="S2.90"><xcope id="X2.90.2">Protein pairs that have a posterior odds of interaction above 1 are more <cue type="speculation" ref="X2.90.2">likely</cue> to interact than <xcope id="X2.90.1"><cue type="negation" ref="X2.90.1">not</cue> to interact</xcope></xcope>, thus providing an obvious threshold to predict interacting proteins.</sentence>
					<sentence id="S2.91">Estimates for the prior odds ratio of interaction vary.</sentence>
					<sentence id="S2.92">Previous interaction studies on yeast and human use prior odds ratios that range from 1/600 to &gt; 1/400 37434647.</sentence>
					<sentence id="S2.93">The evaluation of this ratio is difficult because not all true interactions are known.</sentence>
					<sentence id="S2.94">As detailed in Methods, the prior odds ratio for human protein interaction was explored by considering different versions and subsets of human interaction datasets.</sentence>
					<sentence id="S2.95">This <xcope id="X2.95.1"><cue type="speculation" ref="X2.95.1">suggested</cue> that there is insufficient data currently available to determine a reliable ratio for human</xcope>.</sentence>
					<sentence id="S2.96">Accordingly, we selected a prior odds ratio of interaction of 1/400 which is similar to current estimates for yeast and is <xcope id="X2.96.1"><cue type="speculation" ref="X2.96.1">probably</cue> still quite conservative</xcope>.</sentence>
					<sentence id="S2.97">Thus, the likelihood ratio threshold to predict interactions is 400.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.98">Likelihood ratios of the modules</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.99">Figure 1 summarizes the likelihood ratios computed for the five modules.</sentence>
					<sentence id="S2.100">The different modules differ in the range of likelihood ratio values achieved by their different states.</sentence>
					<sentence id="S2.101">The Orthology and Combined modules both have states that achieve likelihood ratios above 400 (as high as 1207 for the Orthology module and 613 for the Combined module), <xcope id="X2.101.2"><cue type="speculation" ref="X2.101.2">indicating that</cue> both these modules <xcope id="X2.101.1"><cue type="speculation" ref="X2.101.1">can</cue>, on their own, predict some interacting protein pairs with a posterior odds ratio above 1</xcope></xcope>.</sentence>
					<sentence id="S2.102">The Expression module follows trends seen in previous studies with increasing likelihood ratios of interaction reflecting increasing expression correlation 3746.</sentence>
					<sentence id="S2.103">However, since the highest likelihood ratio for the expression datasets that we consider is 33, they are <xcope id="X2.103.1"><cue type="negation" ref="X2.103.1">not</cue> sufficient on their own to predict interacting protein pairs with a posterior odds ratio above 1</xcope>.</sentence>
					<sentence id="S2.104">Similarly, but in a much more pronounced way, the Disorder module is only slightly predictive of interaction, with a maximum likelihood ratio of 1.8.</sentence>
					<sentence id="S2.105">Most states of the Orthology module achieve higher likelihood ratios than the highest obtained by the Expression and Disorder modules.</sentence>
					<sentence id="S2.106">This is <xcope id="X2.106.1"><cue type="negation" ref="X2.106.1">not</cue> surprising</xcope> as the transfer of interacting orthologs (known as interologs 33) from one organism to another is a popular method to predict interactions (see for example 3448), particularly in the case of organisms like human for which only a small proportion of interactions are known.</sentence>
					<sentence id="S2.107">The direct transfer of interactions to human from either yeast, fly or worm does not alone result in a posterior odds ratio above 1 (as the likelihood ratios of interaction for all yeast, fly and worm bins in the Orthology module are below 400).</sentence>
					<sentence id="S2.108">This is <xcope id="X2.108.2"><cue type="negation" ref="X2.108.2">not</cue> surprising</xcope> as previous studies have <xcope id="X2.108.1"><cue type="speculation" ref="X2.108.1">indicated that</cue> quite stringent joint E-values must be used to transfer interactions safely between organisms</xcope> 3435.</sentence>
					<sentence id="S2.109">In contrast, the consideration of human interactions paralogous to the human protein pairs under investigation results in likelihood ratios of 431 and 1034 (depending on how close the paralogs are as described in Methods) which is much higher than those obtained for any single model organism.</sentence>
					<sentence id="S2.110">This agrees with a recent report that <xcope id="X2.110.1"><cue type="speculation" ref="X2.110.1">suggested</cue> protein-protein interactions are more conserved within species than across species</xcope> 49.</sentence>
					<sentence id="S2.111">The Combined module uses domain co-occurrence, post-translational modification (PTM) co-occurrence and subcellular localization information to predict interaction.</sentence>
					<sentence id="S2.112">These features were originally investigated separately, as shown in Figure 3, but their combination into one module that considers all dependencies between them achieves higher accuracy <xcope id="X2.112.1">(data <cue type="negation" ref="X2.112.1">not</cue> shown)</xcope> and higher likelihood ratios (as can be seen by comparing to Figure 1) while still being computationally feasible.</sentence>
					<sentence id="S2.113">Additionally, this combination circumvents <xcope id="X2.113.1"><cue type="speculation" ref="X2.113.1">possible</cue> problems of dependence between these features</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S2.114">Likelihood ratios of the features that form the Combined module, considered separately</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.115">Likelihood ratios of the features that form the Combined module, considered separately.</sentence>
					<sentence id="S2.116">The Combined module considers simultaneously three distinct features: the co-occurrence of both domains and PTMs as well as the subcellular co-localization of proteins.</sentence>
					<sentence id="S2.117">Here the likelihood ratios of these features considered separately are shown.</sentence>
					<sentence id="S2.118">In panel A, all domain pairs considered were given scores and likelihood ratios were estimated for different values of these scores.</sentence>
					<sentence id="S2.119">Similarly, shown in panel B are the likelihood ratios for different values of PTM co-occurrence scores.</sentence>
					<sentence id="S2.120">Panel C shows the likelihood ratios for protein pairs localized to different sets of cellular compartments.</sentence>
					<sentence id="S2.121">Previous methods have investigated the use of co-occurring domains to predict interaction (see for example 2346).</sentence>
					<sentence id="S2.122">Many pairs of domains co-occur in proteins known to interact.</sentence>
					<sentence id="S2.123">When investigated as a separate feature, the chi-square score of co-occurrence of domain pairs correlates well with the likelihood of interaction of protein pairs that contain these domains, with the highest chi-square score bin obtaining a likelihood ratio of 14, as shown in Figure 3A.</sentence>
					<sentence id="S2.124">Similarly, the co-occurrence of PTMs is also predictive of interaction, with its highest scoring bin obtaining a likelihood ratio of 6 as shown in Figure 3B.</sentence>
					<sentence id="S2.125">Lists of high scoring domain pairs and PTM pairs are shown in Additional Files 1 and 2.</sentence>
					<sentence id="S2.126">Subcellular localization has been extensively used both to assess the quality of interaction datasets 115051 and to generate examples of non-interacting protein pairs to use as negative datasets when training and testing predictors 3746.</sentence>
					<sentence id="S2.127">In the present study, the use of localization was investigated as a feature predictive of interaction.</sentence>
					<sentence id="S2.128">Four <xcope id="X2.128.2"><cue type="speculation" ref="X2.128.2">possible</cue> localization states</xcope> were considered for protein pairs:: same compartment, neighboring compartments, different non-neighboring compartments and <xcope id="X2.128.1"><cue type="negation" ref="X2.128.1">absence</cue> of localization annotation</xcope> (more details are given in the Methods section).</sentence>
					<sentence id="S2.129">As shown in Figure 3C, the likelihood ratio of same compartment protein pairs was found to be twice as high as that of randomly chosen or non-annotated protein pairs whereas different non-neighboring protein pairs are more than three times less likely to interact than random protein pairs.</sentence>
					<sentence id="S2.130">Individual localization features achieve low interaction likelihood ratios.</sentence>
					<sentence id="S2.131">However, when integrated into the Combined module, domain, PTM and localization information together achieve likelihood ratios that are high enough to predict interaction on their own (i.e. above 400).</sentence>
					<sentence id="S2.132">As expected, the highest likelihood ratio bins for the Combined module are those representing the highest combinations of the three features separately.</sentence>
					<sentence id="S2.133">The transitive module enhances the preliminary likelihood score (PS) (calculated using the group A modules) by considering the local topology of the resulting network which is assessed using the neighborhood topology score as detailed in the Methods section.</sentence>
					<sentence id="S2.134">The likelihood ratios for different values of the neighborhood topology score are shown in Figure 1B.</sentence>
					<sentence id="S2.135">The Transitive module is highly predictive of interaction and achieves likelihood ratios as high as 229.</sentence>
					<sentence id="S2.136"><xcope id="X2.136.1">This module <cue type="negation" ref="X2.136.1">cannot</cue> be used alone</xcope> as it requires as input the output of at least one group A module.</sentence>
					<sentence id="S2.137">However, it can <xcope id="X2.137.1"><cue type="speculation" ref="X2.137.1">predict</cue> interacting protein pairs with a posterior odds ratio above 1.0 when used in combination with any single module in group A</xcope> (as the product of the highest likelihood ratios of the transitive module and any group A module is greater than 400 as can be seen from Figure 1).</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.138">Independence of the modules</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.139">The final likelihood ratio output by the predictor is only representative of the true likelihood of interaction of a protein pair if the modules considered are independent.</sentence>
					<sentence id="S2.140">If the modules were <xcope id="X2.140.3"><cue type="negation" ref="X2.140.3">not</cue> independent</xcope>, <xcope id="X2.140.1"><xcope id="X2.140.2">some likelihood ratios <cue type="speculation" ref="X2.140.1">would</cue> <cue type="speculation" ref="X2.140.1">likely</cue> be overestimated, particularly for protein pairs that achieve simultaneously high likelihoods for non-independent features</xcope></xcope>.</sentence>
					<sentence id="S2.141">Conversely, <xcope id="X2.141.1">some likelihood ratios <cue type="speculation" ref="X2.141.1">would</cue> be underestimated for protein pairs achieving simultaneously low likelihoods for non-independent features</xcope>.</sentence>
					<sentence id="S2.142">Previous studies have demonstrated that some of the features considered here are indeed independent 43.</sentence>
					<sentence id="S2.143">Independence of all modules used in our predictor was verified by calculating Pearson correlation coefficients for all pairs of modules.</sentence>
					<sentence id="S2.144">As shown in Table 2, all modules considered are independent, since the highest Pearson correlation coefficients computed are well below any value considered significant.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S2.145">Pairwise Pearson correlation for all modules</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.146">Accuracy of the predictors</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.147">All combinations of modules were examined to determine which of the resulting predictors achieved the highest prediction accuracy.</sentence>
					<sentence id="S2.148">In order to analyze the predictions, five-fold cross validation experiments were performed and the area under partial ROC (receiver operator characteristic) curves (partial AUCs) measured.</sentence>
					<sentence id="S2.149">ROC50 and ROC100 curves were selected as they consider a large enough number of positives to include <xcope id="X2.149.1">all protein pairs <cue type="speculation" ref="X2.149.1">predicted</cue> to have a posterior odds ratio above 1.0 by all the predictors investigated</xcope>.</sentence>
					<sentence id="S2.150">Protein pairs predicted to have a posterior odds ratio below 1.0 have an estimated true positive rate below 50% and thus <xcope id="X2.150.2">are more <cue type="speculation" ref="X2.150.2">likely</cue> <xcope id="X2.150.1"><cue type="negation" ref="X2.150.1">not</cue> to interact</xcope> than to interact</xcope>.</sentence>
					<sentence id="S2.151">These protein pairs are therefore <xcope id="X2.151.1"><cue type="negation" ref="X2.151.1">not</cue> of interest in this context</xcope>.</sentence>
					<sentence id="S2.152">The area under all ROCn curves considered is relatively low because of the high proportion of negatives with respect to positives in the training and test sets (100:1).</sentence>
					<sentence id="S2.153">Table 3 summarizes the characteristics of 19 different predictors and shows accuracy measures.</sentence>
					<sentence id="S2.154">Individual modules do <xcope id="X2.154.1"><cue type="negation" ref="X2.154.1">not</cue> achieve high scores for the areas under the ROC50 and ROC100</xcope>.</sentence>
					<sentence id="S2.155">In fact, all ROC50 AUC values achieved by individual modules are below 0.025 and the Expression and Disorder modules do <xcope id="X2.155.2"><cue type="negation" ref="X2.155.2">not</cue> predict any protein pairs <xcope id="X2.155.1">(positive <cue type="speculation" ref="X2.155.1">or</cue> negative)</xcope> above a posterior odds ratio of 1</xcope>, which is expected as the highest likelihood ratios they achieve are lower than 400 (see Figure 1A).</sentence>
					<sentence id="S2.156">As more Group A modules are considered within the same predictor, the ROCn AUC scores increase significantly, as <xcope id="X2.156.1"><cue type="speculation" ref="X2.156.1">would</cue> be expected</xcope> since these features are independent (as shown in Table 2) and thus contribute different information to the prediction.</sentence>
					<sentence id="S2.157">For example, the predictor that considers both the Expression and Combined modules achieves a ROC50 AUC of 0.033 compared to 0.003 and 0.022 respectively for the individual modules.</sentence>
					<sentence id="S2.158">However, the Disorder module does not contribute significantly to the prediction as predictors that consider it do <xcope id="X2.158.2"><cue type="negation" ref="X2.158.2">not</cue>, in general, do better than their counterparts that do <xcope id="X2.158.1"><cue type="negation" ref="X2.158.1">not</cue> use it</xcope></xcope>.</sentence>
					<sentence id="S2.159">For example, both the Expression-Orthology predictor and the Expression-Orthology-Disorder predictor achieve a ROC50 AUC of 0.024.</sentence>
					<sentence id="S2.160">The Disorder module offers the advantage of increasing the coverage of the prediction as a disorder score is calculated for all protein pairs.</sentence>
					<sentence id="S2.161">However, <xcope id="X2.161.2">this <cue type="speculation" ref="X2.161.2">appears</cue> to add more noise to the prediction <xcope id="X2.161.1"><cue type="negation" ref="X2.161.1">without</cue> increasing the accuracy</xcope></xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S2.162">Prediction accuracy of different combinations of modules</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.163">As the scores of the predictors increase, so do the number of interactions predicted above different posterior odds ratio thresholds (see lower portion of Table 3).</sentence>
					<sentence id="S2.164">For example, the Expression-Orthology predictor achieves a ROC50 AUC of 0.024 and predicts 5670 interactions at a posterior odds ratio greater than 1 whereas the Expression-Orthology-Combined predictor achieves a ROC50 AUC of 0.044 and predicts over 15000 interactions at a posterior odds ratio above 1.</sentence>
					<sentence id="S2.165">The best combination of Group A modules is the predictor consisting of the Expression, Orthology and Combined modules.</sentence>
					<sentence id="S2.166">The Transitive module, which can only be used in combination with other modules, increases substantially the scores and number of interactions predicted.</sentence>
					<sentence id="S2.167">The right-hand portion of Table 3 shows the accuracy measures for the highest scoring subset of predictors that consider the Transitive module.</sentence>
					<sentence id="S2.168">The Transitive module enhances the prediction by identifying among protein pairs with a relatively high preliminary score those <xcope id="X2.168.1">that are most <cue type="speculation" ref="X2.168.1">likely</cue> to interact</xcope>, by considering the local topology of the network around them.</sentence>
					<sentence id="S2.169">For example, the ROC50 AUC rises from 0.044 to 0.075 when the Transitive module is added to the Expression-Orthology-Combined predictor, and the number of predictions above a posterior odds ratio of 1 doubles from 15330 to 34780.</sentence>
					<sentence id="S2.170">Once again, the Disorder module does <xcope id="X2.170.1"><cue type="negation" ref="X2.170.1">not</cue> contribute positively to the prediction</xcope>.</sentence>
					<sentence id="S2.171">Its inclusion does <xcope id="X2.171.1"><cue type="negation" ref="X2.171.1">not</cue> increase any of the measures of accuracy considered</xcope>.</sentence>
					<sentence id="S2.172">The predictor that considers the Expression, Orthology, Combined and Transitive modules is the one that achieves the highest accuracy overall.</sentence>
					<sentence id="S2.173">It is this predictor that is further analyzed in the next sections.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.174">Comparison to predictions generated using alternative training sets</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.175">In this work training sets were used that comprised 100 times more negatives than positives, with the negatives randomly selected and filtered to remove any known or <xcope id="X2.175.1"><cue type="speculation" ref="X2.175.1">suspected</cue> positives</xcope> (see Methods).</sentence>
					<sentence id="S2.176">Other groups have used negative:positive ratios ranging from 1 to more than 600 (see for example 374752).</sentence>
					<sentence id="S2.177">In addition, several groups use localization-derived negatives (i.e. protein pairs <xcope id="X2.177.2">that are <cue type="negation" ref="X2.177.2">not</cue> annotated as being localized to the same cellular compartment</xcope>) <xcope id="X2.177.1"><cue type="negation" ref="X2.177.1">rather than</cue> randomly chosen negatives</xcope> (see for example 374346).</sentence>
					<sentence id="S2.178">These issues have been investigated previously 53.</sentence>
					<sentence id="S2.179">Since the choice of negative training data <xcope id="X2.179.1"><cue type="speculation" ref="X2.179.1">may</cue> influence the method</xcope>, the choice of different training sets in the context of the probabilistic predictor presented here was investigated to determine which type of training set offers the highest accuracy.</sentence>
					<sentence id="S2.180">Table 4 compares the accuracy of predictors trained with negative:positive ratios of 1:100 and 1:1 and tested by five-fold cross validation.</sentence>
					<sentence id="S2.181"><xcope id="X2.181.1">Ratios greater than 100 were <cue type="negation" ref="X2.181.1">not</cue> considered</xcope> because they are computationally infeasible given the size of our datasets and the architecture of the predictor.</sentence>
					<sentence id="S2.182">To perform such a comparison, the EOCT predictor (Expression, Orthology, Combined and Transitive modules) was trained on datasets consisting of <xcope id="X2.182.1"><cue type="speculation" ref="X2.182.1">either</cue> equal numbers of positives and negatives <cue type="speculation" ref="X2.182.1">or</cue> 100 times more negatives than positives</xcope> and then tested on both types of datasets.</sentence>
					<sentence id="S2.183">As shown in Table 4, the predictors trained on datasets containing 100 times more negatives than positives perform significantly better than those trained on datasets containing equal numbers of positives and negatives.</sentence>
					<sentence id="S2.184">For example, the 1:1 pos:neg trained predictor achieves a ROC50 AUC of 0.0645 whereas its 1:100 pos:neg trained counterpart achieves a 0.0747 ROC50 AUC.</sentence>
					<sentence id="S2.185">This <xcope id="X2.185.1"><cue type="speculation" ref="X2.185.1">could</cue> be due to the fact that the number of non-interacting protein pairs outweighs greatly the number of interacting protein pairs in cells</xcope>.</sentence>
					<sentence id="S2.186">When equal numbers of positives and negatives are used in training, <xcope id="X2.186.1"><xcope id="X2.186.2">the diversity that exists in the non-interacting protein pair space <cue type="speculation" ref="X2.186.1">may</cue> <cue type="negation" ref="X2.186.2">not</cue> be captured</xcope></xcope>, thus resulting in misleading likelihood ratios for the predictive modules.</sentence>
					<sentence id="S2.187">It should be noted that predictors tested on datasets consisting of equal numbers of positives and negatives achieve much higher accuracy measures than those tested on datasets containing 100 times more negatives than positives.</sentence>
					<sentence id="S2.188">This is because the number of positives scoring higher than the highest scoring n negatives, for a given value of n and a given predictor, will be greater if there are equal numbers of positives and negatives in the test set than if there are more negatives than positives.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S2.189">Influence of the negative:positive training set ratio on the prediction accuracy</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.190">The ROCn AUCs are an average of five separate experiments (each of which is itself a five-fold cross validation experiment).</sentence>
					<sentence id="S2.191">Their standard deviation is shown in parenthesis.</sentence>
					<sentence id="S2.192">The effect of localization-derived negatives <xcope id="X2.192.3"><cue type="negation" ref="X2.192.3">rather than</cue> randomly chosen negatives</xcope> was also investigated to see <xcope id="X2.192.2"><cue type="speculation" ref="X2.192.2">if</cue> it <xcope id="X2.192.1"><cue type="speculation" ref="X2.192.1">would</cue> increase the prediction accuracy</xcope></xcope>.</sentence>
					<sentence id="S2.193">A criticism of randomly chosen negatives is that they will contain some true interactors.</sentence>
					<sentence id="S2.194">However, the set of interacting pairs in the full protein pair space is small and thus the contamination rate of randomly chosen negative datasets will in fact be very low.</sentence>
					<sentence id="S2.195">Contamination is <xcope id="X2.195.2"><cue type="speculation" ref="X2.195.2">probably</cue> below 1%</xcope>, which is <xcope id="X2.195.1"><cue type="speculation" ref="X2.195.1">likely</cue> lower than the contamination rate of the positive dataset</xcope> as discussed in 47.</sentence>
					<sentence id="S2.196">Localization-derived negatives, on the other hand, <xcope id="X2.196.1"><cue type="speculation" ref="X2.196.1">should</cue> be free of contamination</xcope>, if the localization annotations are complete and accurate, both conditions that are difficult to obtain as discussed in 54.</sentence>
					<sentence id="S2.197">However, one can argue that localization-derived negatives <xcope id="X2.197.3"><cue type="speculation" ref="X2.197.3">might</cue> <xcope id="X2.197.2"><cue type="negation" ref="X2.197.2">not</cue> be able to capture the full diversity of the non-interacting protein space</xcope></xcope> since many proteins in the same cellular compartment do <xcope id="X2.197.1"><cue type="negation" ref="X2.197.1">not</cue> interact</xcope>.</sentence>
					<sentence id="S2.198">In addition, proteins specific to a cellular compartment <xcope id="X2.198.1"><cue type="speculation" ref="X2.198.1">may</cue> have different characteristics to proteins in other compartments</xcope>.</sentence>
					<sentence id="S2.199">Such predictors <xcope id="X2.199.3"><cue type="speculation" ref="X2.199.3">may</cue> <xcope id="X2.199.2"><cue type="negation" ref="X2.199.2">not</cue> generalize well</xcope></xcope> when predicting on cell-wide protein pairs which consist not only of non-colocalized non-interacting pairs but also numerous protein pairs that do <xcope id="X2.199.1"><cue type="negation" ref="X2.199.1">not</cue> interact</xcope> but are present in the same cellular compartment.</sentence>
					<sentence id="S2.200">These issues have been discussed previously 52.</sentence>
					<sentence id="S2.201">In order to see <xcope id="X2.201.2"><cue type="speculation" ref="X2.201.2">if</cue> different types of negatives <xcope id="X2.201.1"><cue type="speculation" ref="X2.201.1">could</cue> influence the accuracy of the predictors developed here</xcope></xcope> we generated negative training/test sets as in 46 by identifying all pairs of human proteins for which one protein is annotated as being nuclear and the other is annotated as being localized to the plasma membrane in the HPRD database 15.</sentence>
					<sentence id="S2.202">The Combined module for these predictors only considers domains and PTMs but <xcope id="X2.202.2"><cue type="negation" ref="X2.202.2">not</cue> subcellular localization</xcope> as this <xcope id="X2.202.1"><cue type="speculation" ref="X2.202.1">would</cue> result in using this feature both in the selection of the training set and as a feature predictive of interaction</xcope>.</sentence>
					<sentence id="S2.203">The localization-derived negative trained predictor tested on sets containing localization-derived negatives achieves a lower accuracy than that of the random negative trained predictor tested on a test set containing randomly-generated negatives (0.0686 +/- 0.0010 vs 0.0747 +/- 0.0022).</sentence>
					<sentence id="S2.204">This is most <xcope id="X2.204.4"><cue type="speculation" ref="X2.204.4">likely</cue> due to the fact that the localization-derived negative trained predictor <xcope id="X2.204.3"><cue type="negation" ref="X2.204.3">cannot</cue> take full advantage of the Transitive module</xcope></xcope>, since the network resulting from the predictions of the Group A modules <xcope id="X2.204.2"><cue type="speculation" ref="X2.204.2">likely</cue> does <xcope id="X2.204.1"><cue type="negation" ref="X2.204.1">not</cue> sample the whole protein pair space well</xcope></xcope>.</sentence>
					<sentence id="S2.205">Our predictor trained with randomly generated negatives and a negative:positive ratio of 100 performs the best out of all the combinations of training sets investigated.</sentence>
					<sentence id="S2.206">It is this predictor that is further analyzed in subsequent sections.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.207">Contribution of the modules</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.208">The relative contribution of the modules to the prediction of interaction was investigated in order to gain a better understanding of the predictive power and areas of highest usefulness of the different modules.</sentence>
					<sentence id="S2.209">To do this, all protein pairs were considered that achieve an estimated posterior odds ratio &gt; 1 when the EOCT predictor was trained on the full datasets <xcope id="X2.209.1"><cue type="negation" ref="X2.209.1">without</cue> cross-validation</xcope>.</sentence>
					<sentence id="S2.210">This set consists of 37606 distinct predicted interactions and is referred to as the LR400 dataset (all these interactions are listed and ranked in Additional File 3).</sentence>
					<sentence id="S2.211">These protein pairs represent the most <xcope id="X2.211.1"><cue type="speculation" ref="X2.211.1">probable</cue> interactors</xcope> with respect to the features considered, among all protein pairs examined by the predictor.</sentence>
					<sentence id="S2.212">To investigate the individual contribution of each module, we looked at the number of interactions predicted out of all LR400 pairs as a function of the minimum likelihood ratio of each module.</sentence>
					<sentence id="S2.213">As shown in Figure 4A, all modules contribute positively (i.e. contribute a likelihood ratio greater than 1.0) to the prediction of a certain proportion of the interactions in the LR400 dataset.</sentence>
					<sentence id="S2.214">The Transitive module and to an even greater extent, the Combined module contribute positively to the prediction of a very high proportion of the LR400 protein pairs (73% and 91% of the LR400 interactions have likelihood ratios greater than 1 for the Transitive and Combined modules respectively).</sentence>
					<sentence id="S2.215">The Transitive module provides a likelihood ratio of 91 for the prediction of over 70% of the LR400 interactions.</sentence>
					<sentence id="S2.216">The Combined module provides positive evidence for the highest number of interactions of the LR400 dataset.</sentence>
					<sentence id="S2.217">However, the value of the likelihood ratio it contributes is below 20 for over 50% of protein pairs in the LR400 dataset (which means that for these protein pairs, the Combined module must be used in combination with other modules to achieve a total likelihood ratio above 400).</sentence>
					<sentence id="S2.218">The Combined module does, however, achieve likelihood ratios high enough to predict over two thousand interactions of the LR400 dataset on its own, less than 15% of which are present in the training set.</sentence>
					<sentence id="S2.219">The Orthology module contributes to the prediction of only 8474 protein pairs in the LR400 dataset (23%).</sentence>
					<sentence id="S2.220">However, a large majority (&gt; 75%) of these 8474 predicted interactions achieve likelihood ratios above 200 from this module.</sentence>
					<sentence id="S2.221">In fact, almost 40% of these LR400 interactions achieve a likelihood ratio above 400 from the Orthology module.</sentence>
					<sentence id="S2.222">This <xcope id="X2.222.1"><cue type="speculation" ref="X2.222.1">indicates that</cue> most interactions predicted by the Orthology module (alone or in combination with other modules) are based on the highest scoring Orthology bins (see Figure 1A) which are the most conserved yeast interactions (whose bin achieves a likelihood ratio of 237), as well as human paralogous interactions and interactions found in more than one model organism (both of which achieve a likelihood well above 400)</xcope>.</sentence>
					<sentence id="S2.223">Few interactions in the LR400 dataset are predicted on the basis of having interacting orthologs in worm or fly alone.</sentence>
					<sentence id="S2.224">The Expression module provides positive evidence for a little less than half the predictions in the LR400 dataset.</sentence>
					<sentence id="S2.225">However, as previously noted, the highest likelihood provided by this module is 33 and thus the Expression module <xcope id="X2.225.1"><cue type="negation" ref="X2.225.1">cannot</cue> predict interaction on its own</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S2.226">Contribution of the modules</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.227">Contribution of the modules.</sentence>
					<sentence id="S2.228">To examine the contribution of the different modules, we plotted the number of interactions predicted among all LR400 interactions (all interactions predicted using the full predictor that obtain a likelihood ratio of interaction greater than 400) as a function of the minimum likelihood ratio of individual modules (in panel A) or of combinations of modules (in panel B).</sentence>
					<sentence id="S2.229">In the case of combinations of modules (panel B), the minimum likelihood ratio is the product of the likelihood ratios of the modules considered.</sentence>
					<sentence id="S2.230">Thus for example, the product of the expression and orthology ratios is greater than 1 for almost 20000 LR400 interactions and greater than 10 for approximately 10000 LR400 interactions (dark blue diamonds in panel B).</sentence>
					<sentence id="S2.231">E: Expression module, O: Orthology module, C: Combined module, T: Transitive module.</sentence>
					<sentence id="S2.232">Figure 4B summarizes the contributions of different combinations of modules.</sentence>
					<sentence id="S2.233">The Combined and Transitive modules contribute the most to the prediction of interactions.</sentence>
					<sentence id="S2.234">They alone can predict approximately 27000 of the 37606 interactions of the LR400 dataset.</sentence>
					<sentence id="S2.235">When they are both present, regardless of which other modules are also present, they predict over 70% of the LR400 interactions.</sentence>
					<sentence id="S2.236">When <xcope id="X2.236.1">either of these two modules is <cue type="negation" ref="X2.236.1">absent</cue></xcope>, fewer than 12500 interactions are predicted.</sentence>
					<sentence id="S2.237">In contrast, the two remaining modules (Expression and Orthology) can predict approximately 5000 interactions together.</sentence>
					<sentence id="S2.238">This is interesting as many of the publicly available predicted interaction datasets mentioned in the Background section use mainly orthology transfer from model organisms to identify interactions.</sentence>
					<sentence id="S2.239">As the majority of the LR400 interactions are derived from the Combined and Transitive modules, it is <xcope id="X2.239.2"><cue type="speculation" ref="X2.239.2">possible</cue> that the method is identifying a large subset of interactions that are <xcope id="X2.239.1"><cue type="negation" ref="X2.239.1">not</cue> common to previous human protein interaction datasets</xcope></xcope>.</sentence>
					<sentence id="S2.240">This is discussed further in the next section.</sentence>
					<sentence id="S2.241">The curve representing the full predictor (consisting of the Expression, Orthology, Combined and Transitive modules) is also represented in Figure 4B (the dark green squares).</sentence>
					<sentence id="S2.242">By definition, it predicts all proteins in the LR400 dataset at likelihood ratios equal to or above 400 (this is how the LR400 dataset was generated).</sentence>
					<sentence id="S2.243">The right side of the curve illustrates the number of interactions that are predicted above likelihood ratios of 400 and more.</sentence>
					<sentence id="S2.244">As shown in Figure 4B, the full predictor predicts approximately 20000 interactions at a total likelihood ratio of 1600 (which is equivalent to an estimated posterior odds ratio of 4).</sentence>
					<sentence id="S2.245">At a likelihood ratio of 4000, approximately 11000 interactions are predicted and at a likelihood ratio of 8000, approximately 6500 interactions are predicted.</sentence>
					<sentence id="S2.246">We verified that the increasing estimated posterior odds ratios translated into better predictive value.</sentence>
					<sentence id="S2.247">Figure 5 shows the true positive rate versus false positive rate for different posterior odds ratios as measured by five-fold cross validation.</sentence>
					<sentence id="S2.248">As the posterior odds ratio increases, the false positive rate decreases and the relative proportion of true positives increases when compared to the proportion of false positives.</sentence>
					<sentence id="S2.249">Accordingly, <xcope id="X2.249.1">subsets of very high quality predictions <cue type="speculation" ref="X2.249.1">may</cue> be generated by choosing a suitably high posterior odds ratio threshold</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S2.250">True positive rate versus false positive rate for different estimated posterior odds ratios</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.251">True positive rate versus false positive rate for different estimated posterior odds ratios.</sentence>
					<sentence id="S2.252">The true positive rate (TPR) versus false positive rate (FPR) is plotted for different values of the posterior odds ratio estimated for the dataset by five-fold cross-validation.</sentence>
					<sentence id="S2.253">As the posterior odds ratio increases, the false positive rate decreases and the ratio of the true positive rate divided by the false positive ratio increases.</sentence>
					<sentence id="S2.254">Thus, higher quality datasets can be generated by requiring higher posterior odds ratios.</sentence>
					<sentence id="S2.255">The TPR is calculated as the number of true positives predicted divided by the total number of positives in the test set.</sentence>
					<sentence id="S2.256">The FPR is calculated as the number of false positives predicted divided by the total number of negatives in the test set.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.257">Comparison to other interaction datasets</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.258">The false positive rate (FPR) of our predictor was estimated by the method of D'Haeseleer and Church 1855 and used to compare it to other prediction datasets.</sentence>
					<sentence id="S2.259">The Ramani interaction dataset that was automatically extracted from the literature 16 as well as all new interactions present in the October 2006 version of the manually curated HPRD database 15 (but <xcope id="X2.259.1"><cue type="negation" ref="X2.259.1">none</cue> of the interactions also present in earlier versions of the HPRD which were used to train our predictor) were taken as reference datasets</xcope>.</sentence>
					<sentence id="S2.260">The D'Haeseleer and Church method compares two experimental datasets to a reference set and <xcope id="X2.260.1"><cue type="speculation" ref="X2.260.1">assumes</cue> that all intersections between the three datasets contain true positives</xcope>.</sentence>
					<sentence id="S2.261">It is thus <xcope id="X2.261.1"><cue type="speculation" ref="X2.261.1">possible</cue> to estimate the number of true positives predicted by an experimental dataset by comparing the number of interactions present in the different intersections of the two experimental methods and the reference dataset</xcope> (for details, see 1855).</sentence>
					<sentence id="S2.262">Here, we compare three human interaction prediction datasets: the Rhodes probabilistic dataset 46, the Lehner orthology-derived dataset 36 and the most accurate of our predictors (the LR400 subset of the predictor considering the Expression, Orthology, Combined and Transitive modules).</sentence>
					<sentence id="S2.263">We estimated false positive rates for each of the datasets by comparing them two by two to one of the reference datasets, thus generating 4 to 6 different estimates of false positive rates for each computational dataset, as shown in Figure 6A <xcope id="X2.263.1">(the two Lehner datasets were <cue type="negation" ref="X2.263.1">not</cue> compared to each other</xcope>, which is why they have fewer FPR estimates).</sentence>
					<sentence id="S2.264">The rates estimated for the Rhodes and Lehner datasets are similar to previous estimates 18.</sentence>
					<sentence id="S2.265">The estimated false positive rates for the LR400, Rhodes and core Lehner are quite similar (an average of 76% FPR for both the LR400 and core Lehner datasets and 78% for the Rhodes dataset) and well below the overall average false-positive rate of 90% estimated for most available human high-throughput experimental and prediction interaction datasets 18.</sentence>
					<sentence id="S2.266">It should be noted that the Rhodes, Lehner and Ramani datasets annotate interactions as a relationship between human genes and <xcope id="X2.266.1"><cue type="negation" ref="X2.266.1">not</cue> their protein products directly</xcope>.</sentence>
					<sentence id="S2.267">However, not all proteins encoded by a single gene will necessarily interact with all protein products encoded by a second gene, even if one such protein pair does.</sentence>
					<sentence id="S2.268">This is why we describe interactions as a relationship between two proteins, allowing for a more precise description of the interaction.</sentence>
					<sentence id="S2.269">To compare our predictions to these datasets, we <xcope id="X2.269.1"><cue type="speculation" ref="X2.269.1">consider</cue> that two genes interact if at least one of their respective protein products interact</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S2.270">Comparison to other interaction datasets</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.271">Comparison to other interaction datasets.</sentence>
					<sentence id="S2.272">The false positive rates shown in panel A were estimated for the LR400 dataset as well as the Rhodes [46] and Lehner [36] predictions using the method described in [18, 55] by comparing them two-by-two to a reference dataset.</sentence>
					<sentence id="S2.273">The number and overlap of distinct proteins (shown in B) and distinct interactions (shown in C) are shown for the LR400 dataset, the Rhodes prediction dataset and the June 2006 version of the HPRD.</sentence>
					<sentence id="S2.274">In Figure 6B and 6C, we compare the number of distinct proteins and distinct interactions of the LR400 dataset to those of the Rhodes prediction dataset and the June 2006 version of the HPRD which was used to train our predictor.</sentence>
					<sentence id="S2.275">The Rhodes dataset was trained using an earlier version of the HPRD.</sentence>
					<sentence id="S2.276">As can be seen in Figure 6, the intersections between the three datasets considered are low, especially when comparing the interactions.</sentence>
					<sentence id="S2.277">Both the Rhodes dataset and our LR400 dataset predict interactions involving many proteins that are <xcope id="X2.277.1"><cue type="negation" ref="X2.277.1">not</cue> even present in their positive training set (the HPRD)</xcope>.</sentence>
					<sentence id="S2.278">Many of the predictions in these two datasets concern protein pairs and proteins that are <xcope id="X2.278.2"><cue type="negation" ref="X2.278.2">not</cue> present in other datasets</xcope>, <xcope id="X2.278.1"><cue type="speculation" ref="X2.278.1">suggesting</cue> that they cover different regions of the human interaction space</xcope>.</sentence>
					<sentence id="S2.279">As <xcope id="X2.279.3"><cue type="speculation" ref="X2.279.3">suggested</cue> in 18, by making more such datasets available, it will be <xcope id="X2.279.2"><cue type="speculation" ref="X2.279.2">possible</cue> to increase our coverage of the interaction space and determine the most <xcope id="X2.279.1"><cue type="speculation" ref="X2.279.1">likely</cue> human interactions</xcope></xcope></xcope>.</sentence>
					<sentence id="S2.280">Another human interaction dataset has recently become available: the IntNetDB 56.</sentence>
					<sentence id="S2.281">It was generated by integrating seven different features (four of which involve transferring interactions or characteristics of protein pairs from model organisms to human) in a probabilistic framework.</sentence>
					<sentence id="S2.282">Interactions were predicted above a TP/FP ratio (number of true positives divided by the number of false positives in the test set) of 1.</sentence>
					<sentence id="S2.283">Using such a threshold, the authors claim to predict 180 010 human interactions.</sentence>
					<sentence id="S2.284">We do <xcope id="X2.284.2"><cue type="negation" ref="X2.284.2">not</cue> compare our predictions to this dataset</xcope> because such a threshold of TP/FP &gt; 1 does <xcope id="X2.284.1"><cue type="negation" ref="X2.284.1">not</cue> correspond to a posterior odds threshold &gt; 1</xcope>.</sentence>
					<sentence id="S2.285">Depending on the positive-to-negative ratio used in the datasets, TP/FP &gt; 1 <xcope id="X2.285.1"><cue type="speculation" ref="X2.285.1">might</cue> correspond to an average posterior odds ratio of 1</xcope>.</sentence>
					<sentence id="S2.286">In contrast, the average posterior odds ratio of our LR400 dataset is above 700.</sentence>
					<sentence id="S2.287">In comparison, by using a threshold of TP/FP &gt; 1 in our test set, we <xcope id="X2.287.1"><cue type="speculation" ref="X2.287.1">predict</cue> over 1 000 000 human interactions</xcope>.</sentence>
					<sentence id="S2.288">We do <xcope id="X2.288.2"><cue type="negation" ref="X2.288.2">not</cue> <xcope id="X2.288.1"><cue type="speculation" ref="X2.288.1">believe</cue> that the quality of this large number of predictions is high enough to warrant their publication</xcope></xcope> since the great majority of these protein pairs achieve a posterior odds ratio below 1.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.289">Independent validation</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.290">Although the overlap between the LR400 dataset and the HPRD-derived positive training set is below 10% as shown in Figure 6C, the proportion of interactions common to these two sets is <xcope id="X2.290.1"><cue type="negation" ref="X2.290.1">not</cue> equally distributed for all posterior odds ratios of interaction values</xcope>.</sentence>
					<sentence id="S2.291">As shown in Figure 7, while less than 3% of the protein pairs predicted to interact at posterior odds ratios between 1 and 2 overlap with the HPRD dataset used for training, this value increases to over 50% for the highest scoring subsets of the LR400 dataset.</sentence>
					<sentence id="S2.292">These highest scoring predictions receive high likelihood ratios of interaction from all four predictive modules and represent the strongest examples of interaction as evaluated by our predictor.</sentence>
					<sentence id="S2.293">Such examples include interactions that allow the formation of well-known protein complexes such as the proteasome, the MCM protein complex involved in the initiation of genome replication, replication factor C, the TBP/TAF complex (TBP-associated factors) and the EIF complex (eukaryotic translation initiation factors).</sentence>
					<sentence id="S2.294">The highest scoring predictions in the LR400 dataset thus mainly represent interactions present in the HPRD dataset as well as interactions between proteins that have strong sequence identity to these known interacting pairs.</sentence>
					<sentence id="S2.295">However, as the posterior odds ratio decreases, the overlap between the predictions and the HPRD-derived training set decreases.</sentence>
					<sentence id="S2.296">Some subsets of quite high posterior odds have much smaller overlaps with the training set.</sentence>
					<sentence id="S2.297">For example, interactions predicted at posterior odds ratios between 128 and 2048 have a 20 to 30% overlap with the training set as shown in Figure 7.</sentence>
					<sentence id="S2.298">Although <xcope id="X2.298.2">many of these novel predictions have <cue type="negation" ref="X2.298.2">not</cue> been previously investigated in the literature</xcope>, there exists experimental evidence supporting a subset of these predictions which is <xcope id="X2.298.1"><cue type="negation" ref="X2.298.1">not</cue> present in the June 2006 version of the HPRD used to train our predictor</xcope>, thus providing independent validation of our method.</sentence>
					<sentence id="S2.299">Five such validated predictions are reported here:</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S2.300">Overlap of different subsets of the LR400 dataset with the HPRD-derived training set</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.301">Overlap of different subsets of the LR400 dataset with the HPRD-derived training set.</sentence>
					<sentence id="S2.302">The number of interactions predicted and the proportion of overlap with the training set (which was derived from the HPRD) were calculated for subsets of the LR400 dataset of different posterior odds ratios.</sentence>
					<sentence id="S2.303">-TCPTP was predicted to interact with STAT6 at a posterior odds ratio of 4300.</sentence>
					<sentence id="S2.304">It has been recently reported that TCPTP, the only protein tyrosine phosphatase known to localize to the nucleus, dephosphorylates STAT6 in this cellular compartment, which <xcope id="X2.304.1"><cue type="speculation" ref="X2.304.1">may</cue> in turn lead to the suppression of Interleukine-4 (IL-4) induced signaling</xcope> 57.</sentence>
					<sentence id="S2.305">-N-WASP and ARP3 achieve a predicted posterior odds ratio of interaction of 2700.</sentence>
					<sentence id="S2.306">A recent report <xcope id="X2.306.2"><cue type="speculation" ref="X2.306.2">suggested</cue> that the IQGAP1 protein <xcope id="X2.306.1"><cue type="speculation" ref="X2.306.1">can</cue> activate N-WASP</xcope> thus changing its conformation and allowing it to bind the ARP2/3 complex, which in turn directs the generation of branched actin filaments required for the extension of a lamellipodium</xcope> 58.</sentence>
					<sentence id="S2.307">-The VAMP3-VTI1A interaction was predicted with a posterior odds ratio of 1518.</sentence>
					<sentence id="S2.308"><xcope id="X2.308.1">Both these proteins are <cue type="speculation" ref="X2.308.1">believed</cue> to be part of the SNARE (soluble N-ethylmaleimide-sensitive factor attachment protein receptor) family of proteins which are involved in membrane fusion events</xcope>.</sentence>
					<sentence id="S2.309">VTI1A is a trans-Golgi-network-localized <xcope id="X2.309.1"><cue type="speculation" ref="X2.309.1">putative</cue> t-SNARE</xcope> 59 and VAMP3 is an early/recycling endosomal v-SNARE 60.</sentence>
					<sentence id="S2.310">These two proteins were recently shown to interact, leading to their functional implication in the post-Golgi retrograde transport step 61.</sentence>
					<sentence id="S2.311">-CDK2 and MCM4 were predicted to interact at a posterior odds ratio of 62.</sentence>
					<sentence id="S2.312">CDK2 has recently been shown to phosphorylate MCM4, a subunit of a <xcope id="X2.312.1"><cue type="speculation" ref="X2.312.1">putative</cue> replicative helicase</xcope> essential for DNA replication, on two distinct residues, leading to a change in its affinity to chromatin and its enrichment in the nucleolus 62.</sentence>
					<sentence id="S2.313">-Sam68 and Smad2 achieve a predicted posterior odds ratio of 32.</sentence>
					<sentence id="S2.314">This interaction has been experimentally demonstrated by large-scale yeast-two-hybrid analysis of the Smad signaling system 63.</sentence>
					<sentence id="S2.315">Our probabilistic predictor therefore not only reproduces and completes well-known protein complexes but also identifies novel interactions, a subset of which have been independently validated.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S2.316">Conclusion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.317"><xcope id="X2.317.1">The current human protein interaction map is <cue type="speculation" ref="X2.317.1">estimated</cue> to be only 10% complete</xcope> 18.</sentence>
					<sentence id="S2.318">Here, we investigated the prediction of human protein-protein interactions in an effort to increase the coverage of the human interactome while simultaneously providing high quality predictions.</sentence>
					<sentence id="S2.319">By considering several different types of orthogonal and quite distinct features including expression, orthology, combined protein characteristics and local network topology, we predicted over 37000 human protein interactions and explored a subspace of the human interactome <xcope id="X2.319.1">that has <cue type="negation" ref="X2.319.1">not</cue> been investigated by previous large interaction datasets</xcope>.</sentence>
					<sentence id="S2.320">Our investigation led us to compare the influence of different training sets on the prediction accuracy.</sentence>
					<sentence id="S2.321">The use of randomly generated negative training examples and large negative-to-positive ratios in the training set generated the most accurate predictors in the context of our model.</sentence>
					<sentence id="S2.322">A comparison to other large human interaction datasets revealed the average false positive rate of our dataset to be 76%, which is much lower than the overall average for most large scale, currently available, human interaction datasets (experimental and computational) <xcope id="X2.322.1"><cue type="speculation" ref="X2.322.1">estimated</cue> to be 90%</xcope> 18.</sentence>
					<sentence id="S2.323">A subset of our novel predictions have been independently validated by identifying recent reports that experimentally investigated and confirmed that these protein pairs do interact.</sentence>
					<sentence id="S2.324">We provide all our predictions ranked according to the posterior odds ratio of interaction in Additional File 3.</sentence>
					<sentence id="S2.325">It is thus <xcope id="X2.325.2"><cue type="speculation" ref="X2.325.2">possible</cue> to restrict the dataset to the highest scoring protein pairs (and only choose for example, protein pairs that have an <xcope id="X2.325.1"><cue type="speculation" ref="X2.325.1">estimated</cue> true positive rate of interaction above 90%)</xcope></xcope>.</sentence>
					<sentence id="S2.326">By making this human interaction prediction dataset publicly available, it is our <xcope id="X2.326.1"><cue type="speculation" ref="X2.326.1">hope</cue> that it will help to identify the most high-confidence interactions, leading to a more complete and accurate human interaction map</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S2.327">Methods</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.328">Datasets</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.329">In order to investigate the likelihood of interaction of human proteins, 62322 human protein sequences were downloaded from the International Protein Index (IPI) database (version 3.16) 64.</sentence>
					<sentence id="S2.330">Some of these proteins are alternative transcripts of the same gene but can have distinct interaction partners.</sentence>
					<sentence id="S2.331">Known interactions were downloaded from the Human Protein Reference Database (HPRD; June 2006 version) 15.</sentence>
					<sentence id="S2.332"><xcope id="X2.332.1">Duplicate interactions and self-interactions were <cue type="negation" ref="X2.332.1">not</cue> considered</xcope>.</sentence>
					<sentence id="S2.333">Additionally, <xcope id="X2.333.1">some proteins were <cue type="negation" ref="X2.333.1">not</cue> recovered in the conversion between different identifiers</xcope>.</sentence>
					<sentence id="S2.334">This resulted in 26896 distinct human protein interactions involving 7531 distinct human proteins present in the initial IPI dataset.</sentence>
					<sentence id="S2.335">The 26896 interactions from the June 2006 version of the HPRD were used as the positive dataset in the training/testing of the predictor.</sentence>
					<sentence id="S2.336">Two different sets of non-interacting protein pairs were investigated: the main analysis employed a randomly-generated negative dataset but this was also compared to a localization-derived negative dataset.</sentence>
					<sentence id="S2.337">Both non-interacting protein datasets were cleaned by removing all protein pairs that came from the positive dataset as well as protein pairs that were annotated as interacting in other databases (DIP 65: 679 interactions, BIND 66: 2650 interactions), or predicted to interact in other studies (OPHID 67: 21815 interactions).</sentence>
					<sentence id="S2.338">Of the 62322 human proteins from the initial IPI dataset, 22889 were characterized by at least one of the features that we considered to predict interaction (see the Features section).</sentence>
					<sentence id="S2.339">These 22889 human proteins are encoded by 16904 distinct genes and are referred to as the Informative Protein Set.</sentence>
					<sentence id="S2.340">The randomly-generated negative dataset used for the training and testing of the predictor was created by selecting protein pairs at random from the Informative Protein Set.</sentence>
					<sentence id="S2.341">In contrast, the localization-derived negative dataset was created by selecting protein pairs from the Informative Protein Set for which the HPRD 15 annotates one as being primarily in the plasma membrane and the other as primarily in the nucleus.</sentence>
					<sentence id="S2.342">Training and testing was performed with 5-fold cross-validation.</sentence>
					<sentence id="S2.343">In addition, positive to negative ratios of 1:1 and 1:100 were considered.</sentence>
					<sentence id="S2.344">The predictions were compared to the literature-mined Ramani dataset 16, the orthology-derived Lehner prediction dataset 36 and the probabilistic Rhodes prediction dataset 46.</sentence>
					<sentence id="S2.345">All three datasets identify the interactions by stating the names and/or gene locus IDs of the genes that encode the interacting proteins.</sentence>
					<sentence id="S2.346">In contrast, we work directly on the protein sequences and so related the gene annotations to our protein identifiers by extracting Entrez Gene IDs corresponding to the IPI protein entries from the IPI cross-reference files (for the IPI release 3.24) 64.</sentence>
					<sentence id="S2.347">Ensembl gene identifiers (Ensembl 42) were also matched to Entrez Locus IDs (NCBI36) using BioMart 68.</sentence>
					<sentence id="S2.348"><xcope id="X2.348.1">Some gene-gene entries were <cue type="negation" ref="X2.348.1">not</cue> recovered in the conversion between different identifiers, or due to the deletion or replacement of some Entrez Locus IDs</xcope>.</sentence>
					<sentence id="S2.349">Despite this, 37714 gene-gene interactions were recovered from the Rhodes dataset and 6132 interactions from the Ramani dataset as well as 64306 and 10454 interactions from the Lehner full and core datasets respectively.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.350">Learning method</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.351">Semi-na&#239;ve Bayes classifiers were used to measure the likelihood of interaction of two proteins given the presence of the features considered.</sentence>
					<sentence id="S2.352">This learning method was chosen because it allows the integration of highly heterogeneous data in a model that is easy to interpret and that can readily accommodate missing data.</sentence>
					<sentence id="S2.353">The transparency of the method allows the straightforward determination of which features are most predictive of interaction at the level of the whole proteome as well as for individual protein pairs.</sentence>
					<sentence id="S2.354">The prediction of protein interaction is a binary problem which can be expressed in Bayesian formalism.</sentence>
					<sentence id="S2.355">We are interested in determining the posterior odds ratio of interaction of two proteins, given the presence of the features we are considering.</sentence>
					<sentence id="S2.356">This posterior odds ratio can be re-written using Bayes rule: Opost = P(I|f1,...,&#160;fn)P(~I|f1,...,&#160;fn)=P(f1,...,&#160;fn|I)&#8727;P(I)P(f1,...,&#160;fn)P(f1,...,&#160;fn|~I)&#8727;P(~I)P(f1,...,&#160;fn)=P(f1,...,&#160;fn|I)&#8727;P(I)P(f,...,&#160;fn|~I&#8727;P(~I)=P(I)P(~I&#8727;P(f1,...,&#160;fn|I)P(f1,...,&#160;fn|~I)=Oprior&#8727;LR(1,...,fn) where I is a binary variable representing interaction, ~I represents non-interaction, f1 through fn are the features we are considering, Oprior is the prior odds ratio and LR is the likelihood ratio.</sentence>
					<sentence id="S2.357">If the features considered are independent, the likelihood ratio LR can be calculated as the product of the individual likelihood ratios with respect to the features considered separately.</sentence>
					<sentence id="S2.358">If the features are <xcope id="X2.358.2"><cue type="negation" ref="X2.358.2">not</cue> independent</xcope>, all possible combinations of all states of these features must be considered, which <xcope id="X2.358.1"><cue type="speculation" ref="X2.358.1">can</cue> be computationally quite intensive</xcope>.</sentence>
					<sentence id="S2.359">In the independent case, the likelihood ratio can be calculated as: LR(f1,...,fn)=[P(f1,...,&#160;fn|I)P(f1,...,&#160;fn|~I)]=&#8719;i=1n[P(fi|I)P(fi~I)]</sentence>
					<sentence id="S2.360">The likelihood ratios for the different features considered can be estimated by evaluating the ratio of the proportion of interacting and non-interacting proteins for which a particular state of the feature is true in the training set (i.e. by determining to which bin of the feature the protein pair belongs, for every protein pair in the positive and negative training sets).</sentence>
					<sentence id="S2.361">More precisely, the training step consisted of calculating the respective proportions of positive and negative examples that fall into each bin of the feature(s) considered (i.e. that have a particular state).</sentence>
					<sentence id="S2.362">The likelihood ratio of interaction for a given state is simply the ratio of the proportion of all positives that have that state divided by the proportion of all negatives that have that same state.</sentence>
					<sentence id="S2.363">When a particular state of a feature occurs only in positive examples (known interacting proteins), the likelihoods are set to the highest non-infinite value of any state for that feature (to avoid infinite values).</sentence>
					<sentence id="S2.364">Additionally, when <xcope id="X2.364.1"><cue type="negation" ref="X2.364.1">no</cue> data are available for a specific feature (for a given pair of proteins)</xcope>, the likelihood of the feature is set to 1.0.</sentence>
					<sentence id="S2.365">For a detailed calculation of the likelihoods see Additional File 4.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.366">Prior odds ratio estimate</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.367">The prior odds ratio (Oprior) is difficult to estimate because we do <xcope id="X2.367.1"><cue type="negation" ref="X2.367.1">not</cue> know all the true interactions, even for a small subset of proteins</xcope>.</sentence>
					<sentence id="S2.368">The prior odds ratio of interaction for yeast was estimated by combining all protein-protein interactions (but only those related to direct physical interactions, and <xcope id="X2.368.1"><cue type="negation" ref="X2.368.1">no</cue> entries derived by synthetic lethal-type experiments) from the BIND, DIP and GRID databases</xcope> 656669.</sentence>
					<sentence id="S2.369">This subset of interactions contains 36466 distinct interactions involving 5202 distinct proteins, thus resulting in a prior odds ratio of 1/370.</sentence>
					<sentence id="S2.370">This is most <xcope id="X2.370.2"><cue type="speculation" ref="X2.370.2">likely</cue> a conservative estimate</xcope> since <xcope id="X2.370.1">a certain proportion of interactions remain <cue type="speculation" ref="X2.370.1">unknown</cue></xcope> and so when more data become available, the prior odds ratio will increase.</sentence>
					<sentence id="S2.371">For human proteins, 12191 distinct interactions were recovered, involving 5164 human proteins from the September 2005 version of the HPRD 15 and 26896 distinct interactions involving 7531 human proteins from the June 2006 version, leading respectively to prior odds estimates of 1/1093 and 1/1053.</sentence>
					<sentence id="S2.372">However, taking the subset of 5164 proteins from the September 2005 version that are seen in the June 2006 version (20842 distinct interactions), gave a prior odds of interaction estimate of 1/639.</sentence>
					<sentence id="S2.373">Thus, between the two releases of the HPRD, there was a large increase in the number of interactions for this subset of proteins and <xcope id="X2.373.1">this is <cue type="speculation" ref="X2.373.1">likely</cue> to continue for at least the next few releases</xcope>.</sentence>
					<sentence id="S2.374">Accordingly, it is reasonable to conclude that there are <xcope id="X2.374.1"><cue type="negation" ref="X2.374.1">not</cue> enough known human interactions to calculate a realistic and stable estimate of the prior odds ratio of interactions for human</xcope>.</sentence>
					<sentence id="S2.375">As a consequence, a prior odds ratio of 1/400 was used for all work in the paper, which is similar to the estimate for yeast and is <xcope id="X2.375.1"><cue type="speculation" ref="X2.375.1">likely</cue> still an underestimate of the true value</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.376">Features</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.377">Seven distinct features combined into five modules were investigated as summarized in Table 1 and described below.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S2.378">1.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.379">Expression module</sentence>
					<sentence id="S2.380">Expression data were downloaded from the Gene Expression Omnibus 70.</sentence>
					<sentence id="S2.381">The GDS596 dataset was used which examines gene expression profiles from 79 physiologically normal tissues obtained from various sources 71.</sentence>
					<sentence id="S2.382">Expression data were recovered for 10642 distinct transcripts in 158 different arrays (2 arrays per tissue).</sentence>
					<sentence id="S2.383">Pearson correlations were calculated for all 56620761 transcript pairs and correlation values were grouped into 20 bins of increasing co-expression.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S2.384">2.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.385">Orthology module</sentence>
					<sentence id="S2.386">Orthology maps between human and yeast, worm and fly were downloaded from the InParanoid database 72.</sentence>
					<sentence id="S2.387">Interaction datasets for model organisms were downloaded from the BIND 66, DIP 65 and GRID 69 databases.</sentence>
					<sentence id="S2.388">Orthology interaction data were classified into 13 bins.</sentence>
					<sentence id="S2.389">High, medium and low confidence bins were defined for human protein pairs that have interacting orthologs in either yeast, fly or worm (for a total of 9 bins).</sentence>
					<sentence id="S2.390">The high confidence bins were populated by human protein pairs that have interacting orthologs that both achieve an InParanoid score of 1 (i.e. both proteins involved in an interaction in another organism are respectively the best orthology match for the two human proteins under consideration).</sentence>
					<sentence id="S2.391">The medium confidence bins were populated by human protein pairs that have interacting orthologs but only one of the interacting orthologs has an InParanoid score of 1.</sentence>
					<sentence id="S2.392">The low confidence bins were filled by human protein pairs that have interacting orthologs according to InParanoid but <xcope id="X2.392.1"><cue type="negation" ref="X2.392.1">neither</cue> achieves a score of 1</xcope> (i.e. neither is the best match for the two human proteins under consideration).</sentence>
					<sentence id="S2.393">The orthology module has four additional bins: two bin for human pairs that have interacting paralogs in human (a medium and a low confidence bin which use the same definition as above for the model organisms), one bin for human pairs that have interacting homologs in more than one organism (these can be orthologs in yeast, worm or fly, or paralogs in human) and one bin for human pairs that have only non-interacting orthologs.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S2.394">3.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.395">Combined module</sentence>
					<sentence id="S2.396">This module incorporates three distinct features in a non-na&#239;ve Bayesian framework: subcellular localization, domain co-occurrence and post-translational modification co-occurrence.</sentence>
					<sentence id="S2.397">Subcellular localization</sentence>
					<sentence id="S2.398">PSLT (Protein Subcellular Localization Tool) subcellular localization predictions 54 were used to classify protein pairs in one of four groups:: pairs of proteins predicted to be in the same compartment, pairs of proteins predicted to be in neighboring compartments (cytosol-nucleus, endoplasmic reticulum-Golgi, Golgi-cytosol, cytosol-plasma membrane, and plasma membrane-secreted), pairs of proteins predicted in different non-neighboring compartments and pairs of proteins for which there were <xcope id="X2.398.1"><cue type="negation" ref="X2.398.1">no</cue> localization predictions</xcope>.</sentence>
					<sentence id="S2.399">Neighboring compartments were chosen as compartment pairs sharing a high proportion of proteins, as investigated previously 54.</sentence>
					<sentence id="S2.400">Co-occurrence of domains</sentence>
					<sentence id="S2.401">The chi-square test was used as a measure of the likelihood of co-occurrence of specific InterPro domains and motifs 73 in protein pairs.</sentence>
					<sentence id="S2.402">Chi-square scores were calculated for all pairs of domains/motifs that occurred in the training data and were then grouped into 5 bins of increasing value.</sentence>
					<sentence id="S2.403">Additionally, Pfam 74 domain pairs known to interact from three-dimensional structures 75 were included in the highest Chi-square score bin.</sentence>
					<sentence id="S2.404">When protein pairs contained more than one domain pair, the domain pair assigned to the highest Chi-square score bin was used to assign a likelihood of interaction.</sentence>
					<sentence id="S2.405">Post-translational modification (PTM) pair co-occurrence</sentence>
					<sentence id="S2.406">Likelihoods were assessed using a PTM pair enrichment score calculated as the probability of co-occurrence of two specific PTMs in all pairs of interacting protein pairs divided by the probability of occurrence of both of these PTMs separately: PTM_score=P(PTM[i],PTM[j]|I)P(PTM[i]|I)&#8727;P(PTM[j]|I) where PTM[i] and PTM[j] are distinct PTMs and I is the set of all interacting proteins that were used to train the predictor.</sentence>
					<sentence id="S2.407">The annotations of PTMs in human proteins were downloaded from UniProt 76 and HPRD 15.</sentence>
					<sentence id="S2.408">PTM instances described as "predicted", "probable" or "possible" were excluded, leaving 3439 distinct proteins with PTM annotations in the training set.</sentence>
					<sentence id="S2.409">The PTM pair enrichment scores were grouped into 4 bins of increasing value.</sentence>
					<sentence id="S2.410">The localization, co-occurrence of domains, and PTMs were considered simultaneously to measure their predictive power in assessing the likelihood of protein interaction.</sentence>
					<sentence id="S2.411">To do this, all possible combinations of the 4 localization bins, 5 chi-square domain-co-occurrence bins and 4 PTM_score bins were investigated and are referred to as the combined module.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S2.412">4.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.413">Disorder module</sentence>
					<sentence id="S2.414">It has been <xcope id="X2.414.1"><cue type="speculation" ref="X2.414.1">suggested</cue> that unstructured regions of proteins are often involved in binding interactions, particularly in the case of transient interactions</xcope> 77.</sentence>
					<sentence id="S2.415">Protein intrinsic disorder was predicted for all proteins considered by using the VSL2B predictor 78.</sentence>
					<sentence id="S2.416">The disorder score for protein pairs was then calculated as the sum of percent disorder for each protein of the pair.</sentence>
					<sentence id="S2.417">Disorder scores were grouped into 6 bins of increasing value.</sentence>
					<sentence id="S2.418">The Expression, Orthology, Combined and Disorder modules are referred collectively as the Group A modules.</sentence>
					<sentence id="S2.419">Likelihood ratios for each of the Group A modules are illustrated in Figure 1A (see Additional File 4 for complete likelihood ratios for every possible state of these modules and for detailed calculations of these likelihood ratios).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S2.420">5.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.421">Transitive module</sentence>
					<sentence id="S2.422">The transitive module works on the premise that <xcope id="X2.422.1">a pair of proteins is more <cue type="speculation" ref="X2.422.1">likely</cue> to interact if it shares interacting partners</xcope>.</sentence>
					<sentence id="S2.423">It does this by considering the local topology of the network predicted by the integration of the Group A modules as depicted in Figure 2.</sentence>
					<sentence id="S2.424">Thus, the transitive module takes as input the product of the likelihood ratios of all other modules considered by the predictor (as illustrated in Figure 1B).</sentence>
					<sentence id="S2.425">For each pair of proteins in the training set, the product of the likelihood ratios from all other modules (referred to as the preliminary score (PS) in Figure 1) was calculated for all protein pairs neighboring the pair (i.e. all protein pairs which involve one protein from the initial protein pair under study and for which it is <xcope id="X2.425.1"><cue type="speculation" ref="X2.425.1">possible</cue> to calculate such a score</xcope>).</sentence>
					<sentence id="S2.426">All preliminary scores above 10 were kept.</sentence>
					<sentence id="S2.427">This parameter was determined empirically.</sentence>
					<sentence id="S2.428">A neighborhood topology score T was then calculated as follows:</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S2.429">Transitive module hypothesis</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.430">Transitive module hypothesis.</sentence>
					<sentence id="S2.431">The Transitive module investigates <xcope id="X2.431.3"><cue type="speculation" ref="X2.431.3">whether</cue> <xcope id="X2.431.2">two proteins (such as i and j) that share many common interactors and have few additional interactors that are <xcope id="X2.431.1"><cue type="negation" ref="X2.431.1">not</cue> common to both proteins</xcope> are more <cue type="speculation" ref="X2.431.2">likely</cue> to interact than two proteins (such as i' and j') that share few common interactors</xcope></xcope>.</sentence>
					<sentence id="S2.432">T=&#8721;e&#8712;Ecse1+|Ei\Ec|+|Ej\Ec| where Ec is the set of edges that connect proteins i and j to their common interactors, Ei is the set of edges that involve protein i, se is the score (likelihood ratio) of edge e and Ei\Ec refers to the set difference of Ei and Ec.</sentence>
					<sentence id="S2.433">For a given set of neighbors, T increases as the interactions with these neighbors become more likely (as the sum of se increases).</sentence>
					<sentence id="S2.434">Additionally, the topology score T of a pair of proteins increases as the proportion of likely interactors that these two proteins share increases.</sentence>
					<sentence id="S2.435">The topology scores were grouped into 5 bins of increasing value.</sentence>
					<sentence id="S2.436">It should be noted that the neighborhood topology score calculated for a given protein pair does <xcope id="X2.436.1"><cue type="negation" ref="X2.436.1">not</cue> consider the preliminary score assigned to that protein pair</xcope>.</sentence>
					<sentence id="S2.437">It only considers the preliminary scores of its neighbors and so is truly based on the local network topology around that protein pair.</sentence>
					<sentence id="S2.438">Accordingly, the likelihood ratio the transitive module outputs for a given protein pair is independent of the likelihood ratio calculated by the Group A modules for this same protein pair.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.439">Correlation analysis</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.440">The Pearson correlation between pairs of modules was estimated by taking 150 samples of 10000 protein pairs each and calculating the Pearson correlation of the likelihood ratios for the two modules considered, for each sample.</sentence>
					<sentence id="S2.441">The reported correlation values are the average of the 150 experiments.</sentence>
					<sentence id="S2.442">Samples of the protein pair space were taken <xcope id="X2.442.1"><cue type="negation" ref="X2.442.1">instead of</cue> considering the whole space</xcope> as this was more computationally tractable.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S2.443">Accuracy measurements</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S2.444">The accuracy of the predictors was measured by performing five-fold cross validation experiments in which the datasets were randomly divided into five non-overlapping sets, four of which were used to train the predictor while the fifth was used to test the prediction accuracy.</sentence>
					<sentence id="S2.445">The accuracy reported is the average measured for all combinations of training and testing sets using these five sets.</sentence>
					<sentence id="S2.446">Testing was done by predicting the total likelihood scores for all protein pairs in the test set using the models computed in the training phase and then counting the number of pairs that were well predicted.</sentence>
					<sentence id="S2.447">We used the area under partial ROC curves as a measure of accuracy.</sentence>
					<sentence id="S2.448">Receiver operator characteristic (ROC) curves plot the true positive rate versus the false positive rate over their full range of possible values.</sentence>
					<sentence id="S2.449">In some circumstances, it is more informative to use partial ROC curves (ROCn curves) which illustrate the number of true positives identified by the predictor that score higher than the n highest scoring negatives, plotted for all values from 0 to n.</sentence>
					<sentence id="S2.450">There are many more negatives than positives in our datasets and this is also <xcope id="X2.450.1"><cue type="speculation" ref="X2.450.1">thought</cue> to be true for the full protein interaction networks we are modeling</xcope>.</sentence>
					<sentence id="S2.451">Since the aim is to identify the largest number of true interacting pairs while leaving out as many non-interacting pairs as possible, it is most informative to measure the performance of the predictor under conditions of very low false-positive rates.</sentence>
					<sentence id="S2.452">Accordingly, ROC50 and ROC100 curves were analyzed because given the size of the datasets, these curves consider all the protein pairs predicted to have a posterior odds ratio above 1.0, for all the predictors investigated.</sentence>
					<sentence id="S2.453">The area under ROC curves is often used as a summary measure of accuracy.</sentence>
					<sentence id="S2.454">For ROCn curves, it can be calculated as AUC ROCn = 1nT&#8727;(&#8721;i=1nTi) where i takes on values from 1 to n, T is the total number of positives in the test set and Ti is the number of positives that score higher than the ith highest scoring negative.</sentence>
				</DocumentPart>
		</Document>
		<Document type="Biological_full_article">
			<DocID type="BMC_ID">1471-2105-8-249</DocID>
				<DocumentPart type="Title">
					<sentence id="S3.1">A novel ensemble learning method for de novo computational identification of DNA binding sites</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S3.2">Abstract</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S3.3">Background</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.4">Despite the diversity of motif representations and search algorithms, the de novo computational identification of transcription factor binding sites remains constrained by the limited accuracy of existing algorithms and the need for user-specified input parameters that describe the motif being sought.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S3.5">Results</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.6">We present a novel ensemble learning method, SCOPE, that is based on the <xcope id="X3.6.1"><cue type="speculation" ref="X3.6.1">assumption</cue> that transcription factor binding sites belong to one of three broad classes of motifs:: non-degenerate, degenerate and gapped motifs</xcope>.</sentence>
					<sentence id="S3.7">SCOPE employs a unified scoring metric to combine the results from three motif finding algorithms each aimed at the discovery of one of these classes of motifs.</sentence>
					<sentence id="S3.8">We found that SCOPE's performance on 78 experimentally characterized regulons from four species was a substantial and statistically significant improvement over that of its component algorithms.</sentence>
					<sentence id="S3.9">SCOPE outperformed a broad range of existing motif discovery algorithms on the same dataset by a statistically significant margin.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S3.10">Conclusion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.11">SCOPE demonstrates that combining multiple, focused motif discovery algorithms can provide a significant gain in performance.</sentence>
					<sentence id="S3.12">By building on components that efficiently search for motifs <xcope id="X3.12.1"><cue type="negation" ref="X3.12.1">without</cue> user-defined parameters</xcope>, SCOPE requires as input only a set of upstream sequences and a species designation, making it a practical choice for non-expert users.</sentence>
					<sentence id="S3.13">A user-friendly web interface, Java source code and executables are available at http://genie.dartmouth.edu/scope.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S3.14">Backgound</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.15">The computational discovery of DNA binding sites for previously uncharacterized transcription factors in groups of co-regulated genes is a well-studied problem with a great deal of practical relevance to the biologist, since such binding sites provide targets for mutational analyses (for reviews see 123).</sentence>
					<sentence id="S3.16">The position-specific variability of transcription factor binding sites makes their de novo identification challenging.</sentence>
					<sentence id="S3.17">Many computational motif finding methods are based on the observation that transcription factor binding sites occur more often than expected by chance in the upstream regions of the set of genes regulated by the same transcription factor 1.</sentence>
					<sentence id="S3.18">The problem thus simplifies to the identification of overrepresented motifs in a given set of upstream sequences.</sentence>
					<sentence id="S3.19">Motif finding programs rely on a search algorithm to optimize a motif model (an abstract representation of a set of transcription factor binding sites).</sentence>
					<sentence id="S3.20">Most recent programs represent motifs as position weight matrices (PWMs), which record the frequency of each base at every position in the motif.</sentence>
					<sentence id="S3.21">Other motif finding programs have relied on the use of consensus motif models (in which every base is represented by a letter of the 15-letter IUPAC code, which accounts for degeneracies as well as single bases) or k-mismatch motif models (in which a non-degenerate word with at most k allowed mismatches is used to represent the word).</sentence>
					<sentence id="S3.22">Regardless of the motif model used, a search for all overrepresented motifs of any length and degree of degeneracy leads to a dauntingly large search space.</sentence>
					<sentence id="S3.23">Thus, motif finding algorithms restrict their search space by using simplified motif representations, employing heuristic search strategies that are prone to local optima, or invoking additional parameters to limit the search space and thereby pass some of the optimization process off to the user 3.</sentence>
					<sentence id="S3.24">Program parameters (such as motif length, number of occurrences and orientation) <xcope id="X3.24.2">that <cue type="negation" ref="X3.24.2">cannot</cue> be reasonably specified by the user <xcope id="X3.24.1"><cue type="negation" ref="X3.24.1">without</cue> prior knowledge about the true binding sites</xcope></xcope> are referred to as nuisance parameters 4.</sentence>
					<sentence id="S3.25">Selection of the correct settings for these parameters is a crucial step in motif finding, and <xcope id="X3.25.1">is often <cue type="speculation" ref="X3.25.1">assumed</cue> to be the domain of experts</xcope>.</sentence>
					<sentence id="S3.26">In a recent evaluation, Hu and colleagues 4 compared the performance of five motif finders on a single prokaryotic genome, systematically exploring the effects of nuisance parameters, including expected motif length and number of occurrences.</sentence>
					<sentence id="S3.27">Every motif finder they tested was found to be sensitive to values used for these parameters.</sentence>
					<sentence id="S3.28"><xcope id="X3.28.1">Guidance on the specific parameter settings to use for given motif finding situations is <cue type="negation" ref="X3.28.1">not</cue> provided in most publications presenting motif finders</xcope>.</sentence>
					<sentence id="S3.29">Even <xcope id="X3.29.1"><cue type="speculation" ref="X3.29.1">assuming</cue> that optimal parameter settings exist for a motif finding program for each specific situation</xcope>, for the typical biologist looking to identify motifs in a set of uncharacterized sequences, acquiring such expertise is an onerous task.</sentence>
					<sentence id="S3.30">Nuisance parameters complicate the interpretation of performance comparisons as well.</sentence>
					<sentence id="S3.31">A recent large-scale performance comparison between thirteen different motif finding tools used expert knowledge in setting the parameters for every program 5.</sentence>
					<sentence id="S3.32">Several of the programs contributing to the performance comparison were run with different parameter settings for each regulon, and in some cases, motifs were hand filtered as a post-processing step.</sentence>
					<sentence id="S3.33">Such performance comparisons evaluate not just algorithms but also the expertise of the users, making it difficult for a first-time user to select a motif finder on a principled basis.</sentence>
					<sentence id="S3.34">A key result of the Tompa, et al. study was the finding that all of the motif finders had roughly the same average performance under a wide range of conditions and test statistics 5.</sentence>
					<sentence id="S3.35">This finding was particularly notable because the motif finders studied employed a wide range of motif representations, scoring functions and search strategies and all were operated under the most favorable conditions possible.</sentence>
					<sentence id="S3.36">Although the average performance of the programs did not differ significantly, the authors found that, for each pair of programs, each program performed better than the other on some subset of the data 5.</sentence>
					<sentence id="S3.37">Previous studies over smaller numbers of motif finders have found that <xcope id="X3.37.1"><cue type="negation" ref="X3.37.1">no</cue> program clearly stands out as superior to the others</xcope> and each program outperforms all others on some subset of the regulons 678.</sentence>
					<sentence id="S3.38">This diversity of performance has led a number of authors to <xcope id="X3.38.2"><cue type="speculation" ref="X3.38.2">speculate</cue> that ensemble methods, comprising multiple motif finders, <xcope id="X3.38.1"><cue type="speculation" ref="X3.38.1">may</cue> lead to improvements in accuracy</xcope></xcope> 158.</sentence>
					<sentence id="S3.39">Ensemble methods, well known in the machine learning community 9, are typically composed of multiple methods comprising different search strategies (or the same search strategies with different initiation settings or random restarts) with a unified objective function.</sentence>
					<sentence id="S3.40">The final predictions are chosen from the ensemble of methods by a learning rule, which <xcope id="X3.40.1"><cue type="speculation" ref="X3.40.1">may</cue> be as simple as finding the maximum score from all the methods, or as complex as optimizing a weighted scoring scheme from among the methods</xcope>.</sentence>
					<sentence id="S3.41">The construction of this learning rule is key to the performance of an ensemble learning method, as the performance of an ensemble method with an ineffective learning rule will be the average of the performance of its component algorithms.</sentence>
					<sentence id="S3.42">In this context, we note that Tompa et al. 5 found that, although every motif finding program tested had some regulons on which its performance was clearly superior, it was <xcope id="X3.42.2"><cue type="negation" ref="X3.42.2">not</cue> <xcope id="X3.42.1"><cue type="speculation" ref="X3.42.1">possible</cue> a priori to predict which motif finder represented the best choice under any given set of conditions</xcope></xcope> 5.</sentence>
					<sentence id="S3.43">This observation serves to illustrate the challenges to the construction of an effective learning rule.</sentence>
					<sentence id="S3.44">To the best of our knowledge, only one study to date has explored ensemble learning in motif finding.</sentence>
					<sentence id="S3.45">Hu, Li and Kihara 4 described a simple ensemble method wherein the component programs were random restarts of the same stochastic algorithm (such as Gibbs sampling or Expectation Maximization) and the learning rule was a voting scheme in which the results of each random restart cast a "vote" for which positions in the DNA sequence should be part of the final reported motif (hereafter, we refer to this as the HLK method).</sentence>
					<sentence id="S3.46">Under this scheme, the authors found that ensemble learning resulted in an increase in performance ranging from 6 to 45%.</sentence>
					<sentence id="S3.47">The HLK voting method provides a framework wherein <xcope id="X3.47.1">a number of different motifs finders <cue type="speculation" ref="X3.47.1">can</cue> be combined under the heuristic that if several motif finders make the same (or overlapping) prediction, then that prediction is accurate</xcope>.</sentence>
					<sentence id="S3.48">Here we present a novel ensemble motif finder based on a different conceptual approach.</sentence>
					<sentence id="S3.49"><xcope id="X3.49.4"><cue type="negation" ref="X3.49.4">Rather than</cue> <xcope id="X3.49.3">randomly restarting the same search algorithm <cue type="speculation" ref="X3.49.3">or</cue> comparing multiple search strategies that all search for the same global optimum (and are <xcope id="X3.49.2"><cue type="speculation" ref="X3.49.2">potentially</cue> vulnerable</xcope> to the same local optima)</xcope></xcope>, our algorithm <xcope id="X3.49.1"><cue type="speculation" ref="X3.49.1">assumes</cue> that the ""biological significance surface"" primarily consists of three local optima, and that one of these peaks represents the global optimum</xcope>.</sentence>
					<sentence id="S3.50">Thus, our ensemble uses three specialized algorithms whose search spaces restrict them to each of these three local optima (BEAM for non-degenerate motifs, PRISM for degenerate motifs and SPACER for bipartite motifs).</sentence>
					<sentence id="S3.51">We have previously demonstrated that the greedy search strategies employed by each of these methods allow them to reliably search their respective motif domains <xcope id="X3.51.1"><cue type="negation" ref="X3.51.1">without</cue> the use of nuisance parameters</xcope>, as the algorithms themselves efficiently optimize the parameters that are typically forced on the users 101112.</sentence>
					<sentence id="S3.52">The results of these component algorithms are then combined using a learning rule that is simply the maximum score returned by each component algorithm.</sentence>
					<sentence id="S3.53">To make comparisons possible, the motif scores returned by each algorithm are penalized according to the complexity of the motif.</sentence>
					<sentence id="S3.54">The resulting ensemble algorithm, SCOPE, has <xcope id="X3.54.1"><cue type="negation" ref="X3.54.1">no</cue> nuisance parameters</xcope> and performs significantly better than its component algorithms.</sentence>
					<sentence id="S3.55">In addition, we find that SCOPE performs favorably compared to a diverse range of existing methods and is robust to the presence of extraneous sequences in its input.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S3.56">Results</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S3.57">Algorithm</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.58">SCOPE takes as input a set of sequences U that are upstream of a set of genes G <xcope id="X3.58.1">that are <cue type="speculation" ref="X3.58.1">thought</cue> to be coregulated</xcope>.</sentence>
					<sentence id="S3.59">The ultimate goal of a motif finder is to identify the specific subsequences &#219; in U that act as binding sites for the transcription factor(s) that regulate G.</sentence>
					<sentence id="S3.60">In practice, sets of binding sites are represented using a motif.</sentence>
					<sentence id="S3.61">We have found that simple consensus motifs over the full IUPAC alphabet (a 15-letter code consisting of the bases A,T,C,G and all possible combinations) provide enough representational power to adequately describe &#219;, while still allowing for an efficient search 34.</sentence>
					<sentence id="S3.62">While alternative representations, such as position weight matrices (PWMs) are more expressive, their heuristic searches are <xcope id="X3.62.2"><cue type="speculation" ref="X3.62.2">prone to</cue> local optima</xcope> and often do <xcope id="X3.62.1"><cue type="negation" ref="X3.62.1">not</cue> perform well in practice</xcope> @34111213@.</sentence>
					<sentence id="S3.63">SCOPE has three component algorithms, BEAM, PRISM and SPACER, which search for non-degenerate, short degenerate, and long, highly degenerate and "gapped" motifs, respectively (Figure 1).</sentence>
					<sentence id="S3.64">Each motif is scored considering one or both strands and the motif is marked to indicate which calculation scores higher.</sentence>
					<sentence id="S3.65">The results of the three algorithms are merged and sorted.</sentence>
					<sentence id="S3.66">Artifactual motifs, whose significance can be accounted for by higher scoring motifs that they overlap, are identified and removed (for details, see Additional file 1, section S1).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S3.67">Flow diagram for SCOPE</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.68">Flow diagram for SCOPE.</sentence>
					<sentence id="S3.69">BEAM and SPACER are run independently; PRISM runs on the top 100 motifs output by BEAM.</sentence>
					<sentence id="S3.70">For yeast (whose upstream regions are standardized to 800 bp), BEAM and PRISM use the overrepresentation-KS objective function (so/ks), while SPACER's slower running time requires the simpler overrepresentation objective function (so).</sentence>
					<sentence id="S3.71">The top 5 motifs from SPACER are rescored using the combined objective function.</sentence>
					<sentence id="S3.72">For bacteria and Drosophila, upstream regions are defined to be the intergenic region upstream of each gene; thus, <xcope id="X3.72.1">the KS objective function is <cue type="negation" ref="X3.72.1">not</cue> used</xcope>.</sentence>
					<sentence id="S3.73">The results of each program are sorted by Sig and lower scoring motifs that substantially overlap higher scoring motifs are removed.</sentence>
					<sentence id="S3.74">The filtered lists of motifs from the three programs are finally merged by Sig score.</sentence>
					<sentence id="S3.75">Repetitive motifs are identified and removed during all stages.</sentence>
					<sentence id="S3.76">Each of SCOPE's three component algorithms seeks to maximize the same objective function over a different class of motifs.</sentence>
					<sentence id="S3.77">Let M be a random variable over the full space of IUPAC words.</sentence>
					<sentence id="S3.78">The statistical significance p(M = m) of a particular word m is determined by the distribution of M over the entire space of upstream sequences in the given species.</sentence>
					<sentence id="S3.79">In general, we seek to maximize -log(p(M = m)).</sentence>
					<sentence id="S3.80"><xcope id="X3.80.1"><xcope id="X3.80.2">All values of M are <cue type="negation" ref="X3.80.1">not</cue>, however, equally <cue type="speculation" ref="X3.80.2">likely</cue> a priori</xcope></xcope>.</sentence>
					<sentence id="S3.81">For example, it is quite <xcope id="X3.81.1"><cue type="speculation" ref="X3.81.1">likely</cue> that there exists an extremely long sequence that is entirely unique to U</xcope>.</sentence>
					<sentence id="S3.82"><xcope id="X3.82.2"><xcope id="X3.82.3">Such a unique sequence <cue type="speculation" ref="X3.82.2">would</cue> <cue type="speculation" ref="X3.82.2">appear</cue> to be highly significant</xcope></xcope>, until we consider that we have in effect searched all <xcope id="X3.82.1"><cue type="speculation" ref="X3.82.1">possible</cue> sequences</xcope> until we found one that is unique.</sentence>
					<sentence id="S3.83">To correct for this multiple hypothesis testing problem, van Helden et al. 14 proposed using a Bonferroni correction, in which p(M = m) is penalized by the number of motifs N of length |m|:</sentence>
					<sentence id="S3.84">Sig = -log(p(M = m)&#183;N).</sentence>
					<sentence id="S3.85">Thus, if m = "ACGT", N = 44.</sentence>
					<sentence id="S3.86">We employed this same definition of Sig for BEAM, our algorithm that searches for non-degenerate motifs 10.</sentence>
					<sentence id="S3.87">Defining N for degenerate or bipartite motifs raises a significant conceptual challenge.</sentence>
					<sentence id="S3.88">Van Helden et al. 14 chose to use the same definition, but limited their search to a small number of degenerate bases.</sentence>
					<sentence id="S3.89">In contrast, we have <xcope id="X3.89.4"><cue type="speculation" ref="X3.89.4">proposed</cue> that <xcope id="X3.89.2"><xcope id="X3.89.3">all characters <cue type="speculation" ref="X3.89.2">should</cue> <cue type="negation" ref="X3.89.3">not</cue> be treated equally</xcope></xcope>, but <xcope id="X3.89.1"><cue type="speculation" ref="X3.89.1">should</cue> be penalized in proportion to the information provided by them</xcope></xcope> 1112.</sentence>
					<sentence id="S3.90">By this logic, "<xcope id="X3.90.1">"ACGT"" will <cue type="negation" ref="X3.90.1">not</cue> be penalized differently from ""ACNNNNGT"</xcope>", as both have the same number of bases that contribute any information to protein-DNA binding.</sentence>
					<sentence id="S3.91">Building on this intuition, one <xcope id="X3.91.1"><cue type="speculation" ref="X3.91.1">can</cue> argue that the characters ""A"" and ""not-A"" (IUPAC character ""B")" are roughly equivalent, while ""A or G"" (IUPAC character ""R")" is different from ""A"</xcope>" as there are six ways to define a combination of two bases, while only four ways to define a combination of one base or three bases.</sentence>
					<sentence id="S3.92">For motif m = m1m2...mn, we can therefore define N = &#8719; Choose(4, |mi|), where |mi| is the number of DNA bases covered by the IUPAC character mi.</sentence>
					<sentence id="S3.93">In the case were both orientations of the motif are considered, this number is adjusted to account for palindromes.</sentence>
					<sentence id="S3.94">The resulting Sig score thus penalizes motifs based on their length and degeneracy, enabling fair comparisons to be made between different motif classes.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S3.95">Testing</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S3.96">Evaluation of objective functions used by SCOPE</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.97">Each component algorithm in SCOPE efficiently searches its restricted search space, keeping SCOPE's runtime low (average runtime on our datasets were about one minute).</sentence>
					<sentence id="S3.98">This efficiency allowed us to explore several objective functions for scoring the statistical significance p(M = m) of motifs.</sentence>
					<sentence id="S3.99">These objective functions were as follows: position bias (based on the Kolmogorov-Smirnov, or KS, statistic), overrepresentation (a Poisson-based measure based on how often a motif occurs in U) and coverage (a Poisson-based measure based on how many upstream sequences contain the motif).</sentence>
					<sentence id="S3.100">For precise definitions, see Methods.</sentence>
					<sentence id="S3.101">To establish which objective function (or combination of functions) was most suitable, we tested each objective function independently of SCOPE, using a subset of the S. cerevisiae dataset.</sentence>
					<sentence id="S3.102">The measure used to assess the biological relevance of a motif was accuracy, a measure of the nucleotide level overlap between a motif and the known binding sites (for details see Methods).</sentence>
					<sentence id="S3.103">From each regulon from the SCPD database 15 we selected ten six-mers at random from the upstream sequences and ten six-mers at random from the collection of known binding sites for that regulon.</sentence>
					<sentence id="S3.104">For each of these sampled six-mers, we calculated accuracy with respect to the known binding sites.</sentence>
					<sentence id="S3.105">We also calculated the Sig score for each six-mer, using four objective functions (KS, overrepresentation, coverage and combined KS-overrepresentation).</sentence>
					<sentence id="S3.106">We then plotted Sig versus accuracy for each objective function, to determine which objective functions correlated most strongly with biological relevance (Figure 2).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S3.107">Correlation between accuracy and Sig scores</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.108">Correlation between accuracy and Sig scores.</sentence>
					<sentence id="S3.109">Non-degenerate 6-mers from S. cerevisiae were scored according to Sig scores based on (a) Overrepresentation, (b) Overrepresentation-KS, (c) Coverage and (d) KS metrics of statistical significance.</sentence>
					<sentence id="S3.110">The 6-mers were randomly sampled from both the upstream regions and the known binding sites to ensure coverage or a wide range of accuracy.</sentence>
					<sentence id="S3.111">The x-axis plots the Bonferroni-corrected and log2 transformed Sig score for each metric.</sentence>
					<sentence id="S3.112">The red lines indicate the 95th Sig percentile.</sentence>
					<sentence id="S3.113">These plots demonstrate that overrepresentation is a closer approximation to biological relevance than coverage or KS alone.</sentence>
					<sentence id="S3.114">Adding KS to overrepresentation modestly improved the correlation by 13% (as compared to overrepresentation alone) to R2 = 0.28.</sentence>
					<sentence id="S3.115">To assess the degree of class separation achieved by the two objective functions, we ranked the sampled six-mers by Sig score, and calculated the percentage of motifs with high Sig scores (in the 95th percentile and above) that possessed a reasonable degree of overlap with the known binding sites (accuracy &#8805; 0.10).</sentence>
					<sentence id="S3.116">By the overrepresentation measure, 74.4% of high scoring motifs had accuracy = 0.10, while 79.1% of high scoring motifs by KS-overrepresentation had accuracy &#8805; 0.10.</sentence>
					<sentence id="S3.117">This analysis <xcope id="X3.117.2"><cue type="speculation" ref="X3.117.2">suggests</cue> that more complex objective functions <xcope id="X3.117.1"><cue type="speculation" ref="X3.117.1">may</cue> provide a better estimate of biological significance than the overrepresentation objective functions commonly used</xcope></xcope>.</sentence>
					<sentence id="S3.118">We thus chose to run SCOPE using the overrepresentation-KS combined objective function on the S. cerevisiae dataset, in which the upstream regions are of fixed length.</sentence>
					<sentence id="S3.119">We used the overrepresentation objective function for the other species, as our upstream definitions for those species were of variable length due to the available annotations.</sentence>
					<sentence id="S3.120">Because identifying the genomic positions of highly degenerate bipartite motifs is prohibitively slow, initial rankings of motifs for SPACER were computed using the overrepresentation objective function, and the overrepresentation-KS objective function was used only to produce the final ordering and scores.</sentence>
					<sentence id="S3.121">Although the KS objective function is computationally expensive (linear in the frequency of the motif in the genome), the SCOPE algorithms all aggressively limit the search space, thereby making <xcope id="X3.121.1">the use of this objective function &#8211; and exploration of other complex objective functions &#8211; <cue type="speculation" ref="X3.121.1">possible</cue></xcope>.</sentence>
					<sentence id="S3.122">The surprisingly low correlations between Sig and accuracy <xcope id="X3.122.2"><cue type="speculation" ref="X3.122.2">may</cue> <xcope id="X3.122.1"><cue type="speculation" ref="X3.122.1">indicate that</cue> the objective functions employed by motif finding programs are only a first approximation to biological significance</xcope></xcope>.</sentence>
					<sentence id="S3.123">Indeed, previous studies have reported little or <xcope id="X3.123.1"><cue type="negation" ref="X3.123.1">no</cue> correlation</xcope> between the significance measures of various motif finders and measures of accuracy 416.</sentence>
					<sentence id="S3.124">Further research into more biologically accurate objective functions <xcope id="X3.124.1"><cue type="speculation" ref="X3.124.1">may</cue> yield better performance for motif discovery algorithms</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S3.125">Evaluation of SCOPE performance and ensemble learning scheme</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.126">We first assessed the performance of the optimized SCOPE framework on synthetic datasets (for details, see Additional file 1, section S2).</sentence>
					<sentence id="S3.127">SCOPE performed well on the synthetic datasets, correctly identifying 92% of planted motifs that are over-represented relative to background (those motifs with a Sig score of greater than 5; Figure 3).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S3.128">Performance at different overrepresentation Sig values on synthetic data</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.129">Performance at different overrepresentation Sig values on synthetic data.</sentence>
					<sentence id="S3.130">A motif was "found" if the top scoring motif returned by SCOPE overlapped the planted motif by at least 50%.</sentence>
					<sentence id="S3.131">Different Sig values were achieved by varying the number of upstream regions, the number of motifs per upstream region, and the number of extraneous upstream regions <xcope id="X3.131.1"><cue type="negation" ref="X3.131.1">without</cue> planted motifs</xcope>.</sentence>
					<sentence id="S3.132">A Sig value of 0 implies that one motif of that significance is expected by chance.</sentence>
					<sentence id="S3.133">While synthetic test sets are useful in algorithmic development and initial testing, the results of such tests must be taken with a grain of salt, as they are highly dependent on the model used to generate the test sets 6.</sentence>
					<sentence id="S3.134">We therefore tested SCOPE on an extensive array of regulons with known binding sites (for details of datasets, see Additional file 1, section S3).</sentence>
					<sentence id="S3.135">We ran SCOPE on each regulon and, following the scoring methodology used by Sinha and Tompa 6, we computed the accuracy for each of the top three motifs reported by SCOPE against the known binding sites.</sentence>
					<sentence id="S3.136">The motifs reported by SCOPE overlap to a large extent with the published cis-regulatory elements (as discussed in Additional file 1, section S3, a difference of one base pair length between the reported motif and the published cis-regulatory element results in an expected accuracy of about 0.25).</sentence>
					<sentence id="S3.137">SCOPE was run on 78 regulons from S. cerevisiae, B. subtilis, E. coli and D. melanogaster.</sentence>
					<sentence id="S3.138">On these datasets, SCOPE's average accuracy was 0.28, 0.29, 0.16, and 0.08 respectively.</sentence>
					<sentence id="S3.139">SCOPE's reported accuracy was significantly higher than any of its component algorithms (Table 1).</sentence>
					<sentence id="S3.140">Indeed, we found that SCOPE increased accuracy by 31&#8211;44% over BEAM, PRISM or SPACER alone.</sentence>
					<sentence id="S3.141">This improvement was achieved by combining BEAM's high positive predictive value (PPV) with PRISM's high sensitivity (Figure 4).</sentence>
					<sentence id="S3.142">Sensitivity is defined here as the fraction of the known binding sites (at the nucleotide level) predicted by the motif finder, and PPV is defined as the fraction of nucleotides predicted by the motif finder that correspond to the known binding sites (see Methods for details).</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S3.143">Summary results for performance comparisons between SCOPE and its component algorithms, on all regulons.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.144">A "Win" is a regulon for which a program had the highest accuracy and that accuracy was at least 0.10.</sentence>
					<sentence id="S3.145">Programs in a two-way tie are credited with 0.5 wins each, so by construction, SCOPE can at best share a win with one of the other programs.</sentence>
					<sentence id="S3.146">A perfect winner-take-all ensemble method <xcope id="X3.146.1"><cue type="speculation" ref="X3.146.1">would</cue> have the same number of wins as all the component algorithms combined</xcope>.</sentence>
					<sentence id="S3.147">A "clear win (loss)" is a regulon for which SCOPE's accuracy was at least 0.10 higher (lower) than the other program.</sentence>
					<sentence id="S3.148">The p-value reported for the paired t-test was Bonferroni-corrected to account for multiple (three) comparisons.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S3.149">Average and standard error of sensitivity and PPV for the component algorithms of SCOPE on all 78 regulons</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.150">Average and standard error of sensitivity and PPV for the component algorithms of SCOPE on all 78 regulons.</sentence>
					<sentence id="S3.151">Bars represent standard error.</sentence>
					<sentence id="S3.152">An ensemble motif finder with a learning rule that is <xcope id="X3.152.1"><cue type="negation" ref="X3.152.1">no</cue> better than random</xcope> will provide an accuracy that is equal to the average of its three component algorithms.</sentence>
					<sentence id="S3.153">To provide a basis for evaluating the performance of SCOPE's learning rule, we constructed an ensemble learning method (referred to here as BASELINE) from the results of BEAM, PRISM and SPACER, by randomly selecting one of the accuracies from these three programs for each regulon.</sentence>
					<sentence id="S3.154">Over 120,000 trials, BASELINE's average performance on this dataset was 0.176 with a standard deviation of 0.013.</sentence>
					<sentence id="S3.155">BASELINE's average score never exceeded that of SCOPE (p &lt; 8.25 &#215; 10-6).</sentence>
					<sentence id="S3.156">When compared to its component algorithms, SCOPE picked the highest accuracy motif in 66% of the cases (as opposed to 33% for a random selection between three algorithms).</sentence>
					<sentence id="S3.157">These results <xcope id="X3.157.2"><cue type="speculation" ref="X3.157.2">suggest</cue> that SCOPE's learning rule is highly effective, though <xcope id="X3.157.1">it <cue type="speculation" ref="X3.157.1">may</cue> certainly be improved further</xcope></xcope>.</sentence>
					<sentence id="S3.158">Of course, SCOPE's learning rule is extremely simple, and more complex learning rules <xcope id="X3.158.1"><cue type="speculation" ref="X3.158.1">may</cue> allow SCOPE to approach its theoretical upper bound</xcope>.</sentence>
					<sentence id="S3.159">One rule that <xcope id="X3.159.1"><cue type="speculation" ref="X3.159.1">may</cue> prove effective</xcope> is to weight the output of each algorithm according to (for example) the frequency of occurrence of each class of motif (non-degenerate, short degenerate or long degenerate) in the species or by learning the appropriate weights on a representative training set, creating, in effect, a Na&#239;ve Bayesian Network.</sentence>
					<sentence id="S3.160">The training of a more complex learning rule must, however, be performed in a cross-validation framework, and the size of the available dataset of regulons will place a practical limit on the complexity of the learning rule that can be devised.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S3.161">Comparison with other motif finding programs</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.162">To provide a frame of reference for SCOPE's performance, we ran ten other popular motif finders on these datasets (for details and references see Table 2).</sentence>
					<sentence id="S3.163">We ran all programs directly from their websites, leaving all parameters at their defaults.</sentence>
					<sentence id="S3.164">The only parameter that we specified (where available) was the species from which the background sequences were derived.</sentence>
					<sentence id="S3.165">Thus, <xcope id="X3.165.1">the results of this performance comparison <cue type="speculation" ref="X3.165.1">may</cue> be interpreted as a comparison against other motif finders when those motif finders are run using their default values</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S3.166">Motif discovery algorithms used in the performance comparison.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.167">Nuisance parameters are parameters that <xcope id="X3.167.2"><cue type="speculation" ref="X3.167.2">cannot</cue> be precisely defined <xcope id="X3.167.1"><cue type="negation" ref="X3.167.1">without</cue> knowledge of the true binding sites (such as motif length, number of occurrences and orientation)</xcope></xcope>.</sentence>
					<sentence id="S3.168">For MotifSampler and wConsensus, the lower part of the range indicates required parameters, while the upper part indicates the total number of parameters, including "power user" parameters that <xcope id="X3.168.1">the program authors stress <cue type="speculation" ref="X3.168.1">should</cue> typically be left as default</xcope>.</sentence>
					<sentence id="S3.169">Motif model abbreviations: cons = consensus; PWM = position weight matrix; mis = consensus with predefined number of allowed non-position-specific mismatches.</sentence>
					<sentence id="S3.170">SCOPE has <xcope id="X3.170.1"><cue type="negation" ref="X3.170.1">no</cue> user-adjustable parameters</xcope>, although its component algorithms do contain a number of internal parameters "("hyperparameters")" that govern their search over common nuisance parameters.</sentence>
					<sentence id="S3.171">On synthetic datasets, we found SCOPE's component algorithms to be quite robust to the settings of these hyperparameters.</sentence>
					<sentence id="S3.172">We have therefore fixed those parameters to reasonable values and do <xcope id="X3.172.1"><cue type="negation" ref="X3.172.1">not</cue> expose them to the user</xcope> 101112.</sentence>
					<sentence id="S3.173">This construction means that SCOPE can only run in a default configuration.</sentence>
					<sentence id="S3.174">We compared the motif finding programs using the criteria set forth in Sinha and Tompa, including average accuracy and the number of total wins (highest accuracy on a regulon, where that accuracy is at least 0.1) 6.</sentence>
					<sentence id="S3.175">On this dataset, SCOPE had the highest score by both criteria (Figure 5a).</sentence>
					<sentence id="S3.176">The cumulative distribution of accuracy shows that SCOPE had the most high-scoring motifs at every level (Figure 5b).</sentence>
					<sentence id="S3.177">When we looked at the number of clear head-to-head wins (such a win is taken to occur when the difference in accuracy between SCOPE and another motif finder is greater than 0.1 6), we found that SCOPE scored a clear majority (82%) of clear head-to-head wins (Figure 5c).</sentence>
					<sentence id="S3.178">The average accuracies of BEAM, PRISM and SPACER on this dataset were similar to those of the ten other programs.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S3.179">Performance comparisons</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.180">Performance comparisons.</sentence>
					<sentence id="S3.181">(a) Mean and standard error of accuracy for each of 78 regulons.</sentence>
					<sentence id="S3.182">(b) Cumulative distribution of accuracy for each program.</sentence>
					<sentence id="S3.183">(c) Fraction of regulons with a clear outcome (margin of difference in accuracy between two programs was greater than 0.10) won by SCOPE.</sentence>
					<sentence id="S3.184">Program abbreviations and details in Table 2; performance details in tables S1 and S2 in Additional file 1.</sentence>
					<sentence id="S3.185">A formal statistical analysis found that SCOPE's performance margin over the other motif finders run on this dataset was statistically significant at p &lt; 10-5 (for details, see Additional file 1, section S3).</sentence>
					<sentence id="S3.186">Corroborating the results of previously published performance comparisons @14567@, <xcope id="X3.186.1"><cue type="negation" ref="X3.186.1">none</cue> of the other programs showed a statistically significant difference relative to the other nine</xcope>.</sentence>
					<sentence id="S3.187">Similarly, <xcope id="X3.187.1"><cue type="negation" ref="X3.187.1">none</cue> of SCOPE's component algorithms outperformed the other ten programs on this dataset by a statistically significant margin</xcope>.</sentence>
					<sentence id="S3.188">SCOPE's high accuracy was a reflection of both high PPV and high sensitivity (Figure 6a; see Methods for a precise definition).</sentence>
					<sentence id="S3.189">By these measures, SCOPE was the only program that scored highly in both sensitivity and PPV (ranking first and second respectively).</sentence>
					<sentence id="S3.190">In contrast, <xcope id="X3.190.1"><cue type="negation" ref="X3.190.1">none</cue> of the other motif finders that performed well by one criterion performed well by the other</xcope>, as shown by the average ranks for each motif finder over both sensitivity and PPV (Figure 6b).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S3.191">(a) Average and standard error of sensitivity and PPV for each program on all 78 regulons</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.192">(a) Average and standard error of sensitivity and PPV for each program on all 78 regulons.</sentence>
					<sentence id="S3.193">In cases where the program <xcope id="X3.193.1"><cue type="negation" ref="X3.193.1">failed</cue> to return a result</xcope>, the sensitivity is 0 and the PPV is undefined.</sentence>
					<sentence id="S3.194"><xcope id="X3.194.2">Cases where a program did <xcope id="X3.194.1"><cue type="negation" ref="X3.194.1">not</cue> support the species</xcope> were <cue type="negation" ref="X3.194.2">not</cue> included</xcope>.</sentence>
					<sentence id="S3.195">(b) Ranks on this plot were computed by taking the average of sensitivity and PPV ranks for each program.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S3.196">Performance in the presence of extraneous upstream sequences</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.197">In practice, microarray co-expression data are often used to identify genes in a particular regulon.</sentence>
					<sentence id="S3.198">This approach identifies genes that are either directly or indirectly regulated by the transcription factor of interest.</sentence>
					<sentence id="S3.199">Therefore, sets of genes identified from co-expression data <xcope id="X3.199.1"><cue type="speculation" ref="X3.199.1">may</cue> often contain multiple extraneous upstream sequences</xcope>.</sentence>
					<sentence id="S3.200">Adding sequences that do <xcope id="X3.200.1"><cue type="negation" ref="X3.200.1">not</cue> contain binding sites</xcope> decreases the signal-to-noise ratio, making motif finding more difficult 4.</sentence>
					<sentence id="S3.201">We thus tested SCOPE's performance on regulons containing additional extraneous upstream sequences.</sentence>
					<sentence id="S3.202">For all 33 regulons in the SCPD dataset, we added randomly selected upstream S. cerevisiae sequences such that the total number of extraneous sequences was between 0.5 and 4 times the number of true upstream sequences in the regulon.</sentence>
					<sentence id="S3.203">SCOPE's accuracy on this dataset was remarkably stable in the presence of extraneous sequences.</sentence>
					<sentence id="S3.204">Figure 7 shows the aggregate results of this test, with the SCPD regulons divided into three groups based on SCOPE's accuracy on the true regulon.</sentence>
					<sentence id="S3.205">For each set of regulons, SCOPE's performance decayed gradually as increasing numbers of extraneous genes were added to the regulon.</sentence>
					<sentence id="S3.206">These results were consistent with the relationship between the Sig score and performance on synthetic datasets (Figure 2).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S3.207">Robustness of SCOPE performance on S. cerevisiae regulons containing extraneous upstream sequences</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.208">Robustness of SCOPE performance on S. cerevisiae regulons containing extraneous upstream sequences.</sentence>
					<sentence id="S3.209">Increasing quantities of randomly selected upstream regions were added to each regulon.</sentence>
					<sentence id="S3.210">The bold red line is the average across all regulons, while each of the other lines represent the performance of SCOPE on one-third of the total S. cerevisiae regulons.</sentence>
					<sentence id="S3.211">The y-axis shows the average accuracy for each group of regulons.</sentence>
					<sentence id="S3.212">The x-axis shows the ratio of extraneous upstream sequences to bona fide ones.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S3.213">Discussion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.214">The field of motif finding is saturated with a large number of algorithms representing myriad search strategies, objective functions and motif models.</sentence>
					<sentence id="S3.215">Yet remarkably, performance comparisons consistently reveal disappointing performance for motif finders and <xcope id="X3.215.1"><cue type="negation" ref="X3.215.1">fail</cue> to find any statistical significance between them</xcope>.</sentence>
					<sentence id="S3.216">A brief survey of the per-regulon results of these performance comparisons yields two key observations: (1) there are many regulons for which a large number of programs find a small portion of the binding sites (though not necessarily the same portion); and (2) every program has a respectable number of "wins" (i.e. every program is the best existing program on some handful of regulons 145678.</sentence>
					<sentence id="S3.217">Such observations are common in many machine learning applications, and are the direct result of complex search spaces that force restrictions on either the search strategy or the representation of the solution space (in this case, the motif model used to represent the motifs).</sentence>
					<sentence id="S3.218">For example, YMF and RSAT are guaranteed to find the optimal solutions in their motif space (fixed-length motifs with limited degeneracies), but that space is limited to the point that optimality provides no clear advantage over the other methods.</sentence>
					<sentence id="S3.219">Conversely, the PWM-based methods have an apparently more powerful motif model 17, but their search strategies <xcope id="X3.219.1"><cue type="negation" ref="X3.219.1">cannot</cue> guarantee optimality</xcope> and often terminate at local optima.</sentence>
					<sentence id="S3.220">The HLK ensemble method 4 successfully exploits the first key observation above.</sentence>
					<sentence id="S3.221">By running the same (stochastic) algorithm multiple times and using a voting method, those subsequences of the binding sites that are repeatedly reported become clear while the spurious bases are eliminated.</sentence>
					<sentence id="S3.222">Hu and colleagues report that this method increased accuracy and <xcope id="X3.222.2"><cue type="speculation" ref="X3.222.2">proposed</cue> that their approach <xcope id="X3.222.1"><cue type="speculation" ref="X3.222.1">may</cue> prove effective when running different algorithms as well</xcope></xcope> 4.</sentence>
					<sentence id="S3.223">The limitation arises, however, in regulons where only one program has a high accuracy and the others <xcope id="X3.223.1"><cue type="negation" ref="X3.223.1">fail</cue> to find any portion of the binding sites</xcope>.</sentence>
					<sentence id="S3.224">In such cases, it is <xcope id="X3.224.1"><cue type="speculation" ref="X3.224.1">likely</cue> that a voting-based ensemble will follow the crowd and fail to find the true binding site</xcope>.</sentence>
					<sentence id="S3.225">The second observation, that all motif finders win some number of regulons and often perform roughly the same on average, is broadly consistent with a theorem in the Machine Learning field referred to as the No Free Lunch Theorem 1819.</sentence>
					<sentence id="S3.226">Briefly, this theorem states that, averaged over all datasets, the performance of all search algorithms are exactly the same, with the corollary that two algorithms will have the exact same number of wins in relation to each other.</sentence>
					<sentence id="S3.227">In practice, this theorem argues for the use of specialized domain knowledge 20, where available, and <xcope id="X3.227.3"><cue type="speculation" ref="X3.227.3">may</cue> <xcope id="X3.227.2"><cue type="speculation" ref="X3.227.2">suggest</cue> that similar average performance across a diversity of approaches is an <xcope id="X3.227.1"><cue type="speculation" ref="X3.227.1">indication</cue> of the diversity of the datasets themselves</xcope></xcope></xcope>.</sentence>
					<sentence id="S3.228">Thus, motif finders designed for one class of motifs will win on regulons containing those motifs, but will perform poorly on other regulons, while more general motif finders will <xcope id="X3.228.1"><cue type="speculation" ref="X3.228.1">tend</cue> to have more consistently mediocre performance</xcope>.</sentence>
					<sentence id="S3.229">In this light, <xcope id="X3.229.3">SCOPE <cue type="speculation" ref="X3.229.3">can</cue> be seen as leveraging the second key observation by embracing the No Free Lunch Theorem</xcope>:: <xcope id="X3.229.2"><cue type="negation" ref="X3.229.2">rather than</cue> boost average performance by taking the average results of three general purpose algorithms</xcope>, SCOPE uses highly specialized algorithms and <xcope id="X3.229.1"><cue type="speculation" ref="X3.229.1">assumes</cue> each will perform strongly on some regulons and weakly on others (and that the unified scoring metric can tell the difference)</xcope>.</sentence>
					<sentence id="S3.230">The working <xcope id="X3.230.1"><cue type="speculation" ref="X3.230.1">hypothesis</cue> is, in effect, that the local maxima are predictable (corresponding to one of three motif classes) and exploitable</xcope> (we can find the local maxima in each class and choose whichever is higher).</sentence>
					<sentence id="S3.231">Consistent with this hypothesis, there was very little overlap among the component algorithms of SCOPE (each wins about 20 of the 78 regulons, with very few ties) and, by taking the maximum score from those three local maxima, SCOPE tended to choose the motif with the highest accuracy in a clear majority of the cases (66%, compared to 33% for a random learning rule).</sentence>
					<sentence id="S3.232">Furthermore, SCOPE not only significantly outperformed its components on this dataset, it also outperformed a number of general purpose algorithms that seek to find the global maximum in a single pass.</sentence>
					<sentence id="S3.233">Of course, based on the No Free Lunch Theorem, SCOPE's performance averaged over all theoretically possible datasets will still converge to that of the other motif finding approaches (including random guessing).</sentence>
					<sentence id="S3.234">As the physical properties of transcription factors will inevitably constrain the structure of their binding sites, biologically relevant datasets comprise a subset of the space of all theoretically possible sequences.</sentence>
					<sentence id="S3.235">Our test set of 78 regulons was selected in a blinded manner (for details, see Additional file 1, section S3), thus these results <xcope id="X3.235.1"><cue type="speculation" ref="X3.235.1">suggest</cue> the generalizability of SCOPE's use of domain knowledge on biologically relevant datasets from these species</xcope>.</sentence>
					<sentence id="S3.236"><xcope id="X3.236.1">These observations are <cue type="negation" ref="X3.236.1">not</cue> offered as definitive proof that there are only three classes of motifs</xcope>; rather, they show that power can be gained by identifying distinct motif classes and combining specialized algorithms with a unified scoring rule.</sentence>
					<sentence id="S3.237">It is <xcope id="X3.237.2"><cue type="speculation" ref="X3.237.2">possible</cue> that <xcope id="X3.237.1">more power <cue type="speculation" ref="X3.237.1">could</cue> be gained by identifying other distinct motif classes and adding algorithms that explicitly search for those classes</xcope></xcope>.</sentence>
					<sentence id="S3.238">For example, Zinc finger transcription factors have been demonstrated to bind three triplets of nucleotides which overlap at their third base positions 21.</sentence>
					<sentence id="S3.239">This observation <xcope id="X3.239.1"><cue type="speculation" ref="X3.239.1">could</cue> be leveraged by a search algorithm that explicitly searches for motifs matching this unique structure</xcope>.</sentence>
					<sentence id="S3.240">Thus, <xcope id="X3.240.1">all nondegenerate triplets in a set of upstream regions <cue type="speculation" ref="X3.240.1">could</cue> be scored</xcope> and the highest-scoring triplets combined into a single five-mer with a two-base degeneracy (corresponding to the IUPAC characters R,Y, W, S, K or M) at the middle position.</sentence>
					<sentence id="S3.241"><xcope id="X3.241.1">The highest-scoring five-mers <cue type="speculation" ref="X3.241.1">could</cue> then be combined with the highest scoring triplets to generate a seven-mer with two-base degeneracies at positions three and five</xcope>.</sentence>
					<sentence id="S3.242">Provided the appropriate Bonferroni correction is applied for this new class of motifs, <xcope id="X3.242.1">these motifs <cue type="speculation" ref="X3.242.1">may</cue> be easily compared with the results from BEAM, PRISM and SPACER</xcope>, thereby extending the SCOPE ensemble to include a fourth class of motifs.</sentence>
					<sentence id="S3.243">We note, however, that as more methods are added to SCOPE, it will be increasingly difficult to devise a scoring metric that can accurately choose the best result from among the components.</sentence>
					<sentence id="S3.244">SCOPE <xcope id="X3.244.1"><cue type="speculation" ref="X3.244.1">may</cue> also serve as a complementary approach to the HLK method</xcope>.</sentence>
					<sentence id="S3.245">For example, the parameters of many methods can be set to search for specific classes of motifs (such as bipartite versus non-bipartite motifs).</sentence>
					<sentence id="S3.246">Thus, analogous to the ensemble method described in this paper, one <xcope id="X3.246.1"><cue type="speculation" ref="X3.246.1">may</cue> build a hierarchical ensemble that first searches each motif class by the HLK method using a number of algorithms or random restarts, and then uses the SCOPE method to choose the best result from among the motif classes</xcope>.</sentence>
					<sentence id="S3.247">One constraint associated with such an approach is the run-time.</sentence>
					<sentence id="S3.248">A second constraint associated with a hierarchical ensemble learning method is the multiplicative increase in the number of parameters associated with it, though <xcope id="X3.248.1">this problem <cue type="speculation" ref="X3.248.1">may</cue> be ameliorated by the use of parameter-free algorithms that employ restricted search spaces</xcope>.</sentence>
					<sentence id="S3.249">An important factor to consider when taking the best of multiple runs is the relative size of the search space.</sentence>
					<sentence id="S3.250">Certainly to maintain statistical validity, some correction must be made for multiple hypothesis testing.</sentence>
					<sentence id="S3.251">Furthermore, the effects of multiple testing <xcope id="X3.251.1"><cue type="speculation" ref="X3.251.1">may</cue> bias the results in favor of one of the component algorithms</xcope>.</sentence>
					<sentence id="S3.252">To ensure statistical validity and avoid such a bias, we developed a simple Bonferroni-like correction, which penalized every proposed motif proportional to its length and degree of degeneracy, resulting in a modest improvement of 8% in SCOPE's accuracy.</sentence>
					<sentence id="S3.253">Although our test set of 78 regulons gave us enough power to find significance between SCOPE and its components or other algorithms, it did <xcope id="X3.253.1"><cue type="negation" ref="X3.253.1">not</cue> provide enough power to disentangle the effects of small improvements (such as the Bonferroni correction, the objective function that takes position bias into account, or scoring motifs based off one or both strands), especially in the rigorous cross-validation framework necessary to decipher precisely which aspects contribute significantly to the performance</xcope>.</sentence>
					<sentence id="S3.254">Nevertheless, as larger datasets become available, SCOPE's efficient search strategy makes it an ideal platform for exploring the effect of focused improvements to the motif finding approach described, such as the use of complex objective functions that <xcope id="X3.254.1"><cue type="speculation" ref="X3.254.1">may</cue> better approximate biological significance</xcope>.</sentence>
					<sentence id="S3.255">The comparisons to other motif finding programs in this study are provided to place SCOPE's performance in the broader context of the motif finding field, particularly when viewed from the standpoint of the practicing "bench" biologist.</sentence>
					<sentence id="S3.256">Any performance comparison must be interpreted with caution, since the results are highly dependent on the dataset used, the conditions of the testing and the metrics used for evaluation.</sentence>
					<sentence id="S3.257">With this in mind, we sought to evaluate a wide representation of motif finders on a large number of regulons using performance metrics consistent with previous studies 67.</sentence>
					<sentence id="S3.258">To the best of our knowledge, this dataset represents the largest set of biologically relevant regulons used for performance comparisons to date.</sentence>
					<sentence id="S3.259">Whereas previous performance comparisons attempt to optimize the parameters of the programs in question 467 or allow expert users to tune their own programs and manually filter both the input and output 5 we intentionally made our comparisons between programs <xcope id="X3.259.1"><cue type="negation" ref="X3.259.1">without</cue> manually optimizing any parameters for any species so as to emulate typical use conditions</xcope>.</sentence>
					<sentence id="S3.260">Our comparison thus complements the recent large scale study of Tompa et al., who gauge performance under optimal conditions on semi-synthetic data sets 5, as well as the study of Hu et al., who explore the effect of parameter optimization on a handful of popular motif finders 4.</sentence>
					<sentence id="S3.261">Although the present philosophy of performance comparison implicitly benefits SCOPE, which has <xcope id="X3.261.1"><cue type="negation" ref="X3.261.1">no</cue> parameters to optimize</xcope>, it is arguably the most relevant comparison possible for the typical biologist.</sentence>
					<sentence id="S3.262">Although previous studies have shown the <xcope id="X3.262.1"><cue type="speculation" ref="X3.262.1">potential</cue> importance of choosing parameters carefully</xcope> 46, we note that the results we obtained under default settings were quite similar to those reported in previous studies (for details, see Additional file 1, section S3).</sentence>
					<sentence id="S3.263">Arguably, systematic parameter optimization for each of these programs <xcope id="X3.263.1"><cue type="speculation" ref="X3.263.1">may</cue> well yield higher accuracy scores than those reported here</xcope>.</sentence>
					<sentence id="S3.264">However, in order to avoid the pitfall of overfitting to the dataset, such parameter optimization must be performed using cross-validation or some other resampling technique 92223.</sentence>
					<sentence id="S3.265">We note that all the motif finders tested, including SCOPE, performed poorly on the Drosophila dataset.</sentence>
					<sentence id="S3.266">Although SCOPE had the highest accuracy on this dataset, that accuracy was significantly less than on the bacterial and yeast data.</sentence>
					<sentence id="S3.267">Especially poor performance on Drosophila was also reported in the Tompa et al. performance comparison, <xcope id="X3.267.2"><cue type="speculation" ref="X3.267.2">indicating that</cue> <xcope id="X3.267.1">this difficulty is <cue type="negation" ref="X3.267.1">not</cue> limited to the current dataset</xcope></xcope> 5.</sentence>
					<sentence id="S3.268">One <xcope id="X3.268.1"><cue type="speculation" ref="X3.268.1">possible</cue> cause of poor performance in this study is that the ""regulons"" are derived from enhancer regions defined in an earlier computational paper</xcope> 24.</sentence>
					<sentence id="S3.269">Whereas a background set of promoter regions is easy to identify, it is <xcope id="X3.269.1"><cue type="speculation" ref="X3.269.1">not clear</cue> how to define a reasonable genomic sample of enhancers</xcope>.</sentence>
					<sentence id="S3.270">Thus, the background sequences used by SCOPE and the other programs <xcope id="X3.270.2"><cue type="speculation" ref="X3.270.2">may</cue> <xcope id="X3.270.1"><cue type="negation" ref="X3.270.1">not</cue> be representative of the ""true"" background model of enhancers, leading to inaccurate statistics</xcope></xcope>.</sentence>
					<sentence id="S3.271">The persistently poor performance of motif finders on Drosophila regulons thus highlights the importance of using well-defined background sequences to calibrate the statistics of the objective functions being optimized.</sentence>
					<sentence id="S3.272">Recently, algorithms have been reported that predict enhancer regions on a genome wide scale [2425262728].</sentence>
					<sentence id="S3.273">It is <xcope id="X3.273.2"><cue type="speculation" ref="X3.273.2">possible</cue> that using such algorithms to define a collection of background enhancer sequences <xcope id="X3.273.1"><cue type="speculation" ref="X3.273.1">may</cue> improve the performance of SCOPE, as well as that of the other motif finders, on Drosophila</xcope></xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S3.274">Conclusion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.275">Ensemble methods hold the potential for providing improvements in motif finding accuracy <xcope id="X3.275.1"><cue type="negation" ref="X3.275.1">without</cue> resorting to the use of additional data (such as phylogenetic information or characterization of the domain structure of the transcription factor), which are not always available</xcope>.</sentence>
					<sentence id="S3.276">Typically, ensemble learning methods are plagued with certain liabilities, such as increased runtimes, logistical complexity and a multiplicity of nuisance parameters, all of which grow with the number of programs run.</sentence>
					<sentence id="S3.277">In the machine learning field, ensemble methods have coexisted for many years with non-ensemble methods, with no clear superiority having been established between the two.</sentence>
					<sentence id="S3.278">SCOPE serves as a proof-of-concept, demonstrating an efficient and effective approach to ensemble-based motif finding.</sentence>
					<sentence id="S3.279">By dividing the search space into tractable domains, SCOPE mitigates the potential liabilities associated with ensemble methods, resulting in a program that is capable of finding cis-regulatory elements of arbitrary length, degree of degeneracy, motif orientation and frequency of occurrence.</sentence>
					<sentence id="S3.280">Its strong performance, rapid runtime and freedom from nuisance parameters make it a simple and effective tool for the biologist.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S3.281">Methods</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S3.282">Accuracy, Sensitivity and Positive Predictive Value</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.283">Each algorithm's accuracy for each regulon was measured via the Phi score (also referred to as nucleotide level performance coefficient, or nPC, in previous performance comparisons 45611.</sentence>
					<sentence id="S3.284">This metric, first proposed by Pevzner and Sze 29, measures the degree of overlap between the actual instances of two motifs (or sets of motifs) m1 and m2 in the set of co-regulated upstream sequences.</sentence>
					<sentence id="S3.285">The Phi score can be defined as follows: let U be a unique numbering of all the bases in the upstream sequences of a given gene set, and IU(m) &#8838; U be the set of bases that are covered by actual instances of m in U.</sentence>
					<sentence id="S3.286">Phi is then defined as the ratio of the number of bases occupied by the actual instances of both the motifs, to the total number of bases occupied by the actual instances of either of the two motifs:</sentence>
					<sentence id="S3.287">&#934;U(m1, m2) = [IU(m1) &#8745; IU(m2)]/[IU(m1) &#8746; IU(m2)].</sentence>
					<sentence id="S3.288">This metric therefore takes both false positives and false negatives into account at the level of the individual bases that are actually covered by the motif.</sentence>
					<sentence id="S3.289">As in Hu et al. 4, we define accuracy to be the Phi score between the known and predicted binding sites.</sentence>
					<sentence id="S3.290">Changing the denominator of the Phi equation to be IU(mi) yields the sensitivity (if mi represents the true binding sites) or the positive predictive value (PPV, if mi represents the reported binding sites).</sentence>
					<sentence id="S3.291">See Additional file 1, section S3, for a discussion on the use of Phi score for measuring accuracy.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S3.292">Objective functions for Statistical Significance</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.293">In line with other motif finders, we have used statistical significance as a surrogate for biological significance.</sentence>
					<sentence id="S3.294">Since <xcope id="X3.294.2">the latter <cue type="negation" ref="X3.294.2">cannot</cue> be defined <xcope id="X3.294.1"><cue type="negation" ref="X3.294.1">without</cue> data that obviates the need for computational motif finding</xcope></xcope>, objective functions that approximate biological significance are critical.</sentence>
					<sentence id="S3.295">In this section, we detail the objective functions we used and their effect on SCOPE's performance.</sentence>
					<sentence id="S3.296">For any motif m, each objective function provides a definition for p(m), the probability of observing a motif with the same sufficient statistics as m assuming a particular null model.</sentence>
					<sentence id="S3.297">This p-value is used in the computation of the Sig score (see Results).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S3.298">Overrepresentation</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.299">The most common statistical test in motif finding is based on overrepresentation, which can be roughly defined as the probability that a motif m that is observed C(m) times in the regulon would occur at least C(m) times in a random collection of the same number of genes.</sentence>
					<sentence id="S3.300">In the context of consensus motifs, overrepresentation is expressed in terms of a multinomial model, in which each position i in each gene j is a random variable Xij that can take on any motif allowed by the particular motif model.</sentence>
					<sentence id="S3.301">The probability of seeing m at least C(m) times in the regulon can be approximated by the Poisson distribution: p(m) = &#8721;k&#8805;C(m) [(&#955;ke-&#955;)/k! ] where &#955; is the expectation of C(m) with respect to the null motif distribution and the number of positions in the regulon.</sentence>
					<sentence id="S3.302">A detailed justification of this approach was given by Carlson et al. 11.</sentence>
					<sentence id="S3.303">The expectation &#955; is most accurately modeled using Maximum Likelihood Estimators (MLEs) computed as the actual proportion of any given motif in the complete set of all upstream sequences in the genome 10.</sentence>
					<sentence id="S3.304">These MLEs are implemented as lookups of exact substrings, which can be performed efficiently using a suffix array data structure 101112.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S3.305">Coverage</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.306">A simple modification to the overrepresentation objective function is coverage, which is identical to overrepresentation with the modification that C(m) is the number of upstream regions in the regulon that have one or more instances of m and &#955;, the expectation of C(m), is determined from the proportion of upstream regions in the genome that contain the motif.</sentence>
					<sentence id="S3.307">While this objective function prevents a single upstream region from dominating a motif's score, it <xcope id="X3.307.2"><cue type="negation" ref="X3.307.2">fails</cue> to account for multiple instances of a binding site in a single gene that <xcope id="X3.307.1"><cue type="speculation" ref="X3.307.1">may</cue> arise due to cooperative binding</xcope></xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S3.308">Positional bias</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.309">Transcription factors often require their binding sites to be located in a restricted range relative to the start of transcription.</sentence>
					<sentence id="S3.310">One well known example is TBP (TATA-binding protein), which localizes the RNA polymerase complex by binding the TATA-box motif roughly 25 bases upstream of the transcription start site 30.</sentence>
					<sentence id="S3.311">While other examples of binding sites with positional restrictions are well known, few motif finders incorporate position in their scoring function.</sentence>
					<sentence id="S3.312">In the case where all upstream regions are the same length, the Kolmogorov-Smirnov (KS) statistic provides a natural test for positional bias.</sentence>
					<sentence id="S3.313">The Kolmogorov-Smirnov (KS) statistic is a non-parametric statistic that measures the probability that two samples are drawn from the same distribution.</sentence>
					<sentence id="S3.314">Let X be the sample that we wish to compare to some reference sample Y.</sentence>
					<sentence id="S3.315">The KS statistic is defined to be the maximum absolute difference between the unbiased cumulative distribution functions of X and Y.</sentence>
					<sentence id="S3.316">The KS statistic has a well-defined distribution from which a p-value can be easily computed.</sentence>
					<sentence id="S3.317">Kuiper's variation was used to increase sensitivity in the tails of the distribution 31.</sentence>
					<sentence id="S3.318">In the context of motifs, we defined the test sample X for a motif m to be the set of starting positions (with respect to transcription start sites) of m in the regulon.</sentence>
					<sentence id="S3.319">The reference sample Y is defined as the set of starting positions of m in all upstream regions in the genome.</sentence>
					<sentence id="S3.320">Thus, pKS(m) is a measure of how m is localized differently in the regulon than in the genome as a whole.</sentence>
					<sentence id="S3.321">It is also <xcope id="X3.321.2"><cue type="speculation" ref="X3.321.2">possible</cue> to define Y as the uniform distribution</xcope>; however, we found that many motifs had non-uniform distributions throughout all upstream regions of the genome, <xcope id="X3.321.1"><cue type="speculation" ref="X3.321.1">possibly</cue> as an artifact of the non-uniform AT/CG distributions in upstream regions</xcope> 32.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S3.322">Combining overrepresentation and positional bias</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.323">Since overrepresentation and KS are independent, the probabilities can simply be multiplied together to yield the probability of randomly sampling a motif with a given degree of overrepresentation and positional bias.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S3.324">Motif orientation</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S3.325">Many transcription factors will bind motifs on either DNA strand.</sentence>
					<sentence id="S3.326">Others, such as the general transcription factor TBP (TATA-Binding Protein), require a specific orientation and will only function if bound to motifs on a specific DNA strand 30.</sentence>
					<sentence id="S3.327">In scoring a motif m, a choice must therefore be made as to <xcope id="X3.327.2"><cue type="speculation" ref="X3.327.2">whether or not</cue> <xcope id="X3.327.1">the reverse complement mR of m will be <cue type="speculation" ref="X3.327.1">considered</cue> to be the same motif as m</xcope></xcope>.</sentence>
					<sentence id="S3.328">Most programs assume motif orientation does <xcope id="X3.328.1"><cue type="negation" ref="X3.328.1">not</cue> matter</xcope> and so define m = mR.</sentence>
					<sentence id="S3.329">Such an assumption <xcope id="X3.329.1"><cue type="speculation" ref="X3.329.1">may</cue> be overly generous</xcope> &#8211; as the TBP example above makes clear, the transcriptional machinery of a cell is clearly able to differentiate between the two strands.</sentence>
					<sentence id="S3.330">We thus chose to attach a flag to each motif, indicating <xcope id="X3.330.2"><cue type="speculation" ref="X3.330.2">whether or not</cue> <xcope id="X3.330.1">the motif <cue type="speculation" ref="X3.330.1">should</cue> be orientation-neutral</xcope></xcope>.</sentence>
					<sentence id="S3.331">BEAM and SPACER thus enumerate and evaluate all motifs with both values of this flag.</sentence>
					<sentence id="S3.332">SCOPE reports that orientation does matter (i.e. m &#8800; mR) for 17% of the regulons in our biological test set.</sentence>
				</DocumentPart>
		</Document>
		<Document type="Biological_full_article">
			<DocID type="BMC_ID">1471-2105-8-259</DocID>
				<DocumentPart type="Title">
					<sentence id="S4.1">Reuse of structural domain&#8211;domain interactions in protein networks</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.2">Abstract</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S4.3">Background</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.4"><xcope id="X4.4.1">Protein interactions are <cue type="speculation" ref="X4.4.1">thought</cue> to be largely mediated by interactions between structural domains</xcope>.</sentence>
					<sentence id="S4.5">Databases such as iPfam relate interactions in protein structures to known domain families.</sentence>
					<sentence id="S4.6">Here, we investigate how the domain interactions from the iPfam database are distributed in protein interactions taken from the HPRD, MPact, BioGRID, DIP and IntAct databases.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S4.7">Results</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.8">We find that known structural domain interactions can only explain a subset of 4&#8211;19% of the available protein interactions, nevertheless this fraction is still significantly bigger than expected by chance.</sentence>
					<sentence id="S4.9">There is a correlation between the frequency of a domain interaction and the connectivity of the proteins it occurs in.</sentence>
					<sentence id="S4.10">Furthermore, a large proportion of protein interactions can be attributed to a small number of domain interactions.</sentence>
					<sentence id="S4.11">We conclude that many, but not all, domain interactions constitute reusable modules of molecular recognition.</sentence>
					<sentence id="S4.12">A substantial proportion of domain interactions are conserved between E. coli, S. cerevisiae and H. sapiens.</sentence>
					<sentence id="S4.13">These domains are related to essential cellular functions, <xcope id="X4.13.1"><cue type="speculation" ref="X4.13.1">suggesting</cue> that many domain interactions were already present in the last universal common ancestor</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S4.14">Conclusion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.15">Our results support the concept of domain interactions as reusable, conserved building blocks of protein interactions, but also highlight the limitations currently imposed by the small number of available protein structures.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S4.16">Background</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.17">One way to understand a protein's function is to look at its composition of conserved domains.</sentence>
					<sentence id="S4.18">Such families of related sequence regions, collected in the Pfam database 1, usually constitute structurally and functionally conserved modules.</sentence>
					<sentence id="S4.19">It is <xcope id="X4.19.1"><cue type="speculation" ref="X4.19.1">assumed</cue> that binding interfaces, too, are conserved evolutionary modules that are reused between proteins of different functions and retained during evolution</xcope> 23.</sentence>
					<sentence id="S4.20">Therefore, domain&#8211;domain interactions are often regarded as the currency of protein&#8211;protein interactions.</sentence>
					<sentence id="S4.21">Based on this assumption, Ng et al. described an approach to predict domain&#8211;domain interactions using literature curation, evolutionary history and the distribution of domains in protein interactions 4.</sentence>
					<sentence id="S4.22">Wuchty et al. compared the relationship between this set of predicted interacting domain pairs to the domain coocurrence network 5.</sentence>
					<sentence id="S4.23">More recently, other groups have come up with sophisticated statistical methods to estimate <xcope id="X4.23.2"><cue type="speculation" ref="X4.23.2">putatively</cue> interacting domain pairs</xcope>, based on the <xcope id="X4.23.1"><cue type="speculation" ref="X4.23.1">assumption</cue> of domain reusability</xcope> 678910.</sentence>
					<sentence id="S4.24">However, <xcope id="X4.24.1"><cue type="negation" ref="X4.24.1">none</cue> of these approaches offers structural evidence that the predicted domain pairs are able to form an interaction</xcope>.</sentence>
					<sentence id="S4.25">For complexes with known structure, it has been shown that domains can mediate interactions 1112.</sentence>
					<sentence id="S4.26">Such interactions between pairs of domains are stored in the iPfam database 13.</sentence>
					<sentence id="S4.27">The structural evidence <xcope id="X4.27.2"><cue type="speculation" ref="X4.27.2">lends strong support</cue> to the <xcope id="X4.27.1"><cue type="speculation" ref="X4.27.1">inferred</cue> domain pair</xcope></xcope>, resulting in a high confidence set of domain pairs.</sentence>
					<sentence id="S4.28">Unfortunately, the selection of complexes in the PDB database of protein structures 14 is rather small and biased 15.</sentence>
					<sentence id="S4.29">There is often only a single structure that shows a certain protein pair to interact, while other complexes like haemoglobin have been crystalized dozens of times.</sentence>
					<sentence id="S4.30">This makes it difficult to assess <xcope id="X4.30.1"><cue type="speculation" ref="X4.30.1">whether</cue> some domain pairs act as reusable modules in protein interactions from PDB data alone</xcope>.</sentence>
					<sentence id="S4.31">High-throughput experiments 161718 and extensive literature curation efforts 19 have yielded large databases of protein interactions 2021222324.</sentence>
					<sentence id="S4.32">Despite the continuing growth of protein interaction databases, even <xcope id="X4.32.1">the best studied protein interaction network of S. cerevisiae is <cue type="speculation" ref="X4.32.1">thought</cue> to be incomplete and inaccurate</xcope> 252627.</sentence>
					<sentence id="S4.33">Given that this network already comprises around 60000 interactions, questions arise as to how such networks have evolved and how they are organised.</sentence>
					<sentence id="S4.34">Furthermore, methods for assessing the quality of high-throughput experimental results are in high demand due to the error prone nature of the methods used.</sentence>
					<sentence id="S4.35">In this study, we investigate how pairs of protein families taken from iPfam are distributed in experimental protein interactions from five major model species.</sentence>
					<sentence id="S4.36">This allows us to <xcope id="X4.36.1"><cue type="speculation" ref="X4.36.1">address a number of questions</cue>:: what proportion of each organism's protein interaction network, its interactome, can be attributed to a known domain&#8211;domain interaction</xcope>?</sentence>
					<sentence id="S4.37">How conserved are domain&#8211;domain pairs between species, and <xcope id="X4.37.1">how many interacting domain pairs are still <cue type="speculation" ref="X4.37.1">unknown</cue></xcope>?</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S4.38">Results</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.39">iPfam domain pairs are overrepresented in experimental protein interactions</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.40">We analysed the distribution of Pfam families known to interact from a PDB structure (iPfam domain pairs) in experimentally derived protein interactions (experimental interactions).</sentence>
					<sentence id="S4.41">The experimental interactions were filtered to only include interactions with exactly two partners (see Methods).</sentence>
					<sentence id="S4.42">The fraction of experimental interactions that contain at least one iPfam domain pair is referred to as the iPfam coverage.</sentence>
					<sentence id="S4.43">Accordingly, the fraction of experimental interactions that contains any pair of Pfam domains (excluding the iPfam domain pairs) is called the Pfam coverage.</sentence>
					<sentence id="S4.44">Figure 1 shows the Pfam and iPfam coverage for the analysed species as a column chart.</sentence>
					<sentence id="S4.45">The number of resolved protein interactions varies greatly between species, as does the size of the underlying proteome (see Table 1).</sentence>
					<sentence id="S4.46">The Pfam coverage, coloured red in Figure 1, lies between 49.46% and 66.73%.</sentence>
					<sentence id="S4.47">Given that 74% of all UniProt proteins contain at least one Pfam match, this is <xcope id="X4.47.1"><cue type="negation" ref="X4.47.1">not</cue> by itself surprising</xcope>.</sentence>
					<sentence id="S4.48">The iPfam coverage, shown in blue, is much smaller, ranging from 2.92% in D. melanogaster to 19.02% in H. sapiens.</sentence>
					<sentence id="S4.49">In S. cerevisiae the species with the most comprehensively studied interactome, the iPfam coverage is 4.47%.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S4.50">Comparison of coverage of iPfam domain pairs on protein interactions</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.51">Comparison of coverage of iPfam domain pairs on protein interactions.</sentence>
					<sentence id="S4.52">For each species, the height of the column reflects the number of known protein&#8211;protein interactions in the data set.</sentence>
					<sentence id="S4.53">The columns are split according to the proportion of interactions that contain an iPfam domain pair (blue), that contain any other Pfam domains on both proteins (red), and those that contain <xcope id="X4.53.1"><cue type="negation" ref="X4.53.1">no</cue> Pfam domain pair</xcope> (yellow).</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S4.54">iPfam domain pair coverage on protein interactions</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.55">For each species, we list the size of the proteome as defined in Integr8 and the fraction of this proteome that is represented in the protein interaction sets, followed by the total number of binary protein interactions and the fraction of those that contain an iPfam domain pair.</sentence>
					<sentence id="S4.56">The last two columns show the results of the network shuffling experiments.</sentence>
					<sentence id="S4.57">The relatively low iPfam coverage is by itself a disappointing finding.</sentence>
					<sentence id="S4.58">However, the fact that only a small fraction of protein interactions contain known domain pairs <xcope id="X4.58.1"><cue type="speculation" ref="X4.58.1">could</cue> be a result of the scarcity of available structures of protein complexes</xcope>.</sentence>
					<sentence id="S4.59">Therefore, we asked <xcope id="X4.59.3"><cue type="speculation" ref="X4.59.3">whether</cue> the observed iPfam coverage is larger than <xcope id="X4.59.2"><cue type="speculation" ref="X4.59.2">would</cue> be <xcope id="X4.59.1"><cue type="speculation" ref="X4.59.1">expected</cue> by chance</xcope></xcope></xcope>.</sentence>
					<sentence id="S4.60">To test this, we created 1000 random networks per species using the algorithm described in Methods.</sentence>
					<sentence id="S4.61">We then calculated the iPfam coverage on the protein interactions in each randomised network.</sentence>
					<sentence id="S4.62">Mean and standard deviations of the randomisation experiments are shown in Table 1.</sentence>
					<sentence id="S4.63"><xcope id="X4.63.1"><cue type="negation" ref="X4.63.1">No</cue> P value (see Methods) was greater than 1.84</xcope> &#183; 10-06.</sentence>
					<sentence id="S4.64">This proves that the observed iPfam coverage is significantly higher than expected and iPfam domain pairs are enriched in real experimental protein interactions.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.65">Few iPfam domain pairs are responsible for a majority of the coverage</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.66">To understand why iPfam domain pairs occur more often in experimental interactions than expected by chance, we analysed the two largest data sets, S. cerevisiae and H. sapiens in more detail.</sentence>
					<sentence id="S4.67">In the following paragraph, we will call the experimental interactions that contain an iPfam domain pair the covered experimental interactions.</sentence>
					<sentence id="S4.68">In Figure 2, we compare the distribution of iPfam domain pairs on the number of experimental interactions for E. coli, S. cerevisiae and H. sapiens.</sentence>
					<sentence id="S4.69">This plot reflects how many iPfam domain pairs cover how many experimental interactions.</sentence>
					<sentence id="S4.70">Domain pairs that cluster to the left of the plot can be called specific domain pairs, as they only occur in very few covered experimental interactions.</sentence>
					<sentence id="S4.71">Conversely, domain pairs that cluster to the right of the plot occur in a large number of different covered experimental interactions and can be called promiscuous domain pairs.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S4.72">Frequencies of iPfam domain pairs in E. coli, S. cerevisiae and H. sapiens protein interactions</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.73">Frequencies of iPfam domain pairs in E. coli, S. cerevisiae and H. sapiens protein interactions.</sentence>
					<sentence id="S4.74">Each point in this graph represents a set of protein interactions.</sentence>
					<sentence id="S4.75">The abscissa reflects the number of interactions in each set that contain the same iPfam domain pair.</sentence>
					<sentence id="S4.76">The ordinate shows the number of distinct such sets, each defined by a different iPfam domain pair.</sentence>
					<sentence id="S4.77">In both H. sapiens (blue) and S. cerevisiae (green) a small number of iPfam domain pairs covers a large fraction of the interactome, whereas in E. coli, <xcope id="X4.77.1"><cue type="negation" ref="X4.77.1">no</cue> iPfam domain occurs in more than 4 experimental interactions at a time</xcope>.</sentence>
					<sentence id="S4.78">Dotted lines denote fitted monomial functions, showing that the distributions follow a power law.</sentence>
					<sentence id="S4.79">All three distributions in Figure 2 resemble a power law distribution, according to the good fit of log-linear functions (log(f(x)) = k log x + log a) shown as dotted lines.</sentence>
					<sentence id="S4.80">The slopes k of the H. sapiens and S. cerevisiae distributions are very similar (-1.53 and -1.60, respectively), while E. coli has a markedly smaller slope (-2.78).</sentence>
					<sentence id="S4.81">This <xcope id="X4.81.1"><cue type="speculation" ref="X4.81.1">suggests</cue> that the ratio of specific to promiscuous iPfam domain pairs is very similar in S. cerevisiae and H. sapiens, whereas E. coli features fewer multiply reoccurring iPfam domain pairs</xcope>.</sentence>
					<sentence id="S4.82">The power law distribution of iPfam frequencies <xcope id="X4.82.2"><cue type="speculation" ref="X4.82.2">implies</cue> that the majority of covered protein interactions <xcope id="X4.82.1"><cue type="speculation" ref="X4.82.1">can</cue> be attributed to a minority of iPfam domain pairs</xcope></xcope>.</sentence>
					<sentence id="S4.83">51.7% of the iPfam domain pairs in S. cerevisiae and 45.3% in H. sapiens are seen in just one experimental interaction.</sentence>
					<sentence id="S4.84">Conversely, 92.4% of H. sapiens and 85.4% of S. cerevisiae covered experimental interactions contain an iPfam domain pair that occurs more than once.</sentence>
					<sentence id="S4.85">Even more, half of the covered experimental interactions in H. sapiens contain an iPfam domain pair that occurs in more than 16 different experimental interactions (5 for S. cerevisiae).</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.86">Degree distribution and iPfam domain pair frequency are correlated</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.87">We <xcope id="X4.87.3"><cue type="speculation" ref="X4.87.3">reasoned</cue> that if there are iPfam domain pairs that act as reusable modules in protein interactions, then <xcope id="X4.87.1"><xcope id="X4.87.2">highly connected proteins <cue type="speculation" ref="X4.87.1">should</cue> also be more <cue type="speculation" ref="X4.87.1">likely</cue> to contain promiscuous iPfam domain pairs and vice-versa</xcope></xcope></xcope>.</sentence>
					<sentence id="S4.88">For each node (i.e. protein) in the filtered H. sapiens and S. cerevisiae protein interaction network, we calculated its degree, defined as the number of adjacent edges (i.e. interactions).</sentence>
					<sentence id="S4.89">At the same time, we counted the number of iPfam domain pairs on the adjacent edges.</sentence>
					<sentence id="S4.90">In Figure 3, we plot the mean number of iPfam domain pairs relative to the degree of the node.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S4.91">Average frequency of iPfam domain pairs relative to degree of node</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.92">Average frequency of iPfam domain pairs relative to degree of node.</sentence>
					<sentence id="S4.93">Each point represents a protein in the interaction networks of H. sapiens (blue) and S. cerevisiae (green).</sentence>
					<sentence id="S4.94">For each protein, we calculate the degree, defined as the number of interactions the protein is involved in.</sentence>
					<sentence id="S4.95">On the y-axis, we show the average number of iPfam domain pairs in edges adjacent to proteins of degree x.</sentence>
					<sentence id="S4.96">We calculated a Spearman correlation of 0.68 and 0.71, for H. sapiens and S. cerevisiae.</sentence>
					<sentence id="S4.97">The correlation is outlined by dotted lines.</sentence>
					<sentence id="S4.98">We find that for proteins from a degree of 1 to 50, there is strong correlation in both H. sapiens and S. cerevisiae (Spearman correlation coefficients of 0.68 and 0.71, respectively) between degree and number of iPfam domain pairs on adjacent edges.</sentence>
					<sentence id="S4.99">For the 1.2% of proteins in H. sapiens and 6.4% in S. cerevisiae which have a degree higher than 50, the correlation gradually diminishes.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.100">Promiscuous domain pairs</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.101">Additional file 1 contains a list of all iPfam domain pairs and their frequencies in the experimental protein interactions, while Additional file 4 lists the frequencies of the single domains.</sentence>
					<sentence id="S4.102">Interactions between protein kinase domains (Pkinase, Pfam acc. PF00069 and Pkinase_Tyr, Pfam acc. PF07714) are the most frequent iPfam domain pairs, as well as interactions involving recognition domains such as SH2 or SH3.</sentence>
					<sentence id="S4.103">In S. cerevisiae, the Proteasome family (Pfam acc. PF00227, a family of peptidases) and WD40 (Pfam acc. PF00400, a repeat involved in multimer assembly) are also amongst the five most frequent iPfam domain pairs.</sentence>
					<sentence id="S4.104">As expected, <xcope id="X4.104.1">more frequent domains are also more <cue type="speculation" ref="X4.104.1">likely</cue> to be found as pairs in interacting proteins</xcope>.</sentence>
					<sentence id="S4.105">It should be noted however that in the PDB structures, some of the observed domain pairs (Pkinase_Tyr &#8596; SH3_1, Pkinase_C &#8596; Pkinase and others) are only seen to interact within one protein (intrachain interactions) as opposed to interactions between two distinct proteins (interchain interaction).</sentence>
					<sentence id="S4.106">The table in Additional file 5 lists the number of PDB structures for each iPfam domain pair, distinguishing between intrachain and interchain interactions.</sentence>
					<sentence id="S4.107">Looking for example at the covered experimental interactions in H. sapiens(Additional file 1), only 8 out of the 100 most frequent iPfam domain pairs are seen in intrachain interactions exclusively, while 61 are exclusive to interchain interactions and 31 are seen in both.</sentence>
					<sentence id="S4.108">A <xcope id="X4.108.1"><cue type="speculation" ref="X4.108.1">possible</cue> explanation for the occurrence of purely intrachain iPfam domain pairs in the covered experimental interactions is that they frequently cooccur together on the same protein with other iPfam domain pairs</xcope>.</sentence>
					<sentence id="S4.109">A list of all combinations of iPfam domains (the domain architecture) on interacting proteins is given in Additional file 2.</sentence>
					<sentence id="S4.110">It reveals that certain iPfam domains such as SH2, SH3_1 or Pkinase_tyr frequently occur in the same architecture.</sentence>
					<sentence id="S4.111"><xcope id="X4.111.3"><cue type="negation" ref="X4.111.3">Without</cue> further experiments</xcope>, we <xcope id="X4.111.1"><cue type="speculation" ref="X4.111.1">cannot assign</cue> the correct interacting domains <cue type="speculation" ref="X4.111.1">with certainty</cue></xcope>.</sentence>
					<sentence id="S4.112">This highlights a basic assumption of this study that <xcope id="X4.112.1"><cue type="speculation" ref="X4.112.1">could</cue> be a source of error</xcope>.</sentence>
					<sentence id="S4.113">We <xcope id="X4.113.1"><cue type="speculation" ref="X4.113.1">assume</cue> that interacting proteins that contain an iPfam domain pair interact through these domains</xcope>.</sentence>
					<sentence id="S4.114">This, of course, is not necessarily the case.</sentence>
					<sentence id="S4.115">Although it has been shown that sequence similarity is linked to the mode of interaction 28, not every protein interaction that contains an iPfam domain pair is necessarily mediated by exactly this domain pair.</sentence>
					<sentence id="S4.116">To gain a rough estimate of the false positive rate due to this assumption, we counted how many protein pairs in the PDB contain an iPfam domain pair that does <xcope id="X4.116.1"><cue type="negation" ref="X4.116.1">not</cue> mediate an interaction in one complex structure</xcope> but does so in another.</sentence>
					<sentence id="S4.117">3671 out of a total of 5380 interacting protein pairs from the PDB contain an iPfam domain pair that does <xcope id="X4.117.1"><cue type="negation" ref="X4.117.1">not</cue> interact in one complex structure</xcope> but does so in another.</sentence>
					<sentence id="S4.118">This means that for more than 32% of the protein interactions in the PDB, the iPfam domain pair assignment is correct.</sentence>
					<sentence id="S4.119">For the remaining 68%, the iPfam domain pair assignments are wrong in one case but correct in another.</sentence>
					<sentence id="S4.120"><xcope id="X4.120.3">The real false positive rate is <cue type="speculation" ref="X4.120.3">likely</cue> to be smaller</xcope>, because some iPfam domain pairs <xcope id="X4.120.2"><cue type="speculation" ref="X4.120.2">might</cue> still independently mediate an interaction with a different, <xcope id="X4.120.1"><cue type="speculation" ref="X4.120.1">possibly</cue> unknown</xcope>, partner protein</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.121">iPfam domain pairs are enriched in S. cerevisiae complexes</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.122">We tested <xcope id="X4.122.1"><cue type="speculation" ref="X4.122.1">whether</cue> iPfam domain pairs are enriched in known protein complexes from S. cerevisiae</xcope>.</sentence>
					<sentence id="S4.123">This is interesting firstly because <xcope id="X4.123.1">domain&#8211;domain interactions are <cue type="speculation" ref="X4.123.1">thought</cue> to be more common in obligate interactions</xcope>.</sentence>
					<sentence id="S4.124">Secondly, the described modularity of known S. cerevisiae complexes lends support to the <xcope id="X4.124.1"><cue type="speculation" ref="X4.124.1">assumption</cue> that the underlying iPfam domain pairs are modular</xcope>.</sentence>
					<sentence id="S4.125">In fact, we find a two-fold enrichment for iPfam domain pairs in the complexes described by Gavin et al. 29.</sentence>
					<sentence id="S4.126">From the 294 binary protein interactions in this data set, 24 contained an iPfam domain pair, which corresponds to a coverage of 8.16% (P value 2.7 &#183; 10-47).</sentence>
					<sentence id="S4.127">We also analysed the full dataset of protein complexes.</sentence>
					<sentence id="S4.128">From 491 complexes described by Gavin et al., 157 contained at least one pair of proteins with an iPfam domain pair (31.9%).</sentence>
					<sentence id="S4.129">In total we found 617 pairs of proteins that contained an iPfam domain pair.</sentence>
					<sentence id="S4.130">Interestingly, we find that the distribution of iPfam domain pairs on complexes is uneven.</sentence>
					<sentence id="S4.131">When we drew 617 protein pairs randomly from all possible protein pairs in the complexes, we covered 192 complexes on average, with a standard deviation of 7.22.</sentence>
					<sentence id="S4.132">The probability of covering only 157 complexes is just 6.24 &#183; 10-07.</sentence>
					<sentence id="S4.133">Thus, some complexes contain a greater number of iPfam domain pairs, while other complexes do <xcope id="X4.133.1"><cue type="negation" ref="X4.133.1">not</cue> contain any at all</xcope>.</sentence>
					<sentence id="S4.134">This <xcope id="X4.134.1"><cue type="speculation" ref="X4.134.1">suggests</cue> that some sets of domain pairs are specific to certain complexes or pathways</xcope>.</sentence>
					<sentence id="S4.135">Typical examples are the RNA polymerase II complex (IntAct id: EBI-815049) or the U1 snRNP complex which contain numerous iPfam domain pairs that are specific to these complexes.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.136">iPfam domain pairs are conserved between species</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.137">Within the 3 to 19% of experimental interactions covered by iPfam, we analysed the conservation of iPfam domain pairs between species.</sentence>
					<sentence id="S4.138">We call an iPfam domain pair conserved when the same pair is observed in experimental interactions of two different species.</sentence>
					<sentence id="S4.139">The matrix in Table 2 shows the pair-wise conservation of iPfam domain pairs.</sentence>
					<sentence id="S4.140">For each species, a maximum of 40% to 90% of iPfam domain pairs can also be found in another species, although not all overlaps are as large.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S4.141">Matrix of mutual shared iPfam domain pairs</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.142">The Table shows the number of co-occurences of iPfam domain pairs between two species.</sentence>
					<sentence id="S4.143">The right-most column lists the total number of unique iPfam pairs found in each species' experimental interactions.</sentence>
					<sentence id="S4.144">Figure 4 shows a Venn diagram of the mutual overlaps between the two eukaryotes S. cerevisiae and H. sapiens and the prokaryote E. coli.</sentence>
					<sentence id="S4.145">While the eukaryotes share 524 domain pairs, only 158 iPfam domain pairs are shared between S. cerevisiae and E. coli, and only 135 between E. coli and H. sapiens.</sentence>
					<sentence id="S4.146">Remarkably, 53% of the observed iPfam domain pairs in E. coli are also observed in one of the two eukaryotes, and 107 iPfam domain pairs are even conserved amongst all three species.</sentence>
					<sentence id="S4.147">The iPfam domains in these pairs are related to housekeeping activities such as translation, replication or ATP synthesis.</sentence>
					<sentence id="S4.148">Additional file 3 contains a list of the conserved iPfam domain pairs.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S4.149">Venn diagramm showing the fractions of iPfam domain pairs found in the E. coli, S. cerevisiae and H. sapiens binary protein interaction sets</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.150">Venn diagramm showing the fractions of iPfam domain pairs found in the E. coli, S. cerevisiae and H. sapiens binary protein interaction sets.</sentence>
					<sentence id="S4.151">The three circles represent the iPfam domain pairs observed in the respective species.</sentence>
					<sentence id="S4.152">The overlaps denote co-observed iPfam domain pairs.</sentence>
					<sentence id="S4.153">The grey set in the background represents iPfam domain pairs <xcope id="X4.153.1"><cue type="negation" ref="X4.153.1">not</cue> found in the three species</xcope>.</sentence>
					<sentence id="S4.154">We also compared the iPfam domain pair frequencies between H. sapiens and S. cerevisiae directly.</sentence>
					<sentence id="S4.155">We derive a Spearman correlation coefficient of 0.50 for the frequencies of all 524 iPfam domain pairs that are conserved between S. cerevisiae and H. sapiens.</sentence>
					<sentence id="S4.156">To test <xcope id="X4.156.1"><cue type="speculation" ref="X4.156.1">whether</cue> the correlation is an artefact of the distribution of the values</xcope>, we recalculated the correlation 1000 times, each time shuffling one distribution randomly.</sentence>
					<sentence id="S4.157">From these random results, we derive a P value of 3.6 &#183; 10-30 that the observed correlation is random.</sentence>
					<sentence id="S4.158">This <xcope id="X4.158.2"><cue type="speculation" ref="X4.158.2">suggests</cue> that iPfam domain pairs with a large number of occurrences in one species <xcope id="X4.158.1"><cue type="speculation" ref="X4.158.1">tend</cue> also to be more frequent in the other</xcope></xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.159">Predicting the total number of iPfam domain pairs in nature</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.160">Our analysis allow us to estimate how many iPfam domain pairs <xcope id="X4.160.1"><cue type="speculation" ref="X4.160.1">would</cue> eventually cover all protein interactions</xcope>.</sentence>
					<sentence id="S4.161">This corresponds to the predictions made by Aloy and Russel 2.</sentence>
					<sentence id="S4.162">Similar to their approach, we make a linear estimation with the following factors:</sentence>
					<sentence id="S4.163">&#967;S The number of iPfam domain pairs observed in species S</sentence>
					<sentence id="S4.164">&#952;S The number of observed interactions in species S that contain an iPfam domain pair</sentence>
					<sentence id="S4.165">&#920;S The total number of observed interactions in species S</sentence>
					<sentence id="S4.166">&#968;S The number of proteins from species S that are seen in an interaction screen</sentence>
					<sentence id="S4.167">&#936;S The proteome size for species S</sentence>
					<sentence id="S4.168">&#958;S The number of Pfam domains observed in all protein of species S</sentence>
					<sentence id="S4.169">&#926; The total number of known Pfam domains</sentence>
					<sentence id="S4.170">We denote the estimated number of iPfam domain pairs in species S with x^S.</sentence>
					<sentence id="S4.171">The formula we apply is x^S=&#967;S&#8901;&#920;S&#952;S&#8901;&#936;S&#968;S</sentence>
					<sentence id="S4.172">This means we scale the observed number of iPfam domain pairs to cover all observed interactions.</sentence>
					<sentence id="S4.173">We then use the relative proteome coverage to estimate the total number of iPfam domain pairs in all proteins.</sentence>
					<sentence id="S4.174">Finally, we follow the argument of Aloy and Russel that the number of Pfam families seen in species S indicates the fraction of the protein universe represented in the species.</sentence>
					<sentence id="S4.175">We therefore predict the total number of iPfam domain pairs x^ as x^=x^S&#8901;&#926;&#958;S.</sentence>
					<sentence id="S4.176">Both parameters and results of the calculation are shown in Table 3.</sentence>
					<sentence id="S4.177">The estimates for the total number of iPfam domain pairs ranges from 33813 to 120511, with an average of 76918.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S4.178">Prediction of total number of iPfam domain pairs</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.179">&#920;S The total number of observed interactions in species S</sentence>
					<sentence id="S4.180">&#952;S The number of observed interactions in species S that contain an iPfam domain pair</sentence>
					<sentence id="S4.181">&#936;S The proteome size for species S</sentence>
					<sentence id="S4.182">&#968;S The number of proteins from species S that are seen in an interaction screen</sentence>
					<sentence id="S4.183">&#967;S The number of iPfam domain pairs observed in species S</sentence>
					<sentence id="S4.184">The predicted total number of iPfam domain pairs in species S</sentence>
					<sentence id="S4.185">&#926; The total number of known Pfam domains</sentence>
					<sentence id="S4.186">&#950;S The number of Pfam domains observed in all protein of species S</sentence>
					<sentence id="S4.187">The estimated total number of iPfam domains in all species</sentence>
					<sentence id="S4.188">Prediction results are shown in bold font.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S4.189">Discussion</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.190">iPfam coverage is low</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.191">The coverage of iPfam on experimentally derived protein interactions is low.</sentence>
					<sentence id="S4.192">For S. cerevisiae, the species with the best mapped interactome, only 4.47% of the protein interactions contain an iPfam domain pair.</sentence>
					<sentence id="S4.193">Even in H. sapiens, where we <xcope id="X4.193.2"><cue type="speculation" ref="X4.193.2">suspect</cue> a positive bias due to the overrepresentation of disease-related proteins in both the PDB and protein interaction databases</xcope>, 81% of protein interactions do <xcope id="X4.193.1"><cue type="negation" ref="X4.193.1">not</cue> contain an iPfam domain pair</xcope>.</sentence>
					<sentence id="S4.194">This reveals the limits of our understanding of the molecular structure of protein interactions.</sentence>
					<sentence id="S4.195">Figure 1 also shows that a majority of protein interactions contains at least one pair of Pfam domains.</sentence>
					<sentence id="S4.196"> While there is <xcope id="X4.196.1"><cue type="negation" ref="X4.196.1">no</cue> structural information about putative interactions between these pairs</xcope>, this fraction can already be analysed using statistical methods to identify putative domain interactions 7910.</sentence>
					<sentence id="S4.197">This in turn creates new targets for future structural genomics projects 30.</sentence>
					<sentence id="S4.198">Prioritising these targets according to the number of covered experimental interactions <xcope id="X4.198.1"><cue type="speculation" ref="X4.198.1">could</cue> increase the coverage of databases like iPfam quickly</xcope>.</sentence>
					<sentence id="S4.199">We find, however, that iPfam domain pairs occur significantly more often in experimental interactions than <xcope id="X4.199.1"><cue type="speculation" ref="X4.199.1">would</cue> be expected by chance</xcope>.</sentence>
					<sentence id="S4.200">This requires that at least a subset of the iPfam domain pairs are reused in several experimental interactions.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.201">iPfam domain pairs can act as modules</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.202">Despite the low overall coverage, iPfam domain pairs are found in more protein interactions than <xcope id="X4.202.1"><cue type="speculation" ref="X4.202.1">would</cue> be expected by chance</xcope> (see Table 1).</sentence>
					<sentence id="S4.203">This statistical overrepresentation <xcope id="X4.203.1"><cue type="speculation" ref="X4.203.1">suggests</cue> that certain iPfam domain pairs constitute modules of molecular recognition which are reused in different protein interactions</xcope> 2.</sentence>
					<sentence id="S4.204">In fact, we find a characteristic power law distribution when we plot the histogram of experimental interactions per iPfam domain pair, see Figure 2.</sentence>
					<sentence id="S4.205">This underlines that a few promiscuous iPfam domain pairs are responsible for the majority of the iPfam coverage.</sentence>
					<sentence id="S4.206"><xcope id="X4.206.1">These iPfam domain pairs are most <cue type="speculation" ref="X4.206.1">likely</cue> to be reusable modules</xcope>.</sentence>
					<sentence id="S4.207">In fact, we find the most frequent iPfam domain pairs to be recognition domains in signal transduction.</sentence>
					<sentence id="S4.208">Conversely, a large number of iPfam domain pairs are specific to a small number of protein interactions.</sentence>
					<sentence id="S4.209">This <xcope id="X4.209.1"><cue type="speculation" ref="X4.209.1">implies</cue> that recognition specificity amongst proteins is often achieved by maintaining an exclusive interacting domain pair</xcope>.</sentence>
					<sentence id="S4.210">This <xcope id="X4.210.1"><cue type="speculation" ref="X4.210.1">could</cue> pose a problem for purely statistical approaches to infer domain interactions</xcope>:: if for many interfaces the real interacting domain pair will only occur once in an interactome, it will be hard to elucidate this on a statistical basis.</sentence>
					<sentence id="S4.211">The concept of modularity of interacting domain pairs is furthermore supported by the positive correlation between the number of protein interactions an iPfam domain pair is seen in and the connectivity of the interacting proteins.</sentence>
					<sentence id="S4.212">We <xcope id="X4.212.2"><cue type="speculation" ref="X4.212.2">hypothesise</cue> that if during the course of evolution a protein is duplicated, <xcope id="X4.212.1">it is <cue type="speculation" ref="X4.212.1">likely</cue> to retain connections with other proteins which contain the same domain interaction modules</xcope></xcope>.</sentence>
					<sentence id="S4.213">It is clear, however, that even though recognition domains are reused in various proteins, their specificity is bound to be controlled.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.214">Many domain&#8211;domain interfaces remain to be resolved</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.215">We tried to estimate how many iPfam domain pairs exists in all interactomes.</sentence>
					<sentence id="S4.216">Our predictions lie almost an order of magnitude higher than the 10000 domain interaction types proposed by Aloy and Russel 2.</sentence>
					<sentence id="S4.217">While all such estimates should be taken with caution, our results show that at best 10% of all structural domain pairs are represented in iPfam.</sentence>
					<sentence id="S4.218">The statistical approaches described in the introduction can only cover a small fraction of this domaininteraction space.</sentence>
					<sentence id="S4.219">Riley et al. for example report only 3005 interacting domain pairs which could be inferred from protein interactions 7.</sentence>
					<sentence id="S4.220">Even under the <xcope id="X4.220.3"><cue type="speculation" ref="X4.220.3">assumption</cue> that many interactions involve short linear motifs, it <xcope id="X4.220.2"><cue type="speculation" ref="X4.220.2">seems</cue> <xcope id="X4.220.1"><cue type="speculation" ref="X4.220.1">likely</cue> that a large number of domain interactions remain to be resolved</xcope></xcope></xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.221">iPfam domain pairs are conserved during evolution</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.222">iPfam domain pairs are not only recurrent within the protein interaction network of one species.</sentence>
					<sentence id="S4.223"><xcope id="X4.223.1">They also <cue type="speculation" ref="X4.223.1">appear</cue> to be conserved between species</xcope>.</sentence>
					<sentence id="S4.224">In a small set of protein structures from S. cerevisiae, it has been shown that interacting domain pairs are more conserved than non-interacting domain pairs 10.</sentence>
					<sentence id="S4.225">Here, we call an iPfam domain pair conserved if there are protein interactions in two species which contain the same iPfam domain pair.</sentence>
					<sentence id="S4.226">In a recent study 31, Gandhi et al. have assessed the conservation of protein interactions as the co-occurrence of orthologous interacting proteins.</sentence>
					<sentence id="S4.227">They found only 16 orthologous interacting protein pairs that were conserved in S. cerevisiae, C. elegans, D. melanogaster and H. sapiens.</sentence>
					<sentence id="S4.228">Conversely, we find that 71 iPfam domain pairs are conserved in the experimental interactions of these species.</sentence>
					<sentence id="S4.229">Even between a prokaryote like E. coli and the two eukaryotes S. cerevisiae and H. sapiens there is a considerable proportion of conserved iPfam domain pairs, to the extent that 53% of the iPfam domain pairs from E. coli are also observed in a eukaryote (Table 2).</sentence>
					<sentence id="S4.230">107 domain pairs are shared between E. coli, S. cerevisiae and H. sapiens.</sentence>
					<sentence id="S4.231">These domains are predominantly related to transcription, translation and other basic essential cellular activities, which is in congruence with the findings of Gandhi et al.</sentence>
					<sentence id="S4.232">Although the low overall iPfam coverage hampers the interpretation of our results, it <xcope id="X4.232.1"><cue type="speculation" ref="X4.232.1">looks as</cue> if there has been a diversification of domain interactions from E. coli to H. sapiens</xcope>.</sentence>
					<sentence id="S4.233">While more than half of the iPfam domain pairs in E. coli have been retained throughout evolution, <xcope id="X4.233.1">numerous new ones <cue type="speculation" ref="X4.233.1">seem</cue> to have emerged in eukaryotic development</xcope>.</sentence>
					<sentence id="S4.234">The significant positive correlation in the frequency of iPfam domain pairs conserved between S. cerevisiae and H. sapiens also <xcope id="X4.234.3"><cue type="speculation" ref="X4.234.3">suggests</cue> that the binding interfaces are more often <xcope id="X4.234.2">kept <cue type="speculation" ref="X4.234.2">or</cue> even reused</xcope> <xcope id="X4.234.1"><cue type="negation" ref="X4.234.1">rather than</cue> lost</xcope> in the course of evolution</xcope>.</sentence>
					<sentence id="S4.235">Conversely, this also <xcope id="X4.235.3"><cue type="speculation" ref="X4.235.3">raises the question</cue> of <xcope id="X4.235.2"><cue type="speculation" ref="X4.235.2">whether</cue> one <xcope id="X4.235.1"><cue type="speculation" ref="X4.235.1">could</cue> establish a comprehensive set of domain interactions that were present in the last universal common ancestor</xcope></xcope></xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S4.236">Conclusion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.237">In this study, we addressed the utility of current knowledge about structural domain interactions in order to interpret experimental protein interactions.</sentence>
					<sentence id="S4.238">Disappointingly, only a small fraction of all experimental interactions can be attributed to a known domain interaction.</sentence>
					<sentence id="S4.239">Within this subset of interactions, we nevertheless made several reassuring observations: structural domain pairs are enriched in experimental protein interactions.</sentence>
					<sentence id="S4.240">Some of the domain pairs <xcope id="X4.240.1"><cue type="speculation" ref="X4.240.1">seem</cue> to mediate a large number of protein interactions, thus acting as reusable connectors</xcope>.</sentence>
					<sentence id="S4.241">This property is also conserved between species.</sentence>
					<sentence id="S4.242">Taken as a whole, this further underlines that solving structures of protein complexes should be an important focus for future structural genomics projects.</sentence>
					<sentence id="S4.243">Targeting the most frequent domain pairs <xcope id="X4.243.1"><cue type="speculation" ref="X4.243.1">would</cue> increase the coverage of databases such as iPfam, shedding more light onto the molecular mechanisms underpinning cellular networks</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="Title">
					<sentence id="S4.244">Methods</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.245">Protein interaction data</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.246">The complete interaction sets from BioGRID 20, DIP 21, HPRD 22, IntAct 23 and MPact 24 were downloaded.</sentence>
					<sentence id="S4.247">A wide range of databases were used to cover as many distinct experimental data sets as possible.</sentence>
					<sentence id="S4.248">BioGRID for example contains a large manually curated set of protein interactions for S. cerevisiae 19.</sentence>
					<sentence id="S4.249">Similarily, HPRD hosts a set of manually curated protein interactions for H. sapiens.</sentence>
					<sentence id="S4.250">IntAct on the other hand contains results from high-throughput screens and integrates data from other protein interaction databases as part of the IMEx collaboration.</sentence>
					<sentence id="S4.251">The MPact database combines the manually curated S. cerevisiae protein complexes data set formerly known as the MIPS complexes with other high-throughput interaction experiments data.</sentence>
					<sentence id="S4.252">Taken together, these databases represent most of the protein interactions currently stored in machine-accessible form.</sentence>
					<sentence id="S4.253">Despite great efforts to unify access to protein interaction data 32, acquiring large data sets from diverse sources is still far from trivial and error prone.</sentence>
					<sentence id="S4.254">The PSI-MI XML data exchange format provided by the aforementioned databases was used to generate a local relational database of protein interactions.</sentence>
					<sentence id="S4.255">All entries were mapped to UniProt 33 by either relying on existing annotations from the source databases or by pair-wise sequence alignment to all UniProt proteins from the same species as the query protein.</sentence>
					<sentence id="S4.256">The direct sequence comparison was performed using pmatch, a very fast pairwise alignment algorithm developed by Richard Durbin (unpublished, source code available 34).</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.257">Species</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.258">To allow cross-species comparisons, the data were split into five distinct species sets: E. coli, S. cerevisiae, C. elegans, D. melanogaster and H. sapiens.</sentence>
					<sentence id="S4.259">It should be noted that the proportion of proteins for which an interaction is known varies greatly between the species, see Table 1.</sentence>
					<sentence id="S4.260">This <xcope id="X4.260.1"><cue type="speculation" ref="X4.260.1">might</cue> affect the results</xcope> if there is a systematic bias on the composition of a protein interaction set.</sentence>
					<sentence id="S4.261">To prevent bias from multiple alternative versions of the same protein, all interacting proteins were mapped to reference proteomes as defined by Integr8 35, again using pmatch.</sentence>
					<sentence id="S4.262">An average of &#8776; 16% of interaction entries were lost in the mapping process, either if <xcope id="X4.262.2"><cue type="negation" ref="X4.262.2">no</cue> sequence was provided with the original entry</xcope> or if <xcope id="X4.262.1"><cue type="negation" ref="X4.262.1">no</cue> significant matching sequence could be found in Integr8</xcope>.</sentence>
					<sentence id="S4.263">The total number of missing proteins will be lower, as several entries from different databases refer to the same sequence.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.264">iPfam</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.265">The iPfam database is derived from protein structures deposited in the PDB.</sentence>
					<sentence id="S4.266">Regions in every protein structure that match a Pfam domain are scanned for interactions with residues in another Pfam domain.</sentence>
					<sentence id="S4.267">All such interacting domain pairs are stored in a database together with detailed information on the residues involved 13.</sentence>
					<sentence id="S4.268">Every pair of Pfam families that are found to interact in a PDB structure are called an iPfam domain pair throughout the text.</sentence>
					<sentence id="S4.269">Single Pfam families that are part of an iPfam domain pair are then called iPfam domains.</sentence>
					<sentence id="S4.270">For example, in PDB entry 1k9a the two iPfam domains SH2 (Pfam accession PF00017) and Pkinase_Tyr (PF07714) interact, therefore they form an iPfam domain pair.</sentence>
					<sentence id="S4.271">In this study, iPfam version 21 was employed, containing 2837 iPfam domains, forming 4030 iPfam domain pairs.</sentence>
					<sentence id="S4.272">Figure 5 shows the species distribution of iPfam domain pairs.</sentence>
					<sentence id="S4.273">H. sapiens, E. coli and S. cerevisiae are clearly over-represented compared to the other 1113 species with less than 179 complex structures.</sentence>
					<sentence id="S4.274">Some iPfam domain pairs are seen to form interactions between distinct peptide chains in the structure (interchain), while others form an interaction between two distinct domains within the same chain (intrachain).</sentence>
					<sentence id="S4.275">In iPfam version 21, there are 3407 interchain and 1171 intrachain domain pairs, which means that 548 domain pairs mediate both inter- and intrachain interactions.</sentence>
					<sentence id="S4.276">In this analysis, both types of domain interactions were used equivalently, <xcope id="X4.276.2"><cue type="speculation" ref="X4.276.2">assuming</cue> that intrachain interactions <xcope id="X4.276.1"><cue type="speculation" ref="X4.276.1">can</cue> become interchain interactions and vice-versa as a result of a gene-fission/fusion events</xcope></xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S4.277">Species distribution of iPfam domain pairs</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.278">Species distribution of iPfam domain pairs.</sentence>
					<sentence id="S4.279">This pie chart shows how many iPfam domain pairs were found in PDB structures from each species.</sentence>
					<sentence id="S4.280">The total number is larger than the 4030 unique iPfam pairs in the database because an iPfam pair can be found in structures from several species.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.281">Filtering</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.282">There are many types of experiments used to derive protein interactions, with different properties and error rates.</sentence>
					<sentence id="S4.283">For this analysis, solely the properties of physically interacting proteins is of interest.</sentence>
					<sentence id="S4.284">Therefore, only interactions between exactly two proteins per experiment were considered.</sentence>
					<sentence id="S4.285">That means all protein complex data that were derived by co-purification methods were removed, unless a particular experiment had identified exactly two binding partners.</sentence>
					<sentence id="S4.286">All genetic interactions were also removed.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.287">Random networks</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.288">Randomised protein interaction networks with identical degree distributions were generated from the original filtered experimental interaction data for each species.</sentence>
					<sentence id="S4.289">In each randomisation step, a mapping is created that assigns every node a randomly chosen replacement node.</sentence>
					<sentence id="S4.290">In this way the edges of the network remain in place, while the nodes are shuffled randomly.</sentence>
					<sentence id="S4.291">It should be noted that the degree distribution per node is <xcope id="X4.291.1"><cue type="negation" ref="X4.291.1">not</cue> maintained</xcope>.</sentence>
					<sentence id="S4.292">Instead, this behaviour simulates a network with a high false positive rate.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S4.293">P values</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S4.294">P values for observations x were calculated as P(X &#8805; x) = f(x; &#956;, &#963;), where f(x; &#956;, &#963;) is the probability density function of the normal distribution with mean &#956; and standard deviation &#963;.</sentence>
					<sentence id="S4.295">&#956; and &#963; are estimated through the randomisation experiments.</sentence>
					<sentence id="S4.296">The density function thus provides the probability that a value less than or equal to x is observed by chance, given the distribution estimated by a random resampling method.</sentence>
					<sentence id="S4.297">Where appropriate, the inverse probability P(X &gt; x) = 1 - f(x; &#956;, &#963;) was applied.</sentence>
				</DocumentPart>
		</Document>
		<Document type="Biological_full_article">
			<DocID type="PMCID">PMC1064853</DocID>
				<DocumentPart type="Title">
					<sentence id="S5.1">Two Distinct E3 Ubiquitin Ligases Have Complementary Functions in the Regulation of Delta and Serrate Signaling in Drosophila</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S5.2">Abstract</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.3">Signaling by the Notch ligands Delta (Dl) and Serrate (Ser) regulates a wide variety of essential cell-fate decisions during animal development.</sentence>
					<sentence id="S5.4">Two distinct E3 ubiquitin ligases, Neuralized (Neur) and Mind bomb (Mib), have been shown to regulate Dl signaling in Drosophila melanogaster and Danio rerio, respectively.</sentence>
					<sentence id="S5.5">While the neur and mib genes are evolutionarily conserved, <xcope id="X5.5.1">their respective roles in the context of a single organism have <cue type="negation" ref="X5.5.1">not</cue> yet been examined</xcope>.</sentence>
					<sentence id="S5.6">We show here that the Drosophila mind bomb (D-mib) gene regulates a subset of Notch signaling events, including wing margin specification, leg segmentation, and vein determination, that are distinct from those events requiring neur activity.</sentence>
					<sentence id="S5.7">D-mib also modulates lateral inhibition, a neur- and Dl-dependent signaling event, <xcope id="X5.7.1"><cue type="speculation" ref="X5.7.1">suggesting</cue> that D-mib regulates Dl signaling</xcope>.</sentence>
					<sentence id="S5.8">During wing development, <xcope id="X5.8.2">expression of D-mib in dorsal cells <cue type="speculation" ref="X5.8.2">appears</cue> to be necessary and sufficient for wing margin specification</xcope>, <xcope id="X5.8.1"><cue type="speculation" ref="X5.8.1">indicating that</cue> D-mib also regulates Ser signaling</xcope>.</sentence>
					<sentence id="S5.9">Moreover, the activity of the D-mib gene is required for the endocytosis of Ser in wing imaginal disc cells.</sentence>
					<sentence id="S5.10">Finally, ectopic expression of neur in D-mib mutant larvae rescues the wing D-mib phenotype, <xcope id="X5.10.2"><cue type="speculation" ref="X5.10.2">indicating that</cue> Neur can compensate for the <xcope id="X5.10.1"><cue type="negation" ref="X5.10.1">lack</cue> of D-mib activity</xcope></xcope>.</sentence>
					<sentence id="S5.11">We conclude that D-mib and Neur are two structurally distinct proteins that have similar molecular activities but distinct developmental functions in Drosophila.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S5.12">Introduction</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.13">Cell-to-cell signaling mediated by receptors of the Notch (N) family has been implicated in various developmental decisions in organisms ranging from nematodes to mammals [1].</sentence>
					<sentence id="S5.14">N is well-known for its role in lateral inhibition, a key patterning process that organizes the regular spacing of distinct cell types within groups of equipotent cells.</sentence>
					<sentence id="S5.15">Additionally, N mediates inductive signaling between cells with distinct identities.</sentence>
					<sentence id="S5.16">In both signaling events, N signals via a conserved mechanism that involves the cleavage and release from the membrane of the N intracellular domain that acts as a transcriptional co-activator for DNA-binding proteins of the CBF1/Suppressor of Hairless/Lag-2 (CSL) family [2].</sentence>
					<sentence id="S5.17">Two transmembrane ligands of N are known in Drosophila, Delta (Dl) and Serrate (Ser) [3].</sentence>
					<sentence id="S5.18">Dl and Ser have distinct functions.</sentence>
					<sentence id="S5.19">For instance, Dl (but <xcope id="X5.19.1"><cue type="negation" ref="X5.19.1">not</cue> Ser)</xcope> is essential for lateral inhibition during early neurogenesis in the embryo [4].</sentence>
					<sentence id="S5.20">Conversely, Ser (but <xcope id="X5.20.1"><cue type="negation" ref="X5.20.1">not</cue> Dl)</xcope> is specifically required for segmental patterning [5].</sentence>
					<sentence id="S5.21">Some developmental decisions, however, require the activity of both genes: Dl and Ser are both required for the specification of wing margin cells during imaginal development [6,7,8,9,10].</sentence>
					<sentence id="S5.22"><xcope id="X5.22.2">These different requirements for Dl and Ser <cue type="speculation" ref="X5.22.2">appear</cue> to primarily result from their non-overlapping expression patterns <xcope id="X5.22.1"><cue type="negation" ref="X5.22.1">rather than</cue> from distinct signaling properties</xcope></xcope>.</sentence>
					<sentence id="S5.23">Consistent with this interpretation, <xcope id="X5.23.1">Dl and Ser have been <cue type="speculation" ref="X5.23.1">proposed</cue> to act redundantly in the sensory bristle lineage where they are co-expressed</xcope> ([11]; note, however, that results from another study have indicated a non-redundant function for Dl in the bristle lineage [12]).</sentence>
					<sentence id="S5.24">Furthermore, <xcope id="X5.24.1">Dl and Ser <cue type="speculation" ref="X5.24.1">appear</cue> to be partially interchangeable</xcope> because the forced expression of Ser can partially rescue the Dl neurogenic phenotype [13].</sentence>
					<sentence id="S5.25">Additionally, the ectopic expression of Dl can partially rescue the Ser wing phenotype [14].</sentence>
					<sentence id="S5.26">The notion that Dl and Ser have similar signaling properties has, however, recently been challenged by the observation that human homologs of Dl and Ser have distinct instructive signaling activity [15].</sentence>
					<sentence id="S5.27">Endocytosis has recently emerged as a key mechanism regulating the signaling activity of Dl.</sentence>
					<sentence id="S5.28">First, clonal analysis in Drosophila has <xcope id="X5.28.1"><cue type="speculation" ref="X5.28.1">suggested</cue> that dynamin-dependent endocytosis is required not only in signal-receiving cells but also in signal-sending cells to promote N activation</xcope> [16].</sentence>
					<sentence id="S5.29">Second, mutant Dl proteins that are endocytosis defective exhibit reduced signaling activity [17].</sentence>
					<sentence id="S5.30">Third, two distinct E3 ubiquitin ligases, Neuralized (Neur) and Mind bomb (Mib), have recently been shown to regulate Dl endocytosis and N activation in Drosophila and Danio rerio, respectively [18,19,20,21,22,23,24,25].</sentence>
					<sentence id="S5.31">Ubiquitin is a 76-amino-acid polypeptide that is covalently linked to substrates in a multi-step process that involves a ubiquitin-activating enzyme (E1), a ubiquitin-conjugating enzyme (E2), and a ubiquitin–protein ligase (E3).</sentence>
					<sentence id="S5.32">E3s recognize specific substrates and catalyze the transfer of ubiquitin to the protein substrate.</sentence>
					<sentence id="S5.33">Ubiquitin was first identified as a tag for proteins destined for degradation.</sentence>
					<sentence id="S5.34">More recently, ubiquitin has also been shown to serve as a signal for endocytosis [26,27].</sentence>
					<sentence id="S5.35">Mib in D. rerio and Neur in Drosophila and Xenopus have been shown to associate with Dl, regulate Dl ubiquitination, and promote its endocytosis [18,19,20,22,25,28].</sentence>
					<sentence id="S5.36">Moreover, genetic and transplantation studies have <xcope id="X5.36.2"><cue type="speculation" ref="X5.36.2">indicated that</cue> both Neur and Mib act in a non-autonomous manner [18,21,22,23,25,29], <xcope id="X5.36.1"><cue type="speculation" ref="X5.36.1">indicating that</cue> endocytosis of Dl is associated with increased Dl signaling activity</xcope></xcope>.</sentence>
					<sentence id="S5.37">Finally, epsin, a regulator of endocytosis that contains a ubiquitin-interacting motif and that is known in Drosophila as Liquid facet, is essential for Dl signaling [30,31].</sentence>
					<sentence id="S5.38">In one study, <xcope id="X5.38.3">Liquid facet was <cue type="speculation" ref="X5.38.3">proposed</cue> to target Dl to an endocytic recycling compartment, <xcope id="X5.38.2"><cue type="speculation" ref="X5.38.2">suggesting</cue> that <xcope id="X5.38.1">recycling of Dl <cue type="speculation" ref="X5.38.1">may</cue> be required for signaling</xcope></xcope></xcope>.</sentence>
					<sentence id="S5.39">Accordingly, <xcope id="X5.39.2"><xcope id="X5.39.3">signaling <cue type="speculation" ref="X5.39.2">would</cue> <cue type="negation" ref="X5.39.3">not</cue> be linked directly to endocytosis</xcope></xcope>, but endocytosis <xcope id="X5.39.1"><cue type="speculation" ref="X5.39.1">would</cue> be prerequisite for signaling</xcope> [30].</sentence>
					<sentence id="S5.40"><xcope id="X5.40.1">How endocytosis of Dl leads to the activation of N <cue type="speculation" ref="X5.40.1">remains to be elucidated</cue></xcope>.</sentence>
					<sentence id="S5.41">Also, <xcope id="X5.41.1"><xcope id="X5.41.2"><cue type="speculation" ref="X5.41.1">whether</cue> the signaling activity of Ser is similarly regulated by endocytosis</xcope> is <cue type="speculation" ref="X5.41.2">not known</cue></xcope>.</sentence>
					<sentence id="S5.42">Neur and Mib proteins completely differ in primary structure.</sentence>
					<sentence id="S5.43">Drosophila Neur is a 754-amino-acid protein that contains two conserved Neur homology repeats of unknown function and one C-terminal catalytic really interesting new gene (RING) domain.</sentence>
					<sentence id="S5.44">D. rerio Mib (also known as DIP-1 in the mouse [32]) is a 1,030-amino-acid protein with one ZZ zinc finger domain surrounded by two Mib/HERC2 domains, two Mib repeats, eight ankyrin repeats, two atypical RING domains, and one C-terminal catalytic RING domain.</sentence>
					<sentence id="S5.45">Both genes have been conserved from flies to mammals [18,19,33,34].</sentence>
					<sentence id="S5.46">While genetic analysis has revealed that neur in Drosophila and mib in D. rerio are strictly required for N signaling, knockout studies of mouse Neur1 has <xcope id="X5.46.1"><cue type="speculation" ref="X5.46.1">indicated that</cue> NEUR1 is not strictly required for N signaling</xcope> [33,34].</sentence>
					<sentence id="S5.47">One <xcope id="X5.47.1"><cue type="speculation" ref="X5.47.1">possible</cue> explanation is functional redundancy with the mouse Neur2 gene</xcope>.</sentence>
					<sentence id="S5.48">Conversely, <xcope id="X5.48.1">the function of Drosophila mib (D-mib), the homolog of D. rerio mib gene, is <cue type="speculation" ref="X5.48.1">not known</cue></xcope>.</sentence>
					<sentence id="S5.49">To establish the respective roles of these two distinct E3 ligases in the context of a single model organism, we have studied the function of the Drosophila D-mib gene.</sentence>
					<sentence id="S5.50">We report here that <xcope id="X5.50.1">D-mib, like D. rerio Mib, <cue type="speculation" ref="X5.50.1">appears</cue> to regulate Dl signaling during leg segmentation, wing vein formation, and lateral inhibition in the adult notum</xcope>.</sentence>
					<sentence id="S5.51">We further show that D-mib is specifically required for Ser endocytosis and signaling during wing development, <xcope id="X5.51.1"><cue type="speculation" ref="X5.51.1">indicating</cue> for the first time, to our knowledge, that endocytosis regulates Ser signaling</xcope>.</sentence>
					<sentence id="S5.52">Interestingly, the D-mib activity was found necessary for a subset of N signaling events that are distinct from those requiring the activity of the neur gene.</sentence>
					<sentence id="S5.53">Nevertheless, the ectopic expression of Neur compensates for the loss of D-mib activity in the wing, <xcope id="X5.53.1"><cue type="speculation" ref="X5.53.1">indicating that</cue> Neur and D-mib have overlapping functions</xcope>.</sentence>
					<sentence id="S5.54">We conclude that D-mib and Neur are two structurally distinct proteins with similar molecular activities but distinct and complementary functions in Drosophila.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S5.55">Results</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.56">Isolation of D-mib Mutations</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.57">The closest Drosophila homolog of the vertebrate mib gene is the predicted gene CG5841, D-mib [18].</sentence>
					<sentence id="S5.58">The D-mib mutations identified are shown in Figure 1.</sentence>
					<sentence id="S5.59">A P-element inserted into the 5? untranslated region of the D-mib gene was recently isolated (http://flypush.imgen.bcm.tmc.edu/pscreen/) (Figure 1A).</sentence>
					<sentence id="S5.60">Insertion of this P-element confers late pupal lethality.</sentence>
					<sentence id="S5.61">Lethality was reverted by precise excision of the P-element, <xcope id="X5.61.1"><cue type="speculation" ref="X5.61.1">suggesting</cue> that insertion of this P-element is a D-mib mutation, referred to as D-mib1</xcope>.</sentence>
					<sentence id="S5.62">A 13.6-kb deletion that removes the entire D-mib coding region was selected by imprecise excision of this P-element.</sentence>
					<sentence id="S5.63">This deletion represents a null allele of D-mib and was named D-mib2.</sentence>
					<sentence id="S5.64">This deletion also deletes the 3? flanking RpS31 gene (Figure 1A).</sentence>
					<sentence id="S5.65">The D-mib1 and D-mib2mutant alleles did <xcope id="X5.65.1"><cue type="negation" ref="X5.65.1">not</cue> complement the l(3)72CdaJ12 and l(3)72CdaI5 lethal mutations that have been mapped to the same cytological interval as the D-mib gene [35]</xcope>.</sentence>
					<sentence id="S5.66">This <xcope id="X5.66.1"><cue type="speculation" ref="X5.66.1">indicates that</cue> these two lethal mutations are D-mib mutant alleles</xcope>, and they were therefore renamed D-mib3 and D-mib4, respectively.</sentence>
					<sentence id="S5.67">The D-mib1 and D-mib3 mutations behave as genetic null alleles (see Materials and Methods).</sentence>
					<sentence id="S5.68">In contrast, D-mib4 is a partial loss-of-function allele because flies trans-heterozygous for D-mib4 and any other D-mib null alleles are viable.</sentence>
					<sentence id="S5.69">These four mutations identify the CG5841 gene as D-mib by the following evidence.</sentence>
					<sentence id="S5.70">First, lethality of homozygous D-mib1 pupae is associated with the insertion of a P-element into the 5? UTR of the D-mib gene.</sentence>
					<sentence id="S5.71">Second, genomic sequencing of the D-mib3 allele revealed the presence of a stop codon at position 258 (Figure 1B).</sentence>
					<sentence id="S5.72"><xcope id="X5.72.1">This allele is therefore <cue type="speculation" ref="X5.72.1">predicted</cue> to produce a truncated protein devoid of the catalytic RING domain, consistent with D-mib3 being a null allele</xcope>.</sentence>
					<sentence id="S5.73">Genomic sequencing of the D-mib4 allele showed that this mutation is associated with a valine-to-methionine substitution at a conserved position in the second Mib repeat (Figure 1B).</sentence>
					<sentence id="S5.74">Third, Western blot analysis showed that the D-mib protein was <xcope id="X5.74.1"><cue type="negation" ref="X5.74.1">not</cue> detectable in imaginal disc and brain complex extracts prepared from homozygous D-mib1 and D-mib1/D-mib2 larvae</xcope> (Figure 1C and C?).</sentence>
					<sentence id="S5.75">Fourth, the leaky, GAL4-independent expression of a UAS-D-mib transgene fully rescued the lethality of D-mib1/D-mib2 flies <xcope id="X5.75.1">(data <cue type="negation" ref="X5.75.1">not</cue> shown</xcope>; see also Figure 1H).</sentence>
					<sentence id="S5.76">Thus, our analysis identified both complete and partial D-mib loss-of-function alleles.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S5.77">Molecular and Genetic Characterization of D-mib Mutations</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.78">(A) Molecular map of the D-mib locus showing the position of the P-element inserted into the 5? untranslated region (allele D-mib1) and the 13.6 kb deletion that removes the D-mib and the RpS31 genes (allele D-mib2).</sentence>
					<sentence id="S5.79">Transcribed regions are indicated with arrows, and exons are indicated with boxes.</sentence>
					<sentence id="S5.80">Open reading frames are shown in black.</sentence>
					<sentence id="S5.81">(B) Domain composition of D-mib and D. rerio Mib.</sentence>
					<sentence id="S5.82">Both proteins show identical domain organization.</sentence>
					<sentence id="S5.83">D-mib has an N-terminal ZZ zinc finger flanked on either side by a Mib/HERC2 (M-H) domain, followed by two Mib repeats, six ankyrin repeats, two atypical RING domains, and a C-terminal protypical RING that has been associated with catalytic E3 ubiquitin ligase activity.</sentence>
					<sentence id="S5.84"><xcope id="X5.84.1">The D-mib3 mutant allele is <cue type="speculation" ref="X5.84.1">predicted</cue> to produce a truncated protein devoid of E3 ubiquitin ligase activity</xcope> whereas the D-mib4 protein carries a mutation at a conserved position in the second Mib repeat.</sentence>
					<sentence id="S5.85">(C and C?) Western blot analysis of D-mib (C).</sentence>
					<sentence id="S5.86">The endogenous D-mib protein (predicted size: 130 kDa) was detected in S2 cells (lane 2) and in imaginal discs from wild-type larvae (lane 3) but was <xcope id="X5.86.1"><cue type="negation" ref="X5.86.1">not</cue> detectable in homozygous D-mib1 (lane 4) and D-mib1/D-mib3 (lane 5) third instar larvae</xcope>.</sentence>
					<sentence id="S5.87">The D-mib protein produced in transfected S2 cells from the cDNA used in this study (lane 1) runs exactly as endogenous D-mib (lane 2).</sentence>
					<sentence id="S5.88">Panel C? shows a Red Ponceau staining of the gel with the same protein samples as in panel C.</sentence>
					<sentence id="S5.89">(D–H) Wings from wild-type (D), D-mib1 (E), SerRX82/Serrev6.1 (F), D-mib2/D-mib4 (G), and UAS-D-mib2/+; D-mib1/D-mib2 flies (H).</sentence>
					<sentence id="S5.90">D-mib (E) and Ser (F) mutant flies showed similar wing loss phenotypes.</sentence>
					<sentence id="S5.91">The D-mib mutant phenotype could be almost fully rescued by a leaky UAS-D-mib transgene (H).</sentence>
					<sentence id="S5.92">(D?) and (G?) show high magnification views of (D) and (G), respectively, to show that D-mib2/D-mib4 mutant flies (G?) exhibited ectopic sensilla (arrowheads) along vein L3.</sentence>
					<sentence id="S5.93">(I–N) Nota (I–K) and legs (L–N) from wild-type (I and L), D-mib1 (J and M), and SerRX82/Serrev6.1 (K and N) flies.</sentence>
					<sentence id="S5.94">D-mib mutant flies showed a weak neurogenic phenotype (J) <xcope id="X5.94.1">that was <cue type="negation" ref="X5.94.1">not</cue> observed in Ser mutant flies (K)</xcope>.</sentence>
					<sentence id="S5.95">Ectopic sensory organs in D-mib mutant flies developed from <xcope id="X5.95.1">ectopic sensory organ precursor cells <cue type="negation" ref="X5.95.1">(not</cue> shown)</xcope>.</sentence>
					<sentence id="S5.96">D-mib (M) and Ser (N) mutant legs also showed distinct growth and/or elongation defects.</sentence>
					<sentence id="S5.97">Arrows in (J) show ectopic macrochaetes.</sentence>
					<sentence id="S5.98">Arrows in (L–N) indicate the joints.</sentence>
					<sentence id="S5.99">Ti, tibia; t1 to t5, tarsal segments 1 to 5.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.100">D-mib Regulates Dl Signaling</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.101">Complete loss of zygotic D-mib activity in homozygous D-mib1 and trans-heterozygous D-mib2/D-mib3, D-mib1/D-mib3 and D-mib1/D-mib2 individuals led to late pupal lethality.</sentence>
					<sentence id="S5.102">Mutant pupae died as pharate adults showing ectopic macrochaetes, increased microchaete density on the dorsal thorax (Figure 1I and 1J), short legs <xcope id="X5.102.1"><cue type="negation" ref="X5.102.1">lacking</cue> tarsal segmentation</xcope> (Figure 1L and 1M), and nearly complete loss of eye and wing tissues (Figure 1D and 1E).</sentence>
					<sentence id="S5.103">Tissue losses were associated with a dramatic reduction in size of the eye field and of the wing pouch in mutant discs of third instar larvae (Figure 2A–2E).</sentence>
					<sentence id="S5.104">Hypomorphic D-mib2/D-mib4 mutant flies only showed ectopic sensory organs, rough eyes, small wings, and thickened veins (Figure 1D, 1D?, 1G, and 1G?; <xcope id="X5.104.1">data <cue type="negation" ref="X5.104.1">not</cue> shown)</xcope>.</sentence>
					<sentence id="S5.105">All these phenotypes <xcope id="X5.105.1"><cue type="speculation" ref="X5.105.1">may</cue> result from reduced N signaling</xcope>.</sentence>
					<sentence id="S5.106">More specifically, <xcope id="X5.106.2">the bristle and leg phenotypes are <cue type="speculation" ref="X5.106.2">likely</cue> to result from reduced signaling by Dl (and <xcope id="X5.106.1"><cue type="negation" ref="X5.106.1">not</cue> by Ser)</xcope></xcope>.</sentence>
					<sentence id="S5.107">Indeed, a reduction in Dl-mediated lateral inhibition can result in ectopic sensory organs and increased bristle density on the body surface.</sentence>
					<sentence id="S5.108">In contrast, a complete loss of Ser signaling had <xcope id="X5.108.1"><cue type="negation" ref="X5.108.1">no</cue> effect on bristle density</xcope> (Figure 1K).</sentence>
					<sentence id="S5.109">Likewise, loss of Dl signaling has been shown to result in short unsegmented legs, similar to the ones seen in the absence of D-mib activity (Figure 1M), whereas a complete loss of Ser activity led to the formation of elongated unsegmented legs (Figure 1N) [36,37,38].</sentence>
					<sentence id="S5.110">Finally, the vein phenotype seen in D-mib hypomorphic flies is similar to the one seen in Dlts mutant flies [39].</sentence>
					<sentence id="S5.111">Together, these observations <xcope id="X5.111.1"><cue type="speculation" ref="X5.111.1">suggest</cue> that D-mib regulates Dl signaling in several developmental contexts</xcope>.</sentence>
					<sentence id="S5.112">Consistent with this conclusion, we have shown that D-mib binds Dl and promotes Dl signaling and that overexpression of D-mib down-regulates the accumulation of Dl at the cell surface (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S5.113">The D-mib and neur Genes Have Distinct Functions during Wing Development</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.114">(A–E) Wing imaginal discs (B–E) from wild-type (B and D), D-mib1 (C), and D-mib1/D-mib2 (E) third instar larvae stained for Cut (B and C) and wg-lacZ (D and E).</sentence>
					<sentence id="S5.115">D-mib mutant discs showed a dramatically reduced size of the wing pouch (see diagram in [A] showing the different regions of the wing imaginal disc; V, ventral; D, dorsal), as well as a complete loss of Cut and wg-lacZ (red arrows in [B–E]) expression at the wing margin.</sentence>
					<sentence id="S5.116"><xcope id="X5.116.1">Expression of wg-lacZ in the hinge region (arrowheads in [D] and [E]) and the accumulation of Cut in sensory cells (small arrows in [B] and [C]) and muscle precursor cells (large arrowheads in [B] and [C]) <cue type="speculation" ref="X5.116.1">appeared</cue> to be largely unaffected)</xcope>.</sentence>
					<sentence id="S5.117"><xcope id="X5.117.1">(F and F?) Expression of Cut (red) at the wing margin was <cue type="negation" ref="X5.117.1">not</cue> affected by the complete loss of neur activity in neur1F65 mutant clones</xcope> (indicated by the loss of the nuclear green fluorescent protein [GFP] marker, in green).</sentence>
					<sentence id="S5.118">Bar is 50 ?m in (B–E) and 20 ?m in (F and F?).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.119">D-mib and neur Have Distinct Functions</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.120">We then studied in more detail the function of D-mib during wing development.</sentence>
					<sentence id="S5.121">Growth of the wing pouch depends on the activity of an organizing center located at the dorsal-ventral (D-V) boundary [40,41].</sentence>
					<sentence id="S5.122">This boundary is established in first instar larvae and is defined by the apterous expression boundary.</sentence>
					<sentence id="S5.123">Apterous activates the expression of the Ser and fringe genes in dorsal cells.</sentence>
					<sentence id="S5.124">High levels of Ser in dorsal cells activate N in trans in ventral cells and suppress N activation in cis in dorsal cells, whereas Fringe modifies N in dorsal cells such that dorsal cells located at the D-V boundary respond to Dl.</sentence>
					<sentence id="S5.125">Thus, composite signaling by Ser and Dl leads to symmetric N activation in margin cells located along the D-V boundary [8,9,42,43].</sentence>
					<sentence id="S5.126">N then regulates the expression of the vestigial and wingless (wg) genes that cooperate to promote growth of the wing pouch.</sentence>
					<sentence id="S5.127">N also regulates expression of the cut gene in margin cells [44].</sentence>
					<sentence id="S5.128">Thus, loss of N signaling results in a reduction in size of the wing pouch accompanied by the loss of cut and wg expression along the D-V boundary.</sentence>
					<sentence id="S5.129">A complete loss of Cut and Wg accumulation and wg-lacZ expression was observed in the central region of third instar D-mib mutant wing discs <xcope id="X5.129.1">(data <cue type="negation" ref="X5.129.1">not</cue> shown)</xcope>.</sentence>
					<sentence id="S5.130">Thus, the D-mib wing phenotype <xcope id="X5.130.1"><cue type="speculation" ref="X5.130.1">may</cue> result from defective N inductive signaling at the D-V boundary</xcope>.</sentence>
					<sentence id="S5.131">We conclude that the activity of the D-mib gene is required for the specification of the wing margin and, hence, growth of the wing pouch.</sentence>
					<sentence id="S5.132">Interestingly, <xcope id="X5.132.1">wing margin formation and expression of Cut are <cue type="negation" ref="X5.132.1">not</cue> affected by the complete loss of neur activity</xcope> (Figure 2F and 2F?) [45].</sentence>
					<sentence id="S5.133">Similarly, loss of neur activity had <xcope id="X5.133.2"><cue type="negation" ref="X5.133.2">no</cue> detectable effect on leg segmentation</xcope> <xcope id="X5.133.1">(data <cue type="negation" ref="X5.133.1">not</cue> shown)</xcope> and vein determination [45], two processes shown here to depend on D-mib gene activity.</sentence>
					<sentence id="S5.134">We therefore conclude that D-mib and neur have distinct and complementary functions in Drosophila.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.135">D-mib Co-Localizes with Dl and Ser at the Apical Cortex</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.136">We next studied the subcellular localization of D-mib (Figure 3).</sentence>
					<sentence id="S5.137">Anti-D-mib antibodies were generated that specifically detected D-mib on Western blots (see Figure 1C) and on fixed tissues (Figure 3F–F?).</sentence>
					<sentence id="S5.138">Using these antibodies, we found that D-mib was detected in all imaginal disc cells (Figure 3A and 3B).</sentence>
					<sentence id="S5.139">We then examined D-mib subcellular distribution in epithelial cells located along the edge of the wing discs because cross-sectional imaging affords better resolution along the apical-basal axis.</sentence>
					<sentence id="S5.140">D-mib co-localized with Ser, Dl, and N at the apical cortex (Figure 3B–3D???).</sentence>
					<sentence id="S5.141">Dl and Ser were also detected in large intracellular vesicles that <xcope id="X5.141.1"><cue type="speculation" ref="X5.141.1">probably</cue> correspond to multivesicular bodies in that they also stained for hepatocyte growth factor-regulated tyrosine kinase substrate</xcope> [46] (Figure 3B–3C'''???; data not shown).</sentence>
					<sentence id="S5.142">The intracellular dots seen with the anti-D-mib antibodies were distinct from the Dl- and Ser-positive dots and <xcope id="X5.142.1"><cue type="speculation" ref="X5.142.1">appeared</cue> to result from background staining</xcope> (data not shown).</sentence>
					<sentence id="S5.143">The reduced cytoplasmic staining seen in D-mib mutant cells (Figure 3F–3F??)'' <xcope id="X5.143.1"><cue type="speculation" ref="X5.143.1">suggests</cue> that D-mib is also present in the cytoplasm</xcope>.</sentence>
					<sentence id="S5.144">A similar localization at the apical cortex and in the cytoplasm was seen for a functional yellow fluorescent protein (YFP)::D-mib fusion protein (see Figure 6 below).</sentence>
					<sentence id="S5.145">These localization data <xcope id="X5.145.2"><cue type="speculation" ref="X5.145.2">suggest</cue> that D-mib <xcope id="X5.145.1"><cue type="speculation" ref="X5.145.1">may</cue> act at the apical cortex to regulate the activity of Dl and/or Ser</xcope></xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S5.146">D-mib Co-Localizes with Dl and Ser at the Apical Cell Cortex</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.147">(A and A?) D-mib (green) is detected in all cells of the wing imaginal disc.</sentence>
					<sentence id="S5.148">In (A), Ser is in red and Discs-large (Dlg) is in blue.</sentence>
					<sentence id="S5.149">(B–D???) D-mib (green in B, B?, C, C?, D, and D?) co-localized with Ser (red in [B and B??]), Dl (red in [C and C??]), N (red in [D and D??]), and E-Cadherin (E-Cad; blue in [D and D???]) and was found apical to Discs-large (Dlg; blue in [B, B???, C, and C???]) in notum cells located at the edges of the wing discs.</sentence>
					<sentence id="S5.150">(E–E??) D-mib (green in [E and E?]) co-localized with Dl (red in [E and E??]) at the apical cortex of wing pouch cells.</sentence>
					<sentence id="S5.151"><xcope id="X5.151.1">(F–F??) D-mib staining at the apical cortex (blue in [F and F?]) was <cue type="negation" ref="X5.151.1">not</cue> detected in D-mib2 mutant clone (marked by loss of nuclear GFP staining</xcope>; green in [F]).</sentence>
					<sentence id="S5.152">Loss of D-mib activity has <xcope id="X5.152.1"><cue type="negation" ref="X5.152.1">no</cue> detectable effect on the apical accumulation of Dl</xcope> (red in [F and F??]).</sentence>
					<sentence id="S5.153">Bar is 50 ?m for (A and A?) and 10 ?m for (B–F??).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S5.154">D-mib Is Required in Dorsal Cells for Margin Expression of Cut</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.155">Large dorsal clones of D-mib2 mutant cells (marked by the loss of nuclear GFP, in green) resulted in a complete loss of Cut (red) expression (A and B).</sentence>
					<sentence id="S5.156">This <xcope id="X5.156.1"><cue type="speculation" ref="X5.156.1">indicates that</cue> D-mib is required for Ser signaling by dorsal cells</xcope>.</sentence>
					<sentence id="S5.157">In contrast, ventral clones did <xcope id="X5.157.2"><cue type="negation" ref="X5.157.2">not</cue> prevent the expression of Cut (C and D)</xcope>, <xcope id="X5.157.1"><cue type="speculation" ref="X5.157.1">implying</cue> that D-mib is not strictly required for Dl signaling</xcope>.</sentence>
					<sentence id="S5.158">Note that mutant ventral cells abutting wild-type dorsal cells expressed Cut (arrow in [D]), <xcope id="X5.158.2"><cue type="speculation" ref="X5.158.2">indicating that</cue> <xcope id="X5.158.1">D-mib is <cue type="negation" ref="X5.158.1">not</cue> required for N signal transduction</xcope></xcope>.</sentence>
					<sentence id="S5.159">Low-magnification views of the wing portion of the discs are shown in (A) and (C).</sentence>
					<sentence id="S5.160">(B) and (D) show high-magnification views of the areas boxed in (A) and (C), respectively.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.161">D-mib Regulates the Cell-Surface Level of Ser</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.162">We next examined the <xcope id="X5.162.1"><cue type="speculation" ref="X5.162.1">potential</cue> role of D-mib in regulating Dl and Ser distribution in wing imaginal discs</xcope>.</sentence>
					<sentence id="S5.163">We focused our analysis on the notum region since D-mib mutant discs have <xcope id="X5.163.1"><cue type="negation" ref="X5.163.1">no</cue> wing pouch</xcope> (Figure 4).</sentence>
					<sentence id="S5.164">Dl and Ser co-localized both at the apical cortex and in large intracellular vesicles in wild-type cells (Figure 4A–4C?).</sentence>
					<sentence id="S5.165">The complete loss of D-mib activity in D-mib1 mutant discs did <xcope id="X5.165.1"><cue type="negation" ref="X5.165.1">not</cue> detectably change the subcellular localization of Dl</xcope> (Figure 4C, 4C?, 4F, and 4F?).</sentence>
					<sentence id="S5.166">In contrast, the accumulation of Ser at the apical cortex was strongly increased (Figure 4E) and Ser accumulation in Dl-positive vesicles was dramatically reduced (Figure 4E?) in D-mib1 mutant discs.</sentence>
					<sentence id="S5.167">Similar results were also obtained in D-mib2 mutant clones, which showed strongly elevated levels of cortical Ser (Figure 4H) whereas <xcope id="X5.167.1">the amount of Dl at the apical cortex was <cue type="negation" ref="X5.167.1">not</cue> detectably modified</xcope> (see Figures 3F–3F?? and 4J).</sentence>
					<sentence id="S5.168">Of note, loss of D-mib2 activity in clones did <xcope id="X5.168.1"><cue type="negation" ref="X5.168.1">not</cue> block the accumulation of Ser into intracellular dots</xcope> (Figure 4H?).</sentence>
					<sentence id="S5.169">Thus, trafficking of Ser towards this intracellular compartment is, at least in part, D-mib-independent.</sentence>
					<sentence id="S5.170">We therefore conclude that the D-mib gene is required to regulate the level of Ser at the apical cortex of wing disc cells.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S5.171">D-mib Is Required to Down-Regulate Ser at the Apical Cortex</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.172">(A–F?) Distribution of Dl (green) and Ser (red) in the notum region of wild-type (A–C?) and D-mib1 mutant (D–F?) wing imaginal discs.</sentence>
					<sentence id="S5.173">The boxed areas in (A) and (D) are shown at higher magnification in (B–F?).</sentence>
					<sentence id="S5.174">The specific loss of Ser accumulation into intracellular vesicles (compare [E?] with [B?]) correlated with the elevated levels of Ser seen at the apical cortex of D-mib mutant cells (compare [E] with [B]).</sentence>
					<sentence id="S5.175">(G–J?) Ser (red in [H and H?]) accumulated at the apical cortex (H) as well as in intracellular dots (H?) in D-mib2 mutant cells (marked by the loss of nuclear GFP; green in [G]).</sentence>
					<sentence id="S5.176">Cut is shown in blue (G).</sentence>
					<sentence id="S5.177"><xcope id="X5.177.1">The distribution of Dl (red in [J and J?]) was <cue type="negation" ref="X5.177.1">not</cue> affected by the loss of D-mib activity</xcope>.</sentence>
					<sentence id="S5.178">Low-magnification views of the wing portion of the discs are shown in (G) and (I).</sentence>
					<sentence id="S5.179">(H and H?) and (J and J?) show high magnification views of the areas boxed in (G) and (I), respectively.</sentence>
					<sentence id="S5.180">Clone boundaries are outlined in (H and H?) and (J and J?).</sentence>
					<sentence id="S5.181">Bar is 40 ?m for (A, D, G), 5 ?m for (B–C? and E–F?), and 10 ?m for (H–J?).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.182">D-mib Is Required for Ser Endocytosis</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.183"><xcope id="X5.183.1">Ubiquitin-mediated endocytosis is <cue type="speculation" ref="X5.183.1">thought</cue> to depend on monoubiquitination</xcope>.</sentence>
					<sentence id="S5.184">Thus, by analogy with the function of Mib in D. rerio [18,28], we <xcope id="X5.184.2"><cue type="speculation" ref="X5.184.2">suggest</cue> that D-mib <xcope id="X5.184.1"><cue type="speculation" ref="X5.184.1">may</cue> directly monoubiquitinate Ser</xcope></xcope>.</sentence>
					<sentence id="S5.185">Consistent with this hypothesis, we show in a companion paper that D-mib binds Ser (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).</sentence>
					<sentence id="S5.186">Moreover, a mutation in the C-terminal catalytic RING domain of D-mib abolished its ability to internalize Ser in transfected S2 cells (R. L. B. and F. S., unpublished data) <xcope id="X5.186.1"><cue type="speculation" ref="X5.186.1">implying</cue> that the E3 ubiquitin ligase activity of D-mib is required for Ser internalization</xcope>.</sentence>
					<sentence id="S5.187">Biochemical analysis of the ubiquitination events regulated by D-mib will be needed to further define the mechanism by which D-mib regulates the endocytosis of Ser in vivo.</sentence>
					<sentence id="S5.188">To test <xcope id="X5.188.1"><cue type="speculation" ref="X5.188.1">whether</cue> this specific increase in the level of Ser at the apical cortex resulted from reduced Ser endocytosis in D-mib mutant cells</xcope>, we followed the endocytosis of Ser in living imaginal discs using an antibody uptake assay.</sentence>
					<sentence id="S5.189">Briefly, dissected wing discs were cultured for 15 min in the presence of antibodies that recognize the extracellular part of Ser or Dl, then washed, cultured for another 45 min in medium <xcope id="X5.189.1"><cue type="negation" ref="X5.189.1">without</cue> antibodies</xcope>, and then fixed.</sentence>
					<sentence id="S5.190">The uptake of anti-Ser and anti-Dl antibodies was then assessed using secondary antibodies.</sentence>
					<sentence id="S5.191">The results are shown in Figure 5.</sentence>
					<sentence id="S5.192">Using this assay, we found that anti-Ser-and anti-Dl antibodies were internalized in wild-type epithelial cells (Figure 5A–5C??).</sentence>
					<sentence id="S5.193">The complete loss of D-mib activity in D-mib1 wing discs did not significantly change the internalization of anti-Dl antibodies (Figure 5D''??, 5E''??, and 5F??)'', <xcope id="X5.193.2"><cue type="speculation" ref="X5.193.2">indicating that</cue> <xcope id="X5.193.1">D-mib is <cue type="negation" ref="X5.193.1">not</cue> required for Dl endocytosis in this tissue</xcope></xcope>.</sentence>
					<sentence id="S5.194">However, the loss of D-mib activity strongly inhibited the endocytosis of anti-Ser antibodies (Figure 5E?).</sentence>
					<sentence id="S5.195">Moreover, high levels of anti-Ser antibodies were seen at the apical surface (Figure 5D? and 5F?), confirming that D-mib mutant cells accumulate high levels of Ser at their surface.</sentence>
					<sentence id="S5.196">We therefore conclude that D-mib is specifically required for the endocytosis of Ser in wing discs.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S5.197">D-mib Is Required for Ser Endocytosis</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.198">Localization of the anti-Ser (red) and anti-Dl (green) antibodies that have been internalized by wild-type (A–C??) and D-mib1 mutant (D–F??) cells in the notum region of wing discs.</sentence>
					<sentence id="S5.199">(A–A??) and (D–D??) show apical sections and (B–B??) and (E–E??) show basal sections.</sentence>
					<sentence id="S5.200">(C–C??) and (F–F??) show confocal z-sections.</sentence>
					<sentence id="S5.201">The z-section axes are shown with a double-headed arrow in (A) and (D).</sentence>
					<sentence id="S5.202">Internalized anti-Ser and anti-Dl antibodies co-localized in wild-type cells.</sentence>
					<sentence id="S5.203">In contrast, high levels of anti-Ser antibodies were detected at the cell surface of D-mib mutant epithelial cells whereas anti-Dl antibodies were efficiently internalized.</sentence>
					<sentence id="S5.204">Bar is 10 ?m for all panels.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.205">D-mib Regulates Ser Signaling</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.206">The regulation of Ser endocytosis by D-mib <xcope id="X5.206.2"><cue type="speculation" ref="X5.206.2">suggests</cue> that D-mib <xcope id="X5.206.1"><cue type="speculation" ref="X5.206.1">may</cue> regulate Ser signaling</xcope></xcope>.</sentence>
					<sentence id="S5.207">Ser expression is restricted to dorsal cells in second instar wing imaginal discs [7,10,44,47,48].</sentence>
					<sentence id="S5.208">Ser in dorsal cells signals across the D-V boundary to activate N in ventral cells [8,9].</sentence>
					<sentence id="S5.209">If D-mib is required for Ser signaling during wing development, then loss of D-mib activity in dorsal cells <xcope id="X5.209.1"><cue type="speculation" ref="X5.209.1">should</cue> affect the specification of the wing margin in a non-autonomous manner</xcope>.</sentence>
					<sentence id="S5.210">Loss of D-mib activity in large dorsal clones of D-mib2 mutant cells resulted in a loss of Cut expression at the D-V interface (Figure 6A and 6B).</sentence>
					<sentence id="S5.211">The <xcope id="X5.211.2"><cue type="negation" ref="X5.211.2">lack</cue> of Cut expression in wild-type ventral cells abutting the D-V boundary</xcope> <xcope id="X5.211.1"><cue type="speculation" ref="X5.211.1">indicates that</cue> D-mib is required for Ser signaling by dorsal cells and acts in a non-autonomous manner to activate N in ventral cells</xcope>.</sentence>
					<sentence id="S5.212">Conversely, loss of D-mib activity in large ventral clones (Figure 6C and 6D) did <xcope id="X5.212.2"><cue type="negation" ref="X5.212.2">not</cue> disrupt margin specification</xcope>, <xcope id="X5.212.1"><cue type="speculation" ref="X5.212.1">indicating that</cue> D-mib is not strictly required for Dl signaling by ventral cells</xcope>.</sentence>
					<sentence id="S5.213">However, a narrowing of the Cut-positive margin was observed (Figure 6D), <xcope id="X5.213.1"><cue type="speculation" ref="X5.213.1">suggesting</cue> that D-mib contributes to regulating the level of Dl signaling</xcope>.</sentence>
					<sentence id="S5.214">Of note, ventral D-mib mutant cells expressed Cut, <xcope id="X5.214.2"><cue type="speculation" ref="X5.214.2">implying</cue> that <xcope id="X5.214.1">D-mib is <cue type="negation" ref="X5.214.1">not</cue> required for N signal transduction</xcope></xcope>.</sentence>
					<sentence id="S5.215">We next tested <xcope id="X5.215.1"><cue type="speculation" ref="X5.215.1">whether</cue> expression of D-mib in dorsal cells is sufficient to rescue the D-mib wing phenotype</xcope>.</sentence>
					<sentence id="S5.216">D-mib was expressed in dorsal cells of D-mib2/D-mib3 mutant discs using Ser-GAL4.</sentence>
					<sentence id="S5.217">Similarly to the expression of the Ser gene, Ser-GAL4 expression is restricted to dorsal cells in second/early third instar larvae and is weakly expressed in ventral cells in mid/late third instar larvae, i.e., after margin cell specification [49,50].</sentence>
					<sentence id="S5.218">Expression of D-mib in dorsal cells was sufficient to rescue growth of the wing pouch and of the expression of Cut in margin cells in D-mib mutant discs (Figure 7A).</sentence>
					<sentence id="S5.219">This result confirmed that D-mib regulates Ser signaling by dorsal cells.</sentence>
					<sentence id="S5.220">A similar rescue was observed with a YFP::D-mib protein (Figure 7B–7B?)â€ť, <xcope id="X5.220.1"><cue type="speculation" ref="X5.220.1">indicating that</cue> YFP::D-mib is functional</xcope>.</sentence>
					<sentence id="S5.221">YFP::D-mib localized at the apical cortex and in the cytoplasm (Figure 7C–7 D???), as seen for endogenous D-mib (see Figure 3).</sentence>
					<sentence id="S5.222">YFP::D-mib co-localized with Dl and Ser at the apical cortex of cells expressing low levels of YFP::D-mib.</sentence>
					<sentence id="S5.223">However, cells expressing high levels of YFP::D-mib showed a strong reduction in the level of both Dl and Ser at the cortex (Figure 7C–7C???)''', further <xcope id="X5.223.1"><cue type="speculation" ref="X5.223.1">indicating that</cue> D-mib down-regulates the levels of both Ser and Dl at the apical cortex</xcope> (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S5.224">Expression of D-mib in Dorsal Cells Is Sufficient to Rescue the D-mib Mutant Phenotype</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.225">(A) Expression of D-mib (green) in dorsal cells, using Ser-GAL4, rescued the growth of the wing pouch and margin Cut (red) expression in D-mib2/D-mib3 mutant discs.</sentence>
					<sentence id="S5.226">(B–D???) Ser-GAL4-driven expression of YFP::D-mib (green) rescued the D-mib2/D-mib3 phenotype and strongly reduced the level of Dl (blue in [B, B?, C, C??, D, and D??]) and Ser (red in [B, B??, C, C???, D, and D???]) in dorsal cells.</sentence>
					<sentence id="S5.227">(C–D???) are high-magnification views (apical [C–C???] and basal [D–D???]) of the disc shown in (B–B??).</sentence>
					<sentence id="S5.228">YFP::D-mib co-localized with Dl and Ser at the apical cortex in cells expressing only low levels of YFP::D-mib.</sentence>
					<sentence id="S5.229">Bar is 50 ?m for (A–B??) and 10 ?m for (C–D???).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.230">D-mib Acts Downstream of Ser and Upstream of Activated N</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.231">The functional assay was then used to genetically position the requirement for the D-mib gene activity relative to Ser and N (Figure 8).</sentence>
					<sentence id="S5.232">Expression of an activated version of N, Ncdc10 [51], led to the activation of Cut and promoted growth in dorsal cells of D-mib2/D-mib3 mutant discs (Figure 8C).</sentence>
					<sentence id="S5.233">This <xcope id="X5.233.1"><cue type="speculation" ref="X5.233.1">indicates that</cue> D-mib acts at a step upstream of N activation</xcope>.</sentence>
					<sentence id="S5.234">By contrast, elevated levels of Ser expression <xcope id="X5.234.1"><cue type="negation" ref="X5.234.1">failed</cue> to restore Cut expression and growth of the wing pouch in D-mib2/D-mib3 mutant larvae</xcope> (Figure 8B).</sentence>
					<sentence id="S5.235">This confirms that Ser signaling requires the activity of the D-mib gene, i.e., that D-mib acts downstream of Ser.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S5.236">Expression of Neur in Dorsal Cells Is Sufficient to Rescue the D-mib Mutant Phenotype</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.237">D-mib2/D-mib3 mutant discs expressing GFP (A) (GFP staining <xcope id="X5.237.1"><cue type="negation" ref="X5.237.1">not</cue> shown)</xcope>, Ser (B), Ncdc10 (C), or Neur (D) under the control of Ser-GAL4 were stained for Cut (red).</sentence>
					<sentence id="S5.238">Expression of Ser in dorsal cells did <xcope id="X5.238.1"><cue type="negation" ref="X5.238.1">not</cue> rescue the D-mib2/D-mib3 wing pouch mutant phenotype</xcope> (compare [B] with [A]), consistent with D-mib being required for Ser signaling.</sentence>
					<sentence id="S5.239">By contrast, expression of Ncdc10, an activated version of N, led to the deregulated growth of the dorsal compartment and the expression of Cut in most dorsal cells (C), <xcope id="X5.239.1"><cue type="speculation" ref="X5.239.1">indicating that</cue> activated N acts downstream of D-mib</xcope>.</sentence>
					<sentence id="S5.240">Expression of Neur in dorsal cells was sufficient to compensate for the loss of D-mib activity (D).</sentence>
					<sentence id="S5.241">Bar is 40 ?m for all panels.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.242">neur and D-mib Functions Partially Overlap</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.243">The different requirements for neur and D-mib gene activity <xcope id="X5.243.2"><cue type="speculation" ref="X5.243.2">may</cue> <xcope id="X5.243.1"><cue type="speculation" ref="X5.243.1">suggest</cue> that Neur and D-mib have distinct molecular activities</xcope></xcope>.</sentence>
					<sentence id="S5.244">Alternatively, this difference <xcope id="X5.244.1"><cue type="speculation" ref="X5.244.1">may</cue> reflect a difference in gene expression</xcope>.</sentence>
					<sentence id="S5.245">Consistent with the latter hypothesis, <xcope id="X5.245.3">the neur gene is <cue type="negation" ref="X5.245.3">not</cue> expressed in wing pouch and wing margin cells, where <xcope id="X5.245.2">it is <cue type="negation" ref="X5.245.2">not</cue> required</xcope></xcope>, and <xcope id="X5.245.1"><cue type="speculation" ref="X5.245.1">appears</cue> to be expressed only in sensory cells [52], where it is required</xcope>.</sentence>
					<sentence id="S5.246">By contrast, <xcope id="X5.246.1">D-mib <cue type="speculation" ref="X5.246.1">appears</cue> to be uniformly expressed in imaginal discs</xcope>.</sentence>
					<sentence id="S5.247">To test this hypothesis, we examined <xcope id="X5.247.1"><cue type="speculation" ref="X5.247.1">whether</cue> the forced ubiquitous expression of the neur gene can suppress the D-mib loss-of-function phenotype</xcope>.</sentence>
					<sentence id="S5.248">Expression of Neur, using actin-GAL4, restored growth of the wing pouch and formation of the wing margin <xcope id="X5.248.1">(data <cue type="negation" ref="X5.248.1">not</cue> shown)</xcope>.</sentence>
					<sentence id="S5.249">Moreover, expression of Neur in dorsal cells, using Ser-GAL4, was sufficient to rescue growth of the wing pouch as well as the expression of Cut in margin cells in D-mib mutant discs (Figure 8D).</sentence>
					<sentence id="S5.250">We conclude that ectopic expression of Neur compensates for the loss of D-mib activity.</sentence>
					<sentence id="S5.251">In a converse experiment, we found that the neur-driven expression of D-mib, using neurPGAL4, did <xcope id="X5.251.1"><cue type="negation" ref="X5.251.1">not</cue> rescue the cuticular neurogenic phenotype of neurPGAL4/neur1F65 embryos</xcope>.</sentence>
					<sentence id="S5.252">Three UAS-D-mib transgenic lines were tested, and <xcope id="X5.252.4"><cue type="negation" ref="X5.252.4">none</cue> showed detectable rescue</xcope> whereas the two UAS-neur lines used as positive controls <xcope id="X5.252.2"><cue type="speculation" ref="X5.252.2">either</cue> fully <cue type="speculation" ref="X5.252.2">or</cue> partially</xcope> rescued the cuticular neurogenic phenotype of neurPGAL4/neur1F65 embryos <xcope id="X5.252.1">(data <cue type="negation" ref="X5.252.1">not</cue> shown</xcope>; UAS-D-mib+/+, neurPGAL4+/+ embryos developed normally).</sentence>
					<sentence id="S5.253">This <xcope id="X5.253.2"><cue type="speculation" ref="X5.253.2">indicates that</cue> <xcope id="X5.253.1">a key function of Neur in the embryo <cue type="negation" ref="X5.253.1">cannot</cue> be provided by D-mib</xcope></xcope>.</sentence>
					<sentence id="S5.254">We therefore <xcope id="X5.254.1"><cue type="speculation" ref="X5.254.1">suggest</cue> that Neur and D-mib functions overlap but are not strictly identical</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S5.255">Discussion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.256">Many recent studies have revealed that endocytosis plays multiple roles in the regulation of N signaling (reviewed in [2]; see also [53,54]).</sentence>
					<sentence id="S5.257">Here, we show that the conserved E3 ubiquitin ligases Neur and D-mib have similar molecular activities in the regulation of Dl and Ser endocytosis but distinct developmental functions in Drosophila.</sentence>
					<sentence id="S5.258">Our analysis first establishes that D-mib regulates Ser signaling during wing development.</sentence>
					<sentence id="S5.259">First, clonal analysis revealed that the activity of the D-mib gene is specifically required in dorsal cells for the expression of Cut at the wing margin.</sentence>
					<sentence id="S5.260">Second, expression of D-mib in the dorsal Ser-signaling cells was sufficient to rescue the D-mib mutant wing phenotype.</sentence>
					<sentence id="S5.261">Third, results from an in vivo antibody uptake assay <xcope id="X5.261.2"><cue type="speculation" ref="X5.261.2">indicated that</cue> the endocytosis of Ser (but <xcope id="X5.261.1"><cue type="negation" ref="X5.261.1">not</cue> of Dl)</xcope> was strongly inhibited in D-mib mutant cells</xcope>.</sentence>
					<sentence id="S5.262">This inhibition correlated with the strong accumulation of Ser (but <xcope id="X5.262.1"><cue type="negation" ref="X5.262.1">not</cue> Dl)</xcope> at the apical cortex of D-mib mutant cells.</sentence>
					<sentence id="S5.263">Thus, an essential function of D-mib in the wing is to regulate the endocytosis of Ser in dorsal cells to non-autonomously promote the activation of N along the D-V boundary.</sentence>
					<sentence id="S5.264">By analogy, the defective growth of the eye tissue <xcope id="X5.264.2"><cue type="speculation" ref="X5.264.2">may</cue> similarly result from the <xcope id="X5.264.1"><cue type="negation" ref="X5.264.1">lack</cue> of Ser signaling and of N activation along the D-V boundary</xcope></xcope> [55].</sentence>
					<sentence id="S5.265">Because D-mib co-localizes with Ser at the apical cortex of wing disc cells, acts in a RING-finger-dependent manner to regulate Ser endocytosis in S2 cells (R. L. B. and F. S., unpublished results), and physically associates with Ser in co-immunoprecipitation experiments (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data), D-mib <xcope id="X5.265.1"><cue type="speculation" ref="X5.265.1">may</cue> ubiquitinate Ser and directly regulate its endocytosis</xcope>.</sentence>
					<sentence id="S5.266">Our analysis further <xcope id="X5.266.1"><cue type="speculation" ref="X5.266.1">suggests</cue> that endocytosis of Ser is required for Ser signaling</xcope>.</sentence>
					<sentence id="S5.267">This conclusion is consistent with observations made earlier showing that secreted versions of Ser <xcope id="X5.267.1"><cue type="negation" ref="X5.267.1">cannot</cue> activate N</xcope> but instead antagonize Ser signaling [56,57].</sentence>
					<sentence id="S5.268">Thus, <xcope id="X5.268.1">endocytosis of both N ligands <cue type="speculation" ref="X5.268.1">appears</cue> to be strictly required for N activation in Drosophila</xcope>.</sentence>
					<sentence id="S5.269">Different models have been proposed to explain how endocytosis of the ligand, which removes the ligand from the cell surface, results in N receptor activation (discussed in [17,20,21,30]).</sentence>
					<sentence id="S5.270">Interestingly, <xcope id="X5.270.1">the strong requirement for Dl and Ser endocytosis seen in Drosophila is <cue type="negation" ref="X5.270.1">not</cue> conserved in Caenorhabditis elegans, in which secreted ligands have been shown to be functional</xcope> [58,59].</sentence>
					<sentence id="S5.271">Noticeably, there is <xcope id="X5.271.2"><cue type="negation" ref="X5.271.2">no</cue> C. elegans Mib homolog</xcope>, and <xcope id="X5.271.1">the function of C. elegans neur (F10D7.5) is <cue type="speculation" ref="X5.271.1">not known</cue></xcope>.</sentence>
					<sentence id="S5.272">We <xcope id="X5.272.2"><cue type="speculation" ref="X5.272.2">speculate</cue> that endocytosis of the ligands <xcope id="X5.272.1"><cue type="speculation" ref="X5.272.1">may</cue> have evolved as a means to ensure tight spatial regulation of the activation of N</xcope></xcope>.</sentence>
					<sentence id="S5.273">Our analysis also establishes that the activity of the D-mib gene is required for a subset of N signaling events that are distinct from those that require the activity of the neur gene.</sentence>
					<sentence id="S5.274">We have shown that the D-mib gene regulates wing margin formation, leg segmentation, and vein formation, whereas <xcope id="X5.274.1"><cue type="negation" ref="X5.274.1">none</cue> of these three processes depend on neur gene activity</xcope> ([45,60]; this study).</sentence>
					<sentence id="S5.275">Conversely, the activity of the neur gene is essential for binary cell-fate decisions in the bristle lineage [22] that do <xcope id="X5.275.2"><cue type="negation" ref="X5.275.2">not</cue> require the activity of the D-mib gene</xcope> <xcope id="X5.275.1"><cue type="negation" ref="X5.275.1">(no</cue> bristle defects were seen in D-mib mutant flies)</xcope>.</sentence>
					<sentence id="S5.276">The activity of the neur gene is also required for lateral inhibition during neurogenesis in embryos and pupae [4,45,61].</sentence>
					<sentence id="S5.277">This process is largely independent of D-mib gene activity since the complete loss of D-mib function only resulted in a mild neurogenic phenotype in the notum.</sentence>
					<sentence id="S5.278">These data thus <xcope id="X5.278.1"><cue type="speculation" ref="X5.278.1">indicate that</cue> the neur and D-mib genes have largely distinct and complementary functions in Drosophila</xcope>.</sentence>
					<sentence id="S5.279"><xcope id="X5.279.1"><cue type="speculation" ref="X5.279.1">Whether</cue> a similar functional relationship between Neur and D-mib exists in vertebrates</xcope> awaits the study of the D. rerio neur genes and/or of the murine Mib and Neur genes.</sentence>
					<sentence id="S5.280"><xcope id="X5.280.1">The functional differences observed between D-mib and neur <cue type="negation" ref="X5.280.1">cannot</cue> be simply explained by obvious differences in molecular activity and/or substrate specificity</xcope>.</sentence>
					<sentence id="S5.281">First, both Neur and D-mib physically interact with Dl ([20]; E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data) and promote the down-regulation of Dl from the apical membrane when overexpressed (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).</sentence>
					<sentence id="S5.282">Furthermore, <xcope id="X5.282.1">Dl signaling <cue type="speculation" ref="X5.282.1">appears</cue> to require the activity of either Neur or D-mib, depending on the developmental contexts</xcope>.</sentence>
					<sentence id="S5.283">We have shown here that specific aspects of the D-mib phenotype in legs and in the notum <xcope id="X5.283.3"><cue type="negation" ref="X5.283.3">cannot</cue> simply result from loss of Ser signaling</xcope> and are <xcope id="X5.283.2"><cue type="speculation" ref="X5.283.2">consistent with</cue> reduced Dl signaling</xcope>, <xcope id="X5.283.1"><cue type="speculation" ref="X5.283.1">suggesting</cue> that D-mib regulates Dl signaling</xcope>.</sentence>
					<sentence id="S5.284">Consistent with this interpretation, overexpression studies <xcope id="X5.284.1"><cue type="speculation" ref="X5.284.1">indicate that</cue> D-mib up-regulates the signaling activity of Dl, whereas a dominant-negative form of D-mib inhibits it</xcope> (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).</sentence>
					<sentence id="S5.285">We note, however, that <xcope id="X5.285.1"><cue type="negation" ref="X5.285.1">no</cue> clear defects in Dl subcellular localization and/or trafficking were observed in D-mib mutant cells</xcope>.</sentence>
					<sentence id="S5.286">It is <xcope id="X5.286.2"><cue type="speculation" ref="X5.286.2">conceivable</cue> that the contribution of D-mib to the endocytosis of Dl is masked by the activity of D-mib-independent processes that <xcope id="X5.286.1"><cue type="speculation" ref="X5.286.1">may, or may not</cue>, be linked to Dl signaling</xcope></xcope>.</sentence>
					<sentence id="S5.287">We have also shown that, reciprocally, Neur and D-mib <xcope id="X5.287.1"><cue type="speculation" ref="X5.287.1">may</cue> similarly regulate Ser</xcope>.</sentence>
					<sentence id="S5.288">Neur and D-mib were shown to similarly promote down-regulation of Ser from the cell surface when overexpressed (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data).</sentence>
					<sentence id="S5.289">Moreover, D-mib binds Ser (E. C. Lai, F. Roegiers, X. Qin, R. Le Borgne, F. Schweisguth, et al., unpublished data) and regulates Ser signaling (this study).</sentence>
					<sentence id="S5.290"><xcope id="X5.290.1"><cue type="speculation" ref="X5.290.1">Whether</cue> endogenous Neur binds and activates Ser</xcope> remains to be tested.</sentence>
					<sentence id="S5.291">However, the ability of Neur to rescue the D-mib mutant wing phenotype when expressed in dorsal cells strongly <xcope id="X5.291.2"><cue type="speculation" ref="X5.291.2">indicates that</cue> <xcope id="X5.291.1">Neur <cue type="speculation" ref="X5.291.1">can</cue> promote Ser signaling</xcope></xcope>.</sentence>
					<sentence id="S5.292">Together, these data <xcope id="X5.292.1"><cue type="speculation" ref="X5.292.1">indicate that</cue> Neur and D-mib have similar molecular activities</xcope>.</sentence>
					<sentence id="S5.293">D-mib and Neur <xcope id="X5.293.1"><cue type="speculation" ref="X5.293.1">may</cue> have identical molecular activities but distinct expression patterns, hence distinct functions at the level of the organism</xcope>.</sentence>
					<sentence id="S5.294">Consistent with this possibility, D-mib is uniformly distributed in imaginal discs, whereas Neur is specifically detected in sensory cells [52].</sentence>
					<sentence id="S5.295">Importantly, the rescue of the D-mib mutant phenotype by ectopic expression of Neur strongly <xcope id="X5.295.1"><cue type="speculation" ref="X5.295.1">supports</cue> this interpretation</xcope>.</sentence>
					<sentence id="S5.296">This result further <xcope id="X5.296.2"><cue type="speculation" ref="X5.296.2">suggests</cue> that Neur <xcope id="X5.296.1"><cue type="speculation" ref="X5.296.1">can</cue> regulate Ser signaling</xcope></xcope>.</sentence>
					<sentence id="S5.297">Consistent with this idea, overexpression of Neur in imaginal discs resulted in a strong reduction of Ser accumulation at the apical cortex (data <xcope id="X5.297.1"><cue type="negation" ref="X5.297.1">not</cue> shown)</xcope>.</sentence>
					<sentence id="S5.298">Thus, despite their obvious structural differences, <xcope id="X5.298.1">Neur and D-mib <cue type="speculation" ref="X5.298.1">appear</cue> to act similarly to promote the endocytosis of Dl and Ser</xcope>.</sentence>
					<sentence id="S5.299">Nevertheless, our observation that D-mib <xcope id="X5.299.3"><cue type="negation" ref="X5.299.3">could not</cue> compensate for the loss of neur activity in the embryo</xcope> <xcope id="X5.299.2"><cue type="speculation" ref="X5.299.2">indicates that</cue> D-mib and Neur have overlapping <xcope id="X5.299.1"><cue type="negation" ref="X5.299.1">rather than</cue> identical molecular activities</xcope></xcope>.</sentence>
					<sentence id="S5.300">In conclusion, <xcope id="X5.300.1">Neur and D-mib <cue type="speculation" ref="X5.300.1">appear</cue> to have similar molecular activities in the regulation of Dl and Ser endocytosis but distinct developmental functions in Drosophila</xcope>.</sentence>
					<sentence id="S5.301"><xcope id="X5.301.2">The conservation from Drosophila to mammals of these two structurally distinct but functionally similar E3 ubiquitin ligases is <cue type="speculation" ref="X5.301.2">likely</cue> to reflect a combination of evolutionary advantages associated with: (i) specialized expression pattern, as evidenced by the cell-specific expression of the neur gene in sensory organ precursor cells [52]; (ii) specialized function, as <xcope id="X5.301.1"><cue type="speculation" ref="X5.301.1">suggested</cue> by the role of murine MIB in TNF?? signaling</xcope> [32]; (iii) regulation of protein stability, localization, and/or activity</xcope>.</sentence>
					<sentence id="S5.302">For instance, Neur, but <xcope id="X5.302.1"><cue type="negation" ref="X5.302.1">not</cue> D-mib</xcope>, localizes asymmetrically during asymmetric sensory organ precursor cell divisions [22].</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S5.303">Materials and Methods</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.304">Flies.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.305">The D-mib1 mutation corresponds to the EY97600 P-element insertion generated by the Gene Disruption Project (http://flypush.imgen.bcm.tmc.edu/pscreen/).</sentence>
					<sentence id="S5.306">The D-mib2 allele was selected as w- D-mib mutant derivative by imprecise excision of the EY97600 P-element.</sentence>
					<sentence id="S5.307">The precise breakpoints of the D-mib2 deletion were determined by sequencing a PCR fragment amplified from genomic DNA prepared from D-mib2 homozygous larvae.</sentence>
					<sentence id="S5.308">The l(3)72CdaJ12 and l(3)72CdaI5 alleles originally isolated by [35] <xcope id="X5.308.1"><cue type="negation" ref="X5.308.1">failed</cue> to complement the D-mib1 and D-mib2 mutations</xcope> and were renamed D-mib3 and D-mib4.</sentence>
					<sentence id="S5.309"><xcope id="X5.309.1">The D-mib1, D-mib2, and D-mib3 alleles <cue type="speculation" ref="X5.309.1">appear</cue> to be genetically null alleles</xcope> since the phenotypes of D-mib1/D-mib1 and D-mib1/D-mib3 mutant pupae are indistinguishable from the ones seen in D-mib1/D-mib2 and D-mib2/D-mib3 pupae.</sentence>
					<sentence id="S5.310">Sequence analysis of the D-mib3 and D-mib4 alleles was carried on PCR products prepared from genomic DNA prepared from D-mib3/D-mib2 and D-mib4/D-mib2 mutant pupae.</sentence>
					<sentence id="S5.311">Genomic DNA from l(3)72Cda/D-mib2 mutant pupae was used as control for polymorphism.</sentence>
					<sentence id="S5.312">D-mib2 mutant clones were generated in y w hs-flp;FRT2A D-mib2/FRT2A M(3)i55 ubi-nlsGFP larvae.</sentence>
					<sentence id="S5.313">neur1F65 mutant clones were generated as previously described [22].</sentence>
					<sentence id="S5.314">UAS-D-mib and UAS-YFP::D-mib lines were generated via standard P-element transformation.</sentence>
					<sentence id="S5.315">These constructs were derived from the SD05267 cDNA obtained from ResGen (Invitrogen, Carlsbad, California, United States).</sentence>
					<sentence id="S5.316">Cloning details for these constructs are available upon request.</sentence>
					<sentence id="S5.317">UAS-Dl (gift from M. Muskavith), UAS-Ser (gift from R. Fleming), UAS-Neur (gift from C. Delidakis), UAS-Ncdc10 (gift of T. Klein), Ser-GAL4 lines, and Ser mutant alleles are described in FlyBase (http://flybase.bio.indiana.edu/).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.318">Antibodies.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.319">Dissected imaginal discs were fixed in 4% paraformaldehyde (15 min) and incubated with antibodies at room temperature in PBS 1X with 0.1% TritonX-100.</sentence>
					<sentence id="S5.320">Rabbit polyclonal anti-D-mib antibodies were raised against the CYNERKTDDSELPGN peptide (CovalAb, Lyon, France).</sentence>
					<sentence id="S5.321">Immunopurified anti-D-mib antibodies (rabbit 541) were used (immunofluorescence, 1:100; Western blot, 1:1,000).</sentence>
					<sentence id="S5.322">Other primary antibodies were mouse anti-Cut (2B10; Developmental Studies Hybridoma Bank [DSHB, Iowa City, Iowa, United States]; 1:500); rat anti-DE-Cadherin (gift from T. Uemura; 1:50); guinea pig anti-Discs-large (gift from P. Bryant; 1:3,000); anti-?-galactosidase (Cappel [MP Biomedicals, Irvine, California, United States]; 1:1,000); mouse anti-DeltaECD (C594.9B; DSHB; 1:1,000); mouse anti-NotchECD (C548.2H; DSHB; 1:1,000); rat anti-Ser (gift from K. Irvine; 1:2,000); rat anti-Ser (gift from S. Cohen; 1:200); rabbit anti-Ser (gift from E. Knust; 1:10); and guinea pig anti-Senseless (gift from H. Bellen; 1:3,000).</sentence>
					<sentence id="S5.323">Cy2-, Cy3-, and Cy5-coupled secondary antibodies were from Jackson Laboratory (Bar Harbor, Maine, United States).</sentence>
					<sentence id="S5.324">Alexa488-coupled secondary antibodies and phalloidin were from Molecular Probes (Eugene, Oregon, United States).</sentence>
					<sentence id="S5.325">Images were acquired on a Leica (Wetzlar, Germany) SP2 microscope and assembled using Adobe Photoshop (Adobe Systems, San Jose, California, United States).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S5.326">Endocytosis assay.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S5.327">Third instar larvae wing discs were dissected in Schneider's Drosophila medium (Gibco BRL, San Diego, California, United States) containing 10% fetal calf serum (Gibco BRL).</sentence>
					<sentence id="S5.328">Wing discs were cut between the wing pouch and the thorax to facilitate antibody diffusion.</sentence>
					<sentence id="S5.329">Wing discs were cultured for 15 min with mouse anti-Dl (C594–9B at 1:100) and rat anti-Ser antibody (1:500; from K. Irvine).</sentence>
					<sentence id="S5.330">Following three medium changes and a 45-min chase period, wing discs were fixed and incubated with secondary antibodies.</sentence>
				</DocumentPart>
		</Document>
		<Document type="Biological_full_article">
			<DocID type="PMCID">PMC1131882</DocID>
				<DocumentPart type="Title">
					<sentence id="S6.1">RAG1 Core and V(D)J Recombination Signal Sequences Were Derived from Transib Transposons</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S6.2">Abstract</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.3">The V(D)J recombination reaction in jawed vertebrates is catalyzed by the RAG1 and RAG2 proteins, <xcope id="X6.3.1">which are <cue type="speculation" ref="X6.3.1">believed</cue> to have emerged approximately 500 million years ago from transposon-encoded proteins</xcope>.</sentence>
					<sentence id="S6.4">Yet <xcope id="X6.4.1"><cue type="negation" ref="X6.4.1">no</cue> transposase sequence similar to RAG1 or RAG2 has been found</xcope>.</sentence>
					<sentence id="S6.5">Here we show that the approximately 600-amino acid “core” region of RAG1 required for its catalytic activity is significantly similar to the transposase encoded by DNA transposons that belong to the Transib superfamily.</sentence>
					<sentence id="S6.6">This superfamily was discovered recently based on computational analysis of the fruit fly and African malaria mosquito genomes.</sentence>
					<sentence id="S6.7">Transib transposons also are present in the genomes of sea urchin, yellow fever mosquito, silkworm, dog hookworm, hydra, and soybean rust.</sentence>
					<sentence id="S6.8">We demonstrate that recombination signal sequences (RSSs) were derived from terminal inverted repeats of an ancient Transib transposon.</sentence>
					<sentence id="S6.9">Furthermore, the critical DDE catalytic triad of RAG1 is shared with the Transib transposase as part of conserved motifs.</sentence>
					<sentence id="S6.10">We also studied several divergent proteins encoded by the sea urchin and lancelet genomes that are 25%-30% identical to the RAG1 N-terminal domain and the RAG1 core.</sentence>
					<sentence id="S6.11">Our results provide the first direct evidence linking RAG1 and RSSs to a specific superfamily of DNA transposons and <xcope id="X6.11.1"><cue type="speculation" ref="X6.11.1">indicate that</cue> the V(D)J machinery evolved from transposons</xcope>.</sentence>
					<sentence id="S6.12">We <xcope id="X6.12.2"><cue type="speculation" ref="X6.12.2">propose</cue> that only the RAG1 core was derived from the Transib transposase, whereas the N-terminal domain was assembled from separate proteins of unknown function that <xcope id="X6.12.1"><cue type="speculation" ref="X6.12.1">may</cue> still be active in sea urchin, lancelet, hydra, and starlet sea anemone</xcope></xcope>.</sentence>
					<sentence id="S6.13">We also <xcope id="X6.13.2"><cue type="speculation" ref="X6.13.2">suggest</cue> that <xcope id="X6.13.1">the RAG2 protein was <cue type="negation" ref="X6.13.1">not</cue> encoded by ancient Transib transposons</xcope> but emerged in jawed vertebrates as a counterpart of RAG1 necessary for the V(D)J recombination reaction</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S6.14">Introduction</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.15">The immune system of jawed vertebrates detects and destroys foreign invaders, including bacteria and viruses, by a specific response to an unlimited number of antigens expressed by them.</sentence>
					<sentence id="S6.16">The antigens can be identified after they are specifically bound by surface receptors of vertebrate B and T immune cells (BCRs and TCRs, respectively).</sentence>
					<sentence id="S6.17">Because <xcope id="X6.17.1">the vast repertoire of BCRs and TCRs <cue type="negation" ref="X6.17.1">cannot</cue> be encoded genetically</xcope>, ancestors of jawed vertebrates adopted an elegant combinatorial solution [1].</sentence>
					<sentence id="S6.18">The variable portions of the BCR and TCR genes are composed of separate V (variable), D (diversity), and J (joining) segments, which are represented by fewer than a few hundred copies each.</sentence>
					<sentence id="S6.19">In a B and T cell site-specific recombination reaction, commonly known as V(D)J recombination, one V, one D, and one J segment are joined together into a single exon encoding the variable antigen-binding region of the receptor.</sentence>
					<sentence id="S6.20">In addition to this combinatorial diversity, further diversity is generated by small insertions and deletions at junctions between the joined segments.</sentence>
					<sentence id="S6.21">In V(D)J recombination, DNA cleavage is catalyzed by two proteins encoded by the recombination-activating genes, approximately 1040-amino acid (aa) RAG1 and approximately 530-aa RAG2 [2,3].</sentence>
					<sentence id="S6.22">The site specificity of the recombination is defined by the binding of RAG1/2 to RSSs flanking the V, D, and J segments [4].</sentence>
					<sentence id="S6.23">All RSSs can be divided into two groups, referred to as RSS12 and RSS23, and consist of conserved heptamer and nonamer sequences separated by a variable spacer either 12 ± 1 (RSS12) or 23 ± 1 (RSS23) bp long [4–7].</sentence>
					<sentence id="S6.24">During V(D)J recombination, RAG1/2 complex binds one RSS12 and one RSS23, bringing them into juxtaposition, and cuts the chromosome between the RSS heptamers and the corresponding V and D, D and J, or V and J coding segments [3,8].</sentence>
					<sentence id="S6.25">A rule requiring that efficient V(D)J recombination occur between RSS12 and RSS23 is known as the “12/23” rule [1].</sentence>
					<sentence id="S6.26">Even prior to the discovery of RAG1 and RAG2, it had been <xcope id="X6.26.1"><cue type="speculation" ref="X6.26.1">suggested</cue> that the first two RSSs were originally terminal inverted repeats (TIRs) of an ancient transposon whose accidental insertion into a gene ancestral to BCR and TCR, followed by gene duplications, triggered the emergence of the V(D)J machinery</xcope> [4].</sentence>
					<sentence id="S6.27">Later, this model was expanded by the <xcope id="X6.27.2"><cue type="speculation" ref="X6.27.2">suggestion</cue> that both RAG1 and RAG2 <xcope id="X6.27.1"><cue type="speculation" ref="X6.27.1">might</cue> have evolved from a transposase (TPase) that catalyzed transpositions of ancient transposons flanked by TIRs that were precursors of RSSs</xcope></xcope> [9].</sentence>
					<sentence id="S6.28">This model has received additional support through observations of similar biochemical reactions in transposition and V(D)J recombination [10,11].</sentence>
					<sentence id="S6.29">Finally, it was demonstrated that RAG1/2 catalyzed transpositions of a DNA segment flanked by RSS12 and RSS23 in vitro [12,13] and in vivo in yeast [14].</sentence>
					<sentence id="S6.30">In vertebrates, in vivo RAG-mediated transpositions are strongly suppressed, <xcope id="X6.30.1"><cue type="speculation" ref="X6.30.1">probably</cue> to minimize potential harm to genome function</xcope>.</sentence>
					<sentence id="S6.31">So far, only one putative instance of such a transposition has been reported [15].</sentence>
					<sentence id="S6.32">However, given the <xcope id="X6.32.2"><cue type="negation" ref="X6.32.2">lack</cue> of significant structural similarities between RAGs and known TPases</xcope>, <xcope id="X6.32.1">the “RAG transposon” model [9,12,13,16] remained <cue type="speculation" ref="X6.32.1">unproven</cue></xcope>.</sentence>
					<sentence id="S6.33">Here we demonstrate that the RAG1 core and RSSs were derived from a TPase and TIRs encoded by ancient DNA transposons from the Transib superfamily [17].</sentence>
					<sentence id="S6.34">The Transib superfamily is one of ten superfamilies of DNA transposons detected so far in eukaryotes [17].</sentence>
					<sentence id="S6.35">Like other DNA transposons, Transib transposons exist as autonomous and nonautonomous elements.</sentence>
					<sentence id="S6.36">The autonomous Transib transposons are 3–4 kb long and code for an approximately 700-aa TPase that is not similar to TPases from any other transposon superfamilies.</sentence>
					<sentence id="S6.37">Computational analysis of Transib elements, including their numerous insertions into copies of other transposons, demonstrated that Transib transposons are flanked by 5-bp target site duplications (TSDs), which also distinguishes this superfamily from all the others [17].</sentence>
					<sentence id="S6.38">Transib transpositions are <xcope id="X6.38.1"><cue type="speculation" ref="X6.38.1">expected</cue> to be catalyzed by the binding of the TPase to TIRs of autonomous and nonautonomous transposons</xcope> [17].</sentence>
					<sentence id="S6.39">As discussed in this paper, in addition to the fruit fly (Drosophila melanogaster) and African malaria mosquito (Anopheles gambiae) genomes, in which Transib transposons were originally discovered, these genes are also present in diverse animals (Table S1), including other species of fruit fly (e.g., Drosophila pseudoobscura, Drosophila willistoni), yellow fever mosquito (Anopheles aegypti), silkworm (Bombyx mori), red flour beetle (Tribolium castaneum), dog hookworm (Ancylostoma caninum), freshwater flatworm (Schmidtea mediterranea), hydra (Hydra magnipapillata), sea urchin (Strongylocentrotus purpuratus), and soybean rust (Phakopsora pachyrhizi).</sentence>
					<sentence id="S6.40"><xcope id="X6.40.1">Genomes of plants and vertebrates <cue type="speculation" ref="X6.40.1">seem</cue> to be free of any recognizable Transib transposons</xcope> (Figure 1).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.41">Schematic Presentation of Transib transposons, RAG1, RAG2, and RAG1-Like Proteins in Eukaryotes</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.42">The basic timescale of the evolutionary tree is based on published literature [49–51].</sentence>
					<sentence id="S6.43">Red circles mark species in which Transib TPases were found.</sentence>
					<sentence id="S6.44">Gray squares indicate RAG2; orange and blue ellipses show the RAG1 core and RAG1 N-terminal domain, respectively.</sentence>
					<sentence id="S6.45">Overall taxonomy, including common and Latin names, is reported on the right side of the figure.</sentence>
					<sentence id="S6.46">A question mark at the lamprey lineage indicates insufficient sequence data.</sentence>
					<sentence id="S6.47">A <xcope id="X6.47.2"><cue type="negation" ref="X6.47.2">lack</cue> of any labels</xcope> means that the Transib TPase and RAG1/2 are <xcope id="X6.47.1"><cue type="negation" ref="X6.47.1">not</cue> present in the sequenced portions of the corresponding genomes</xcope>.</sentence>
					<sentence id="S6.48">Among branches <xcope id="X6.48.2"><cue type="negation" ref="X6.48.2">lacking</cue> Transib TPases</xcope>, only <xcope id="X6.48.1">lamprey and crocodile genomes are <cue type="negation" ref="X6.48.1">not</cue> extensively sequenced to date</xcope>.</sentence>
					<sentence id="S6.49">In sea anemone, the RAG1 core–like protein is capped by the ring finger motif, which also forms the C-terminus in the RAG1 N-terminal domain.</sentence>
					<sentence id="S6.50">In fungi, the Transib TPase was detected in soybean rust only.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S6.51">Results</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S6.52">Detection of Similarity between Transib TPases and RAG1</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.53">Using protein sequences of seven known Transib TPases (Transib1 through Transib4 and Transib1_AG through Transib3_AG from D. melanogaster and A. gambiae, respectively) [17] as queries in a standard BLASTP search against all GenBank proteins, we found that the approximately 60-aa C-terminal portion of the Transib2_AG TPase was 35%-38% identical to the C-terminal portion of the RAG1 core (Figure S1).</sentence>
					<sentence id="S6.54">However, this similarity was only marginally significant (E = 0.07 where the E-value is an expected number of sequences matching by chance; Table 1).</sentence>
					<sentence id="S6.55">In another search against GenBank, using PSI-BLAST [18] (see Materials and Methods) with the Transib2_AG TPase as a query, we found that two unclassified proteins (GenBank gi 30923617 and 30923765; annotated as hypothetical proteins) and RAG1s constituted the only group of any GenBank proteins similar to the Transib2_AG TPase (Table 1).</sentence>
					<sentence id="S6.56">The statistical significance of similarity between the TPase and RAG1s was measured by Ei = 0.025, where Ei is the E-value threshold for the first inclusion of RAG1 sequences into the PSI-BLAST iterations [18] (Materials and Methods).</sentence>
					<sentence id="S6.57">The observed improvement in significance of the Transib/RAG1 similarity (from E = 0.07 in BLASTP to Ei = 0.025 in PSI-BLAST; Table 1) was due to the fact that both 151-aa and 123-aa hypothetical GenBank proteins were apparent remnants of Transib TPases (approximately 40% identity to the Transib2_AG TPase, E &lt; 10-10 in BLASTP).</sentence>
					<sentence id="S6.58"><xcope id="X6.58.1">The RAG1 proteins <cue type="speculation" ref="X6.58.1">appeared</cue> to be more similar to the position-specific scoring matrix (PSSM) created by PSI-BLAST based on multiple alignment of the Transib2_AG TPase and two Transib TPase-like proteins, than to the solo Transib2_AG TPase in the BLASTP search</xcope>.</sentence>
					<sentence id="S6.59">Given the latter observation, we decided to improve the quality of the PSSM constructed by PSI-BLAST for different Transib TPase sequences.</sentence>
					<sentence id="S6.60">To achieve that, we combined protein sequences of the seven known Transib TPases with the set of all GenBank proteins.</sentence>
					<sentence id="S6.61">As a result, Ei-values for matches of RAG1s to a new PSSM based on alignment of nine Transib TPases (the two GenBank TPase-like proteins plus seven added TPases) noticeably dropped in comparison with the Ei-values obtained for the PSSM constructed in the previous step based on alignment of the three TPases (Table 1).</sentence>
					<sentence id="S6.62">To support the observation that Ei-values of matches between RAG1s and the Transib TPase PSSM decrease as the number of TPase sequences used for construction of the PSSM increases, we identified six new Transib TPases (Transib5, Transib3_DP, Transib4_DP, Transib1_AA, Transib2_AA, Transib3_AA; Figure S2).</sentence>
					<sentence id="S6.63">During the next step of the PSI-BLAST analysis, the original GenBank set was combined with 13 Transib TPases.</sentence>
					<sentence id="S6.64">Again, Ei-values of matches between RAG1s and the new PSSM derived from multiple alignment of 15 Transib TPases (the two GenBank proteins plus all our TPases) were much smaller (approximately 10-6–10-3; Table 1) than those obtained based on the PSSM constructed from the nine TPases at the preceding step (approximately 10-3–10-2).</sentence>
					<sentence id="S6.65">In the final step, we identified one more set of five new Transib TPases (Transib1_DP, Transib2_DP, Transib4_AA, Transib5_AA, and Transib1_SP).</sentence>
					<sentence id="S6.66">When all 18 TPases were combined with the original GenBank set, the Ei values of matches between RAG1s and the Transib PSSM dropped significantly further (10-9–10-4; Table 1).</sentence>
					<sentence id="S6.67">During the final revision of this manuscript, we identified an intermediate RAG1-like sequence in Hydra magnipapillata, called RAG1L_HM, which is significantly similar to both RAG1 and Transib TPase, as shown later.</sentence>
					<sentence id="S6.68">This direct result represents an independent validation of our analysis.</sentence>
					<sentence id="S6.69">The PSI-BLAST PSSM of Transib TPases approximates conservation/variability of the Transib TPase consensus sequence.</sentence>
					<sentence id="S6.70">The more diverse the TPases used in determining the PSSM, the more accurate is the approximation; some of the insect Transib TPases are less than 30% identical to each other, as shown in Figure 2.</sentence>
					<sentence id="S6.71">The RAG1 Ei values decreased as the number of Transib TPases used for the PSSM construction increased due to the fact that RAG1 evolved from a Transib TPase.</sentence>
					<sentence id="S6.72">In all cases, the E values obtained after several rounds of iterations were less than 10-20 at the point of convergence.</sentence>
					<sentence id="S6.73">Nearly the entire sequences of several Transib TPases, excluding their 100–140-aa N-terminal domains, converged with an approximately 600-aa portion of RAG1 defined by positions approximately 360–1010 (Figure S3).</sentence>
					<sentence id="S6.74">This portion of RAG1 corresponds to the “RAG1 core,” hereafter numbered relative to human RAG1 (residues 387–1011), which along with RAG2 is known to be sufficient to perform V(D)J cleavage even after deletions of the 383-aa N-terminal and 32-aa C-terminal portions of RAG1 [19,20].</sentence>
					<sentence id="S6.75">During studies reported here, we identified 11 additional new families of Transib transposons and TPases (see Figure S2) that are well preserved in the genomes of fruit flies (Transib5 in D. melanogaster; and Transib1_DP, Transib2_DP, Transib3_DP, and Transib4_DP in D. pseudoobscura), mosquitoes (Transib1_AA, Transib2_AA, Transib3_AA, Transib4_AA, and Transib5_AA from A. aegypti) and sea urchin (Transib1_SP).</sentence>
					<sentence id="S6.76">Transib1_SP is the first Transib transposon identified outside of insect genomes.</sentence>
					<sentence id="S6.77">A well-preserved 4132-bp Transib1_SP element (contig 7839, positions 376–4506) is flanked by a 5-bp CGGCG TSD, and it encodes a 676-aa TPase (two exons) that is most similar to the Transib2 TPase (34% identity).</sentence>
					<sentence id="S6.78">Based on the currently available sequence data, we also reconstructed portions of TPases that were missed in previous studies [17] (Materials and Methods; see Figure S2).</sentence>
					<sentence id="S6.79">Using the Transib1_SP TPase as a query in TBLASTN searches against all GenBank sections (NR, HTGs, WGS, dbGSS, dbEST, dbSTS, and Trace Archives) we also found diverse Transib TPases in silkworm, red flour beetle, dog hookworm, freshwater flatworm, soybean rust, and hydra (Table S1).</sentence>
					<sentence id="S6.80">At the same time, recently sequenced genomes of honeybee, roundworms, fish, frog, mammals, sea squirts, plants, and fungi (except soybean rust) do <xcope id="X6.80.1"><cue type="negation" ref="X6.80.1">not</cue> contain any detectable Transib transposons</xcope> (see Figure 1).</sentence>
					<sentence id="S6.81"><xcope id="X6.81.1">The observed patchy distribution <cue type="speculation" ref="X6.81.1">could</cue> be caused by horizontal transfers and extinctions of Transib transposons in eukaryotic species</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S6.82">Significance of Similarities between the Transib TPases and RAG1 Core</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.83">The first column lists all 18 Transib TPases used as queries in our analysis, and the shaded areas indicate those added to the original set of all GenBank proteins in subsequent PSI-BLAST searches.</sentence>
					<sentence id="S6.84">The original GenBank set included two incomplete Transib TPase-like proteins.</sentence>
					<sentence id="S6.85">Column 2 lists E-values of best matches between RAG1s and Transib TPases detected in BLASTP searches against the original GenBank set.</sentence>
					<sentence id="S6.86">Column 3 reports Ei-values of best matches between RAG1s and a PSSM derived from the chosen query sequence and the two GenBank TPase-like proteins in PSI-BLAST searches against the original set of all GenBank proteins (see Materials and Methods).</sentence>
					<sentence id="S6.87">Columns 4–6 report the Ei-values for best matches between RAG1s and a Transib-derived PSSM after adding 7, 13, and 18 Transib TPases to the GenBank set, respectively.</sentence>
					<sentence id="S6.88">The numbers of the PSI-BLAST iterations after which the entire RAG1 core significantly aligned with the TPases are indicated in parentheses.</sentence>
					<sentence id="S6.89">Ei-values greater than 1 are indicated by dashes.</sentence>
					<sentence id="S6.90">Each empty cell indicates that <xcope id="X6.90.1">the corresponding TPase query was <cue type="negation" ref="X6.90.1">not</cue> used at the particular stage of PSI-BLAST analysis</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.91">Diversity of the Transib TPases and RAG1 Core–Like Proteins in Animals</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.92">The phylogenetic tree was obtained by using the neighbor-joining algorithm implemented in MEGA [44].</sentence>
					<sentence id="S6.93">Evolutionary distance for each pair of protein sequences was measured as the proportion of aa sites at which the two sequences were different.</sentence>
					<sentence id="S6.94">Its scale is shown by the horizontal bar.</sentence>
					<sentence id="S6.95">Bootstrap values higher than 60% are reported at the corresponding nodes.</sentence>
					<sentence id="S6.96">Species abbreviations are as follows: AA, yellow fever mosquito; AG, African malaria mosquito; BF, lancelet; CL, bull shark; DP, D. pseudoobscura fruit fly; FR, fugu fish; HM, hydra; HS, human; NV, starlet sea anemone; SP, sea urchin; XL, frog.</sentence>
					<sentence id="S6.97">(Transib1 through Transib5 are from D. melanogaster fruit fly.)</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S6.98">Common Structural Hallmarks of the Transib TPase and RAG1 Core</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.99">All three core residues from the catalytic DDE triad in the RAG1 proteins (residues 603, 711, and 965) that are necessary for V(D)J recombination [21,22] are conserved in the Transib TPases (Figures 3 and Figure S3).</sentence>
					<sentence id="S6.100">This includes the distances between the second D and E residues, which are much longer in Transib transposons (206–214 aa) and RAG1 (253 aa) than in DDE TPases from other studied superfamilies (e.g., approximately 35-aa in Mariner/Tc1 [23], 2-aa in P [23], approximately 35-aa in Harbinger [24], with hAT as an exception (325-aa, [25]).</sentence>
					<sentence id="S6.101">Moreover, each catalytic residue is a part of a motif that is conserved in the Transib TPases and RAG1 (motifs 4, 6, and 10 in Figures 3 and Figure S3).</sentence>
					<sentence id="S6.102">The RAG1 core is composed of the N-terminal region and the central and C-terminal domains ([26,27].</sentence>
					<sentence id="S6.103">The N-terminal region includes the RSS nonamer-binding regions (residues 387–480), referred to as NBR [28,29].</sentence>
					<sentence id="S6.104">The two terminal motifs of RAG1 NBR are conserved in the Transib TPases (Figure S3), which <xcope id="X6.104.2"><cue type="speculation" ref="X6.104.2">indicates that</cue> they <xcope id="X6.104.1"><cue type="speculation" ref="X6.104.1">may</cue> be important for their binding to the Transib TIRs during transposition (the RSS-like structure of TIRs is described below</xcope></xcope>; Figure 4).</sentence>
					<sentence id="S6.105">The central domain of the RAG1 core (residues 531–763) includes two aspartic acid residues from the DDE triad and <xcope id="X6.105.1">is also <cue type="speculation" ref="X6.105.1">thought</cue> to be involved in binding to the RSS heptamer and RAG2</xcope> [30,31].</sentence>
					<sentence id="S6.106">The C-terminal domain of RAG1 (residues 764–1011) is the portion of RAG1 that is most conserved between RAG1 and Transib TPases.</sentence>
					<sentence id="S6.107">In addition to the catalytic activity attributed to the last residue of the DDE triad, this domain has a strong nonspecific DNA-binding affinity because it binds to coding DNA upstream of the RSS heptamer, and <xcope id="X6.107.1">is <cue type="speculation" ref="X6.107.1">thought</cue> to be involved in RAG1 dimerization</xcope> [26,27].</sentence>
					<sentence id="S6.108"><xcope id="X6.108.1">This domain is <cue type="speculation" ref="X6.108.1">predicted</cue> to function analogously in Transib transposons</xcope>.</sentence>
					<sentence id="S6.109">Several other motifs conserved in Transib TPases and RAG1 include aa residues that have been shown experimentally to be important for specific functions in V(D)J recombination (Figure S3).</sentence>
					<sentence id="S6.110">Based on this information, <xcope id="X6.110.1">the function of these motifs in Transib TPases is <cue type="speculation" ref="X6.110.1">expected</cue> to be similar to that in RAG1</xcope>.</sentence>
					<sentence id="S6.111">Among the most conserved motifs, motif 5 (see Figures 3 and Figure S3) is of particular interest because <xcope id="X6.111.2">its function is <cue type="speculation" ref="X6.111.2">not known</cue> yet</xcope> but <xcope id="X6.111.1">is <cue type="speculation" ref="X6.111.1">expected</cue> to play a role both V(D)J recombination and Transib transposition</xcope>.</sentence>
					<sentence id="S6.112">In conjunction with detailed studies of the Transib superfamily, we also analyzed the remaining nine known superfamilies of DNA transposons defined by diverse TPases (see Table 1 in [24]).</sentence>
					<sentence id="S6.113">Some of these TPases, including Mariner, Harbinger, P, and hAT, also contain the catalytic DDE triad [23].</sentence>
					<sentence id="S6.114">However, based on PSI-BLAST searches, <xcope id="X6.114.2"><cue type="negation" ref="X6.114.2">no</cue> significant similarities between these nine TPases and RAG1 protein were found</xcope> <xcope id="X6.114.1">(data <cue type="negation" ref="X6.114.1">not</cue> shown)</xcope>.</sentence>
					<sentence id="S6.115">Therefore, given that the only significant similarity of the RAG1 core was to the Transib TPase, the RAG1 core was re-confirmed as belonging to the Transib superfamily.</sentence>
					<sentence id="S6.116">In addition to the statistically significant similarity between the approximately 600-aa RAG1 core and Transib TPases, there are two other lines of evidence <xcope id="X6.116.1"><cue type="speculation" ref="X6.116.1">suggesting</cue> evolution of the V(D)J machinery from Transib DNA transposons</xcope>.</sentence>
					<sentence id="S6.117">They include the characteristic TSDs and structure of the TIRs discussed in the next two sections.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.118">Multiple Alignment of Ten Conserved Motifs in the RAG1 Core Proteins and Transib TPases</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.119">The motifs are underlined and numbered from 1 to 10.</sentence>
					<sentence id="S6.120">Starting positions of the motifs immediately follow the corresponding protein names.</sentence>
					<sentence id="S6.121">Distances between the motifs are indicated in numbers of aa residues.</sentence>
					<sentence id="S6.122">Black circles denote conserved residues that form the RAG1/Transib catalytic DDE triad.</sentence>
					<sentence id="S6.123">The RAG1 proteins are as follows: RAG1_XL (GenBank GI no. 2501723, Xenopus laevis, frog), RAG1_HS (4557841, Homo sapiens, human), RAG1_GG (131826, Gallus gallus, chicken), RAG1_CL (1470117, Carcharhinus leucas, bull shark), RAG1_FR (4426834, Fugu rubripes, fugu fish).</sentence>
					<sentence id="S6.124">Coloring scheme [43] reflects physiochemical properties of amino acids: black shading marks hydrophobic residues, blue indicates charged (white font), positively charged (red font), and negatively charged (green font); red indicates proline (blue font) and glycine (green font); gray indicates aliphatic (red font) and aromatic (blue font); green indicates polar (black font) and amphoteric (red font); and yellow indicates tiny (blue font) and small (green font).</sentence>
					<sentence id="S6.125">The species abbreviations for the Transib transposons are as follows: AA, yellow fever mosquito; AG, African malaria mosquito; DP, D. pseudoobscura fruit fly.</sentence>
					<sentence id="S6.126">(Transib1 through Transib5 are from the fruitfly D. melanogaster.)</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.127">Structural Similarities between the Transib TIRs and V(D)J RSS Signals</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.128">The species abbreviations are: AA, yellow fever mosquito; AG, African malaria mosquito; DM, D. melanogaster fruit fly DP, D. pseudoobscura fruit fly; SP, sea urchin.</sentence>
					<sentence id="S6.129">(Transib1 through Transib5 are from the fruit fly D. melanogaster.)</sentence>
					<sentence id="S6.130">(A) Frequencies of the most frequent nucleotides at each position of the consensus sequence of the 5? TIRs of transposons that belong to 20 families of Transib transposons identified in fruit flies and mosquitoes.</sentence>
					<sentence id="S6.131">The RSS23 consensus sequence is shown immediately under the TIRs consensus sequence.</sentence>
					<sentence id="S6.132">The most conserved nucleotides in the RSS23 heptamer and nonamer, which are necessary for efficient V(D)J recombination, are highlighted.</sentence>
					<sentence id="S6.133">The 23 ± 1 bp variable spacer is marked by Ns.</sentence>
					<sentence id="S6.134">(B) Non-gapped alignment of consensus sequences of 5? TIRs from 21 families of Transib transposons.</sentence>
					<sentence id="S6.135">(C) The 12/23 rule follows from the basic structure of TIRs of the consensus sequences of transposons that belong to the Transib5, Transib2_AG, TransibN1_AG, TransibN2_AG, and TransibN3_AG families.</sentence>
					<sentence id="S6.136">The 5? TIRs of these transposons are aligned with the corresponding 3? TIRs.</sentence>
					<sentence id="S6.137">Structures of the 5? and 3? TIRs resemble RSS12 and RSS23, respectively.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S6.138">Similar Length of TSDs and Target Site Composition in Transib and RAG1/2-Mediated Transpositions</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.139">It has been known that RAG1-mediated transposition in vitro, both intermolecular and intramolecular, is most frequently accompanied by 5-bp TSDs [12,13].</sentence>
					<sentence id="S6.140">In one study [12], 35 of 38 (92%) TSDs generated during RAG-mediated intermolecular transposition were 5 bp long, and the remaining 8% were either 4 or 3 bp long.</sentence>
					<sentence id="S6.141">Also, 69% of 36 TSDs recovered during RAG-mediated intramolecular transpositions were 5 bp in length; of the remaining ones, 28% were 4 bp and 3% were 3 bp long.</sentence>
					<sentence id="S6.142">In another study [13], six of six TSDs detected in the intermolecular transposition were 5 bp long.</sentence>
					<sentence id="S6.143">Intramolecular transposition mediated by murine RAG1/2 proteins was also studied recently in vivo in yeast [14].</sentence>
					<sentence id="S6.144">Again, 60% of TSDs recovered in 26 events were 5 bp long [14].</sentence>
					<sentence id="S6.145">Given the predominance of 5-bp TSDs, it is striking that Transib transposons belong to the only superfamily of eukaryotic DNA transposons with 5-bp TSDs generated upon insertions into the genome [17,24].</sentence>
					<sentence id="S6.146">To illustrate the characteristic 5-bp TSDs, we show copies of Transib transposons with intact 5? and 3? TIRs from diverse families of Transib transposons present in the D. melanogaster, D. pseudoobscura, A. gambiae, and S. purpuratus genomes (Figure S4).</sentence>
					<sentence id="S6.147">Moreover, some families show high target site specificity, e.g., Transib-N1_AG and Transib-N2_AG integrate preferentially at cCASTGg and cCAWTGc, respectively (TSDs are capitalized).</sentence>
					<sentence id="S6.148">RAG1/2-mediated transpositions also show significant target specificity, <xcope id="X6.148.1"><cue type="speculation" ref="X6.148.1">presumably</cue> reflecting the original specificity of the Transib TPase</xcope> [12].</sentence>
					<sentence id="S6.149"><xcope id="X6.149.2">Indigenous properties of the Transib TPase, that were <xcope id="X6.149.1"><cue type="negation" ref="X6.149.1">not</cue> related directly to RAG1 functions, including those responsible for the precise 5-bp length of TSDs</xcope>, <cue type="speculation" ref="X6.149.2">might</cue> have been altered during evolution of RAG1</xcope>, leading to occasional 4-bp and 3-bp TSDs that are atypical for Transib transposons.</sentence>
					<sentence id="S6.150">Both RAG1/2-mediated and Transib transpositions show strong preference for GC-rich target sites [12–14,32], even though genomes hosting Transib transposons are AT-rich (Figure S4; Table 2).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S6.151">Structure of Transib TIRs</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.152">The structure and conservation patterns of the 38-bp termini of Transib transposons from 21 different families closely resemble those of RSSs, <xcope id="X6.152.1"><cue type="speculation" ref="X6.152.1">suggesting</cue> that the latter were derived from termini of ancient Transib transposons</xcope> (Figures 4 and S4).</sentence>
					<sentence id="S6.153">The 38-bp consensus TIR of Transib transposons consists of a conserved 5?- CACAATG heptamer separated by a variable 23-bp spacer from an AAAAAAATC-3? nonamer.</sentence>
					<sentence id="S6.154">This corresponds closely to the structure of RSSs, which are composed of the conserved heptamers 5?- CACAGTG separated by a variable 22-bp spacers from ACAAAAACC-like nonamers [1,5–7].</sentence>
					<sentence id="S6.155">Only bases at positions 1 through 3 in the heptamer and at positions 5 and 6 in the nonamer are universally conserved in RSSs and absolutely essential for efficient V(D)J recombination [5–7].</sentence>
					<sentence id="S6.156">The corresponding positions are perfectly conserved in all Transib transposons (Figure 4A and 4B; excluding the 85% conserved position 34 in the Transib consensus that corresponds to position 5 in the RSS nonamer).</sentence>
					<sentence id="S6.157">The probability of the observed match between the RSS and Transib termini to occur by chance is less than 10-3 (see Materials and Methods).</sentence>
					<sentence id="S6.158">Although most Transib families are represented by transposons flanked by TIRs similar to RSS23 (Figure 4A), several families include transposons with 5? and 3? termini similar to RSS12 and RSS23, respectively (Figure 4C).</sentence>
					<sentence id="S6.159">Therefore, even the 12/23 rule [1] can be derived directly from the sequence structure of known Transib transposons.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S6.160">RAG1 Core–Like Sequences in the Sea Urchin, Lancelet, Starlet Sea Anemone, and Hydra Genomes</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.161">Using RAG1 proteins as query sequences in a WU BLAST search against sea urchin contigs sequenced at Baylor College (see Materials and Methods), we identified eight proteins approximately 30% identical to portions of the RAG1 core and approximately 50% identical to each other (see Figures 2, 5, and S5).</sentence>
					<sentence id="S6.162">Only one protein is present in two copies, which are 94% identical to each other at the DNA level (contigs 81987 and 6797).</sentence>
					<sentence id="S6.163"><xcope id="X6.163.1">Both copies <cue type="speculation" ref="X6.163.1">appear</cue> to be encoded by pseudogenes damaged by a stop codon at the same position of each protein</xcope>.</sentence>
					<sentence id="S6.164">Interestingly, the 6,690-bp contig 6797 harbours two additional defective pseudogenes coding for different RAG1 core–like proteins (Figure 5).</sentence>
					<sentence id="S6.165">We also identified a 597-aa protein sequence encoded by a single open reading frame (contig 29068, positions 1157–2944), which is 28% identical to nearly the entire RAG1 core (positions 461–1002 in the human RAG1, Figure S5).</sentence>
					<sentence id="S6.166">Extensive analysis of the flanks <xcope id="X6.166.5"><cue type="negation" ref="X6.166.5">failed</cue> to show any hallmarks of <xcope id="X6.166.4"><cue type="speculation" ref="X6.166.4">putative</cue> transposons <xcope id="X6.166.3">that <cue type="speculation" ref="X6.166.3">might</cue> be associated with this RAG1-like protein</xcope></xcope></xcope>, and we did <xcope id="X6.166.2"><cue type="negation" ref="X6.166.2">not</cue> find any evidence <xcope id="X6.166.1"><cue type="speculation" ref="X6.166.1">indicating that</cue> other RAG1 core–like proteins are encoded by transposable elements</xcope></xcope> (Figure 5).</sentence>
					<sentence id="S6.167">Using FGENESH [33], we detected that the RAG1 core–like open reading frame (ORF) in the contig 29068 forms a terminal exon (positions 1154–2947) of an incomplete hypothetical gene composed of two exons (internal and terminal; see Figure S6).</sentence>
					<sentence id="S6.168">The 3? terminal portion of the internal exon encodes a protein sequence <xcope id="X6.168.1">that <cue type="speculation" ref="X6.168.1">appears</cue> to be marginally similar to an approximately 50-aa fragment of the RAG1 core (positions 394–454 in human RAG1</xcope>; Figure S5).</sentence>
					<sentence id="S6.169"><xcope id="X6.169.2">The RAG1 core–like protein in whole genome shotgun (WGS) contig 12509 (Figure 5) also <cue type="speculation" ref="X6.169.2">seems</cue> to be encoded by the last exon starting at position 1650 of a <xcope id="X6.169.1"><cue type="speculation" ref="X6.169.1">hypothetical</cue> RAG1-like gene</xcope></xcope>.</sentence>
					<sentence id="S6.170">Although the two proteins are only 38% identical to each other, they share common features: (1) <xcope id="X6.170.1">their N-terminal portions are <cue type="negation" ref="X6.170.1">missing</cue></xcope> and the RAG1-like sequences start at positions 17 or 18; (2) in both proteins the first aa residue overlaps with the acceptor splice site; and (3) their similarity to RAG1 starts at positions corresponding to position 470 of the human RAG1.</sentence>
					<sentence id="S6.171">Remarkably, the acceptor splice site positions in the sea urchin RAG1 core–like proteins closely correspond to those in RAG1 from teleosts (i.e., most of the living ray-finned or bony fish), in which RAG1 is split by an intron at position homologous to Gly460 in human RAG1 [34].</sentence>
					<sentence id="S6.172">Using the same RAG1 query sequences in a TBLASTN search against WGS trace sequences from the lancelet (Branchiostoma floridae) genome recently sequenced at the Joint Genome Institute (see Materials and Methods), we found that the lancelet genome encodes protein sequences approximately 35% identical to the RAG1 core (Figure S5; RAG1L_BF; BLASTP E-value is equal to 10-34).</sentence>
					<sentence id="S6.173">Again, as in the case of the sea urchin sequences, the lancelet RAG1 core–like elements show <xcope id="X6.173.2"><cue type="negation" ref="X6.173.2">no</cue> hallmarks of transposons</xcope> <xcope id="X6.173.1">(data <cue type="negation" ref="X6.173.1">not</cue> shown)</xcope>.</sentence>
					<sentence id="S6.174">However, unlike highly conserved RAG1 proteins, the RAG1 core–like proteins are remarkably diverse (see Figure 2).</sentence>
					<sentence id="S6.175">During the second review of the manuscript of this article, we were kindly informed by Dr.</sentence>
					<sentence id="S6.176">Hervé Philippe of a RAG1 core–like sequence present the starlet sea anemone (Nematostella vectensis).</sentence>
					<sentence id="S6.177">After that, we screened all available Trace Archives (Materials and Methods) and detected additional RAG1-like proteins.</sentence>
					<sentence id="S6.178">In starlet sea anemone, several approximately 1000-bp WGS trace sequences were found (e.g., GenBank Trace Archive IDs 668021618, 558173651, 568641192, and 599572062), which encode protein, called RAG1L_NV, that is approximately 30% identical to the human RAG1 core (positions 284–802, TBLASTN, 10-26 &lt; E &lt; 10-7).</sentence>
					<sentence id="S6.179">We also found several approximately 1000-bp WGS trace sequences of Hydra magnipapillata (Trace Archive IDs 688654311, 647073738, 666995387, 687186526, 688683890, and 688948453), coding for protein sequences 26%-30% identical to the RAG1 core (positions 753-995, E-value is approximately equal to 10-7 in a BLASTX search against GenBank).</sentence>
					<sentence id="S6.180">Using these trace sequences, we partially assembled a hydra gene, called RAG1L_NM, which encodes the RAG1 core–like protein.</sentence>
					<sentence id="S6.181">Remarkably, the hydra RAG1L_NM protein turned out to be significantly similar to the Transib TPase (26% identity; E-value is approximately equal to 10-14 in a BLASTX search against GenBank proteins combined with the Transib TPase sequences).</sentence>
					<sentence id="S6.182">Therefore, the hydra RAG1 core–like protein provides the first direct link between the RAG1 core and Transib TPase.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.183">Schematic Structure of the Sea Urchin RAG1-Like Sequences</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.184">Contig accession numbers are shown in the left column.</sentence>
					<sentence id="S6.185">Inverted complement contigs are marked by “c” followed by the contig number.</sentence>
					<sentence id="S6.186">In each contig, RAG1-like proteins (white rectangle) are schematically aligned with the human RAG1 core (top rectangle).</sentence>
					<sentence id="S6.187">Nucleotide positions of the RAG1-like sequences are shown beneath the white rectangles.</sentence>
					<sentence id="S6.188">Three pairs of recently duplicated sequences (nucleotide identity is higher than 95%) are underlined by red, green, and black lines, respectively.</sentence>
					<sentence id="S6.189">Transposable and repetitive elements detected in the flanking regions are marked by painted rectangles.</sentence>
					<sentence id="S6.190">Names of these elements are shown above the rectangles.</sentence>
					<sentence id="S6.191">Asterisks denote stop codons in the corresponding RAG1-like sequences.</sentence>
					<sentence id="S6.192">BLASTP E-values characterizing similarities between the sea urchin and RAG1 proteins are shown above the white rectangles.</sentence>
					<sentence id="S6.193">Multiple alignment of these protein sequences is reported in Figure S5.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S6.194">N-Terminal–Like Domain of RAG1 in the Sea Urchin, Lancelet, Starlet Sea Anemone, and Hydra Genomes</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.195">A separate analysis of the assembled sea urchin sequences yielded seven sequences encoding three diverse proteins that were significantly similar to the 380-aa N-terminal domain of RAG1 (BLASTX, E &lt; 10-4), excluding the 100-aa N-terminus (Figure 6).</sentence>
					<sentence id="S6.196">The first 305-aa protein is encoded by contig 1226, and its recently duplicated copies are on contigs 1219 and 1222 (approximately 95% identical to each other at the protein level).</sentence>
					<sentence id="S6.197">The second, 195-aa protein (contig 83099) is the shortest.</sentence>
					<sentence id="S6.198">It is only approximately 26% identical to the first protein and more than 90% identical at the DNA level to its duplicate on contig 86231.</sentence>
					<sentence id="S6.199">We also found a third protein on contig 768 that contains unique motifs in its N-terminal regions that best match the homologous regions of RAG1.</sentence>
					<sentence id="S6.200">Furthermore, we found that unassembled WGS trace sequences encode two other proteins, P4_SP and P5_SP, similar to the N-terminal RAG1 domain (Figure 6).</sentence>
					<sentence id="S6.201">By analyzing the lancelet WGS traces, we also found that the lancelet genome encodes five different proteins similar to the N-terminal domain of RAG1 (BLASTP E values in searches against all GenBank proteins were in a range of 10-14–10-7).</sentence>
					<sentence id="S6.202">DNA sequences coding for these proteins, P1_BF through P5_BF, were manually assembled from overlapping WGS sequences (data available upon request).</sentence>
					<sentence id="S6.203">The proteins detected in the sea urchin and lancelet genome share a ring finger motif as well as two novel motifs matching the N-terminal RAG1 domain (Figure 6) and remotely resembling C-x2-C zinc finger motifs.</sentence>
					<sentence id="S6.204">The new conserved motifs are H-x3-L-x3-C-R-x-C-G and D-x3-I-h-P-x2-F-C-x2-C, and <xcope id="X6.204.1">their function <cue type="speculation" ref="X6.204.1">remains to be determined</cue></xcope>.</sentence>
					<sentence id="S6.205">It is <xcope id="X6.205.1"><cue type="speculation" ref="X6.205.1">thought</cue> that the ring finger motif of RAG1 functions as a zinc-binding domain, is involved in dimerization [30,35], and acts as an E3 ligase in the ubiquitylation</xcope> [36].</sentence>
					<sentence id="S6.206">It also <xcope id="X6.206.1"><cue type="speculation" ref="X6.206.1">likely</cue> that the N-terminal RAG1 and RAG1-like proteins share an additional conserved motif W-x-p-h-x(3–6)-C-x2-C that resides between conserved motif 2 and the ring finger</xcope> (Figure 6).</sentence>
					<sentence id="S6.207">None of the sea urchin and lancelet proteins align to the approximately 100-aa N-terminus of RAG1, which <xcope id="X6.207.3"><cue type="speculation" ref="X6.207.3">may</cue> <xcope id="X6.207.2"><cue type="speculation" ref="X6.207.2">indicate that</cue> this portion is <xcope id="X6.207.1">missing from the genome <cue type="speculation" ref="X6.207.1">or</cue> highly diverged and difficult to detect</xcope></xcope></xcope>.</sentence>
					<sentence id="S6.208">It is also worth noting that this portion corresponds to a separate exon in some teleosts (see Discussion).</sentence>
					<sentence id="S6.209">The ring finger motif itself is also present in several sea urchin proteins unrelated to RAG1 but significantly similar to diverse proteins associated with immune and developmental systems as well as regulation of transcription.</sentence>
					<sentence id="S6.210">To test <xcope id="X6.210.1"><cue type="speculation" ref="X6.210.1">whether</cue> the reported sea urchin sequences represent a true RAG1-like match</xcope>, we cut off the ring finger motif and repeated the BLASTP search against all GenBank proteins.</sentence>
					<sentence id="S6.211">Even <xcope id="X6.211.1"><cue type="negation" ref="X6.211.1">without</cue> the finger</xcope>, the remaining portions of the sea urchin sequences were significantly similar to the corresponding portions of RAG1.</sentence>
					<sentence id="S6.212">BLASTP E-values were 9×10-9, 7×10-5, and 10-3 for the P5_SP, P4_SP, and 768_SP sequences, respectively; because both the low-complexity filter and composition-based statistics were applied, the corresponding E-values were estimated very conservatively.</sentence>
					<sentence id="S6.213">BLASTP searches of the sea urchin sequences against all GenBank proteins, <xcope id="X6.213.1"><cue type="negation" ref="X6.213.1">excluding</cue> RAG1</xcope>, detected only the ring finger domain of the sea urchin sequences.</sentence>
					<sentence id="S6.214">E-values of these matches were much higher than the E-values of similarities to the RAG1 proteins (SP_768: 0.04 versus 7×10-7; SP_86231: 3·10-4 versus 7×10-7; SP_1226: 10-4 versus 2×10-7; P4_SP: 10 versus 2×10-7; P5_SP does <xcope id="X6.214.1"><cue type="negation" ref="X6.214.1">not</cue> have ring finger</xcope> and matches RAG1 only, E-value == 9×10-7).</sentence>
					<sentence id="S6.215">Based on the same approach, our study found that the starlet sea anemone and hydra genomes also encode several families of the N-terminal RAG1 domain <xcope id="X6.215.2">that <cue type="speculation" ref="X6.215.2">appear</cue> to be separate from the RAG1 core–like proteins</xcope> (data <xcope id="X6.215.1"><cue type="negation" ref="X6.215.1">not</cue> shown)</xcope>.</sentence>
					<sentence id="S6.216">The only exception was the already mentioned sea anemone RAG1 core–like sequence.</sentence>
					<sentence id="S6.217">The approximately 90-aa N-terminus of the latter sequence is the ring finger (E &lt; 10-7, multiple BLASTP matches against known ring fingers in GenBank).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.218">Multiple Alignment of the RAG1 N-Terminal Domain and Sea Urchin Protein Sequences</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.219">RAG1_HS, RAG1_PD, RAG1_SS, RAG1_RM, and RAG1_LM mark the human (GenBank accession number NP_000439), lungfish (AAS75810), pig (BAC54968), stripe-sided rhabdornis or Rhabdornis mysticalis bird (AAQ76078), and latimeria (AAS75807) proteins, respectively.</sentence>
					<sentence id="S6.220">The sea urchin and lancelet proteins are marked by “_SP” and “_BF” following the identification numbers of the corresponding contigs.</sentence>
					<sentence id="S6.221">Protein sequences assembled from the sea urchin and lancelet WGS Trace Archives are denoted as P4-P5_SP and P1-P5_BF, respectively.</sentence>
					<sentence id="S6.222">Three conserved motifs are underlined and numbered.</sentence>
					<sentence id="S6.223">The third conserved motif is known as the ring finger.</sentence>
					<sentence id="S6.224">Distances from the protein N-termini are indicated by numbers.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S6.225">Discussion</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.226">The significant similarity between the Transib TPases and RAG1 core, the common structure of the Transib TIRs and RSSs, as well as the similar size of TSDs characterizing transpositions of Transib transposons and transpositions catalyzed by RAG1 and RAG2, directly support the 25-year-old hypothesis of a transposon-related origin of the V(D)J machinery.</sentence>
					<sentence id="S6.227">Previously, the “RAG transposon” hypothesis was open to challenge by alternative models of convergent evolution.</sentence>
					<sentence id="S6.228">Because there were <xcope id="X6.228.4"><cue type="negation" ref="X6.228.4">no</cue> known TPases similar to RAG1</xcope>, it <xcope id="X6.228.3"><cue type="speculation" ref="X6.228.3">could</cue> be <xcope id="X6.228.2"><cue type="speculation" ref="X6.228.2">argued</cue> that RAG1 independently developed some TPase-like properties, <xcope id="X6.228.1"><cue type="negation" ref="X6.228.1">rather than</cue> deriving them from a TE-encoded TPase</xcope></xcope></xcope> [24].</sentence>
					<sentence id="S6.229">These arguments can now be put to rest.</sentence>
					<sentence id="S6.230">As shown in this paper, the RAG1 core was derived from a Transib TPase, but given the low identity between the Transib TPase and the RAG1 core (14%–17%) it is <xcope id="X6.230.3"><cue type="speculation" ref="X6.230.3">not clear</cue> <xcope id="X6.230.2"><cue type="speculation" ref="X6.230.2">whether</cue> the ancestral transposon was <xcope id="X6.230.1">a member of the group of canonical Transib transposons preserved in modern genomes of insects, hydra, and sea urchin (see Figure 1), <cue type="speculation" ref="X6.230.1">or</cue> a member of an unknown group of Transib transposons that encoded a TPase that was more similar to RAG1 core than to the canonical TPase from the currently known Transib transposons</xcope></xcope></xcope>.</sentence>
					<sentence id="S6.231">Furthermore, after its recruitment, the RAG1 core most <xcope id="X6.231.1"><cue type="speculation" ref="X6.231.1">likely</cue> went through a period of intensive transformations due to diversifying/positive selection, which further decreased its similarity to Transib TPase</xcope>.</sentence>
					<sentence id="S6.232">Afterwards, the RAG1 genes continued to evolve at a slow and steady pace under stabilizing selection, as indicated by the observed conservation of the RAG1 core (79% identity between sharks and mammals).</sentence>
					<sentence id="S6.233"><xcope id="X6.233.1">Some of the intermediate stages of RAG1 evolution can be <cue type="speculation" ref="X6.233.1">inferred</cue> from analysis of the sea urchin in which RAG1-like proteins were recently observed [37], and from analysis of the lancelet, starlet sea anemone, and hydra genomes</xcope>.</sentence>
					<sentence id="S6.234">Based on the presence of stop codons disrupting some of the RAG1-like sequences, it has been <xcope id="X6.234.1"><cue type="speculation" ref="X6.234.1">suggested</cue> [37] that the sea urchin sequences represent remnants of transposable elements</xcope>.</sentence>
					<sentence id="S6.235">Typically, TPase-coding autonomous DNA transposons are present in only a few complete copies per genome.</sentence>
					<sentence id="S6.236">At the same time, sequences homologous to their terminal portions, including specific TIRs, are usually abundant due to the proliferation of nonautonomous DNA transposons fueled by the TPase expressed by the corresponding low-copy autonomous elements.</sentence>
					<sentence id="S6.237">Therefore, even if only 30% of the sea urchin genome has been sequenced to date, it is <xcope id="X6.237.2"><cue type="speculation" ref="X6.237.2">expected</cue> that the regions flanking the TPase portions of potential autonomous elements <xcope id="X6.237.1"><cue type="speculation" ref="X6.237.1">should</cue> be similar to numerous nonautonomous elements</xcope></xcope>.</sentence>
					<sentence id="S6.238">So far, we have found <xcope id="X6.238.1"><cue type="negation" ref="X6.238.1">no</cue> evidence of such similarities</xcope>.</sentence>
					<sentence id="S6.239">Detailed analysis of regions flanking the sea urchin RAG1-like DNA coding sequences revealed a variety of different transposable elements inserted in the proximity of the coding sequences (see Figure 5).</sentence>
					<sentence id="S6.240">Nevertheless, based on the orientations and relative positions of these transposons, <xcope id="X6.240.1"><xcope id="X6.240.2"><cue type="negation" ref="X6.240.1">none</cue> of them <cue type="speculation" ref="X6.240.2">appears</cue> to be associated with the RAG1-like sequences</xcope></xcope> (see Figure 5).</sentence>
					<sentence id="S6.241">We also could <xcope id="X6.241.1"><cue type="negation" ref="X6.241.1">not</cue> identify the 5-bp TSDs and TIRs characteristic of the Transib superfamily</xcope>.</sentence>
					<sentence id="S6.242">Still, given that only one third of the sea urchin genome is currently assembled as a set of contigs longer than several thousand nucleotides (the remaining portion is represented by short WGS sequences), we <xcope id="X6.242.1"><cue type="speculation" ref="X6.242.1">cannot rule out the possibility</cue> that the sea urchin RAG1-like proteins are remnants of an unknown branch of Transib transposons</xcope>.</sentence>
					<sentence id="S6.243">Given that the genomes of lancelet, hydra, and starlet sea anemone are currently available only as unassembled WGS traces, the <xcope id="X6.243.2"><cue type="speculation" ref="X6.243.2">question</cue> <xcope id="X6.243.1"><cue type="speculation" ref="X6.243.1">whether</cue> the corresponding RAG1-like sequences are remnants of transposons or genes/pseudogenes</xcope> <cue type="speculation" ref="X6.243.2">must be left open</cue></xcope>.</sentence>
					<sentence id="S6.244">The alternative <xcope id="X6.244.1"><cue type="speculation" ref="X6.244.1">possibility</cue> is that the sea urchin RAG1 core–like sequences represent diverse genes and pseudogenes that belong to a rapidly evolving multigene family</xcope>.</sentence>
					<sentence id="S6.245">This opens the tantalizing <xcope id="X6.245.1"><cue type="speculation" ref="X6.245.1">possibility</cue> that the RAG1 core was recruited from a Transib TPase in a common ancestor of Bilaterians and Cnidarians, and subsequently lost in nematodes, insects, and sea squirts</xcope> (see Figure 1).</sentence>
					<sentence id="S6.246">Furthermore, given that the sea urchin, lancelet, hydra, and starlet sea anemone genomes harbor several highly divergent N-terminal–like domains, separate from the RAG1 core–like sequences and known transposable elements, it is very <xcope id="X6.246.1"><cue type="speculation" ref="X6.246.1">likely</cue> that the N-terminal–like domains of RAG1 also form a multigene family that can be traced back to a common ancestor of Deuterostomes</xcope> (see Figure 1).</sentence>
					<sentence id="S6.247">If so, then <xcope id="X6.247.1">both N-terminal and core domains of RAG1 <cue type="speculation" ref="X6.247.1">might</cue> have been derived from different genes present in a common ancestor of Deuterostomes</xcope>.</sentence>
					<sentence id="S6.248">Alternatively, <xcope id="X6.248.1">the N-terminal domain of RAG1 <cue type="speculation" ref="X6.248.1">might</cue> have been derived from a separate, unknown transposon</xcope>.</sentence>
					<sentence id="S6.249">The N-terminal domain of RAG1 has long been viewed as distinct from the core domain due to its <xcope id="X6.249.1"><cue type="negation" ref="X6.249.1">lack</cue> of direct involvement in the V(D)J recombination reaction</xcope>.</sentence>
					<sentence id="S6.250">In the sea urchin, lancelet, hydra, and starlet sea anemone genomes, <xcope id="X6.250.1"><xcope id="X6.250.2">the RAG1 core–like sequences and the N-terminal domain–like sequences do <cue type="negation" ref="X6.250.1">not</cue> <cue type="speculation" ref="X6.250.2">appear</cue> to be linked to each other or to any other proteins</xcope></xcope>.</sentence>
					<sentence id="S6.251">The only notable exception is the anemone RAG1 core–like protein sequence, which is capped by the 90-aa ring finger motif.</sentence>
					<sentence id="S6.252">Taken together with the fact that only the RAG1 core is significantly similar to Transib TPase, the data <xcope id="X6.252.1"><cue type="speculation" ref="X6.252.1">suggest</cue> that the vertebrate RAG1 represents a fusion of once separate proteins</xcope>.</sentence>
					<sentence id="S6.253">This is consistent with the observation that in teleosts, (bony fish) the RAG1 gene is divided into exons by <xcope id="X6.253.1"><cue type="speculation" ref="X6.253.1">either</cue> one <cue type="speculation" ref="X6.253.1">or</cue> two</xcope> introns.</sentence>
					<sentence id="S6.254">As a result, the RAG1 core is split into separate exons at the aa position that corresponds to position 460 in the human RAG1gene [29,34,38].</sentence>
					<sentence id="S6.255">The core-like sequences encoded by the sea urchin WGS sequence contigs 29068 and 12509 correspond to <xcope id="X6.255.1"><cue type="speculation" ref="X6.255.1">either</cue> the second <cue type="speculation" ref="X6.255.1">or</cue> third</xcope> RAG1 exon in teleosts (depending on the number of introns), which is remarkably consistent with the fusion model.</sentence>
					<sentence id="S6.256">The same model predicts that <xcope id="X6.256.1">the N-terminal domain of RAG1 <cue type="speculation" ref="X6.256.1">could</cue> also be assembled from two separate domains based on the presence of the second intron in some teleosts, splitting the N-terminal domain into the 102-aa N-terminal subdomain and the rest</xcope> [34].</sentence>
					<sentence id="S6.257">As indicated above, <xcope id="X6.257.1">this subdomain, corresponding to the first exon in the genes split by two introns, <cue type="speculation" ref="X6.257.1">appears</cue> to be missing in the sea urchin, lancelet, hydra, and starlet sea anemone N-terminal–like proteins</xcope>.</sentence>
					<sentence id="S6.258"><xcope id="X6.258.2"><xcope id="X6.258.3">It <cue type="speculation" ref="X6.258.2">may</cue> be encoded by a separate exon that is difficult to detect given its short length and the high level of sequence divergence between these species and vertebrates</xcope>, <cue type="speculation" ref="X6.258.3">or</cue> <xcope id="X6.258.1">it <cue type="speculation" ref="X6.258.1">might</cue> have been added in vertebrates</xcope></xcope>.</sentence>
					<sentence id="S6.259">Similarly, the RAG1 core–like protein in the sea urchin genome is shorter in its N-terminal part than the core domain in vertebrates and the corresponding Transib TPase.</sentence>
					<sentence id="S6.260">Again, it is <xcope id="X6.260.4"><cue type="speculation" ref="X6.260.4">unclear</cue> <xcope id="X6.260.3"><cue type="speculation" ref="X6.260.3">if</cue> this part is <xcope id="X6.260.1"><xcope id="X6.260.2"><cue type="negation" ref="X6.260.1">not</cue> present in sea urchins</xcope> <cue type="speculation" ref="X6.260.2">or</cue> simply undetectable</xcope> due to its small size and the high sequence divergence</xcope></xcope>.</sentence>
					<sentence id="S6.261">It is currently <xcope id="X6.261.1"><cue type="speculation" ref="X6.261.1">believed</cue> that both RAG1 and RAG2 proteins were originally encoded by the same transposon recruited in a common ancestor of jawed vertebrates</xcope> [3,12,13,16].</sentence>
					<sentence id="S6.262">However, <xcope id="X6.262.1"><cue type="negation" ref="X6.262.1">none</cue> of the Transib transposons identified so far encode any proteins other than the Transib/RAG TPase</xcope>.</sentence>
					<sentence id="S6.263">Also, we could <xcope id="X6.263.1"><cue type="negation" ref="X6.263.1">not</cue> find any RAG2-like sequences in the recently sequenced sea urchin, lancelet, hydra, and sea anemone genomes, which encode RAG1-like sequences</xcope>.</sentence>
					<sentence id="S6.264">Autonomous DNA transposons from the MuDR, Harbinger, and En/Spm superfamilies are each known to encode a second regulatory protein [23,24], whereas some transposons from these superfamilies encode the TPase only.</sentence>
					<sentence id="S6.265">Therefore, it is in principle <xcope id="X6.265.1"><cue type="speculation" ref="X6.265.1">possible</cue> that an ancient vertebrate Transib that was a direct ancestor of the RAG1 core also encoded a second protein, the direct ancestor of RAG2</xcope>.</sentence>
					<sentence id="S6.266">Nevertheless, the <xcope id="X6.266.3"><cue type="speculation" ref="X6.266.3">apparent</cue> <xcope id="X6.266.2"><cue type="negation" ref="X6.266.2">lack</cue> of RAG2-like proteins in the sequenced portion of the sea urchin, lancelet, hydra, and sea anemone genomes, as well as in Transib transposons</xcope></xcope> <xcope id="X6.266.1"><cue type="speculation" ref="X6.266.1">suggests</cue> that RAG2 was introduced in a separate event in jawless vertebrates</xcope>.</sentence>
					<sentence id="S6.267">However, given the low 30% identity between the RAG1 and sea urchin/lancelet/sea squirt RAG1-like proteins, we <xcope id="X6.267.3"><cue type="speculation" ref="X6.267.3">cannot exclude the possibility</cue> that the ancestral RAG2 protein went through a period of strong diversification driven by positive selection, and it can <xcope id="X6.267.2"><cue type="negation" ref="X6.267.2">no longer</cue> be identified by sequence comparisons</xcope> but <xcope id="X6.267.1"><cue type="speculation" ref="X6.267.1">may</cue> still be present in invertebrates</xcope></xcope>.</sentence>
					<sentence id="S6.268">In any case, <xcope id="X6.268.2">the origin of the V(D)J recombination system in jawless vertebrates <cue type="speculation" ref="X6.268.2">appears</cue> to be a culmination of earlier evolutionary processes <xcope id="X6.268.1"><cue type="negation" ref="X6.268.1">rather than</cue> an isolated event associated with insertion of a single transposon</xcope></xcope>.</sentence>
					<sentence id="S6.269">If so, detailed studies of individual components, including active Transib transposons and invertebrate proteins homologous to RAG1 elements <xcope id="X6.269.1"><cue type="speculation" ref="X6.269.1">can</cue> bring new breakthroughs in our understanding of evolutionary and mechanistic aspects of V(D)J recombination</xcope>.</sentence>
					<sentence id="S6.270">The observed sequence similarity between the RAG1 and Transib TPase protein <xcope id="X6.270.1"><cue type="speculation" ref="X6.270.1">can</cue> help to identify aa residues in the TPase that are crucial for transposition of Transib transposons</xcope>.</sentence>
					<sentence id="S6.271">For instance, on the basis of the TPase comparison to RAG1 (see Figures S1 and S3), we were able to identify correct positions of the last two aa residues in the DDE catalytic triad (see Figure 2 in [17]), missed in our previous study due to insufficient data.</sentence>
					<sentence id="S6.272">Interestingly, only two cysteines of the zinc finger B (ZFB) C2H2 motif in RAG1 (residues 695–761) involved in its binding to RAG2 [30,31] are perfectly conserved in the Transib TPases (motif 7; see Figures 3 and Figure S3).</sentence>
					<sentence id="S6.273"><xcope id="X6.273.2">The remaining portion of the ZFB motif was <cue type="speculation" ref="X6.273.2">probably</cue> lost in TPases of insect Transib transposons, which do <xcope id="X6.273.1"><cue type="negation" ref="X6.273.1">not</cue> encode RAG2-like proteins</xcope></xcope>.</sentence>
					<sentence id="S6.274">Notably, two ZFB cysteines are part of the conserved SxxCxxC motif, and mutations of the serine from the same motif cause severe defects in RAG1 transpositions in vitro [32].</sentence>
					<sentence id="S6.275">Therefore, <xcope id="X6.275.1">the presence of serine in this motif is <cue type="speculation" ref="X6.275.1">expected</cue> to be crucial to Transib transpositions</xcope>.</sentence>
					<sentence id="S6.276">After submission of our manuscript, additional biochemical evidence <xcope id="X6.276.1"><cue type="speculation" ref="X6.276.1">favoring</cue> evolution of V(D)J recombination from transposable elements</xcope> was reported [25].</sentence>
					<sentence id="S6.277">Analogously to V(D)J recombination, transposition of the fly Hermes transposon, which belongs to the hAT superfamily, is also characterized by a double-strand break via hairpin formation on flanking DNA and 3? OH joining to the target DNA [25].</sentence>
					<sentence id="S6.278">However, although the observed biochemical relationship between the hAT TPase and V(D)J recombination is a step forward in our understanding of transposition reaction, several arguments strongly <xcope id="X6.278.2"><cue type="speculation" ref="X6.278.2">suggest</cue> that V(D)J machinery evolved from a Transib <xcope id="X6.278.1"><cue type="negation" ref="X6.278.1">rather than</cue> from hAT transposon</xcope></xcope>.</sentence>
					<sentence id="S6.279">First, as we mentioned previously, there is no significant sequence identity between hAT TPases and RAG1, even if one employs a PSI-BLAST search with most relaxed parameters (i.e., E &lt; 10, <xcope id="X6.279.2"><cue type="negation" ref="X6.279.2">no</cue> filters</xcope>, <xcope id="X6.279.1"><cue type="negation" ref="X6.279.1">no</cue> composition-based statistics</xcope>).</sentence>
					<sentence id="S6.280">Second, although RAG1/2-mediated transpositions are characterized by 5-bp (sometimes 4-bp) TSDs, all known hAT transposons are characterized by 8-bp TSDs.</sentence>
					<sentence id="S6.281">Third, unlike in the case of Transib transposons, TIRs of hAT transposons are different from RSS both in terms of DNA sequence similarities and their conservation patterns (Figure S7).</sentence>
					<sentence id="S6.282">Fourth, hAT- and RAG1/2-mediated transpositions differ dramatically in terms of the GC content of their target sites: Unlike Transib transposons and RAG1 transpositions occurring in GC-rich DNA, hAT transposons tend to be integrated into AT-rich regions (Table S2).</sentence>
					<sentence id="S6.283">All four arguments strongly <xcope id="X6.283.1"><cue type="speculation" ref="X6.283.1">favor</cue> evolution of V(D)J machinery from a Transib transposon</xcope>.</sentence>
					<sentence id="S6.284">Most <xcope id="X6.284.1"><cue type="speculation" ref="X6.284.1">likely</cue>, the Transib transpositions are also characterized by hairpin intermediates formed by the ends of the donor DNA double-strand breaks</xcope>, as observed during V(D)J recombination and hAT transposition.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S6.285">Materials and Methods</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S6.286">DNA and protein sequences.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.287">Assembled D. pseudoobscura sequences were downloaded from the Human Genome Sequencing Center at Baylor College of Medicine through the Web site at http://hgsc.bcm.tmc.edu/projects/drosophila/ on 2 March 2004.</sentence>
					<sentence id="S6.288">Preliminary A. aegypti sequence data were obtained from The Institute for Genomic Research through the Web site at http://www.tigr.org on 4 March 2004.</sentence>
					<sentence id="S6.289">Assembled D. melanogaster sequences were downloaded from the Berkeley Drosophila Genome Project at http://www.fruitfly.org/sequence/download.html on 17 February 2004.</sentence>
					<sentence id="S6.290">Partially assembled S. purpuratus contig sequences were downloaded on 12 August 2004 from the Baylor College of Medicine through the Web site at ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Spurpuratus/blast/Spur20030922-genome.</sentence>
					<sentence id="S6.291">In addition to the assembled contigs, Baylor College of Medicine, Human Genome Sequencing Center (http://www.hgsc.bcm.tmc.edu) produced an approximately 8-Gb set of short unassembled WGS sequences, called “traces”, which cover nearly the entire sea urchin genome.</sentence>
					<sentence id="S6.292">We downloaded these sequences from the GenBank Trace Archive at the National Center for Biotechnology Information (NCBI; ftp://ftp.ncbi.nih.gov/pub/TraceDB/strongylocentrotus_purpuratus/) on 17 November 2004.</sentence>
					<sentence id="S6.293">Also, we downloaded an approximately 5-Gb set of unassembled traces that cover almost completely the 600-Mb genome of Florida lancelet (ftp://ftp.ncbi.nih.gov/pub/TraceDB/branchiostoma_floridae/; 3 December 2004).</sentence>
					<sentence id="S6.294">These sequences were produced and deposited in the GenBank Trace Archive by Department of Energy Joint Genomic Institute (http://www.jgi.doe.gov/).</sentence>
					<sentence id="S6.295">All other DNA and protein sequences were accessed from GenBank (NCBI) through the server at http://www.ncbi.nih.gov/Genbank/ and from Ensembl (EMBL-EBI and Sanger Institute) via the server at http://www.ensembl.org.</sentence>
					<sentence id="S6.296">Sequences of the Transib1 through Transib4 and Transib1_AG through Transib3_AG transposons [17] were obtained from the D. melanogaster (drorep.ref) and A. gambiae (angrep.ref) sections of Repbase Update [39] at Genetic Information Research Institute (http://www.girinst.org).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S6.297">Sequence analysis.</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.298">Computer-assisted identification and reconstruction of the Transib transposons was done as described previously [17,40–42].</sentence>
					<sentence id="S6.299">DNA sequence analysis including local sequence alignments, multiple alignments, and reconstruction of the Transib consensus sequences was done using software developed at Genetic Information Research Institute (available upon request) and WU-BLASTN 2.0 (http://blast.wustl.edu).</sentence>
					<sentence id="S6.300">To avoid background noise introduced by mutations, Transib relics, whose TPase-coding regions contained numerous stop codons and indels, were ignored unless several copies were available.</sentence>
					<sentence id="S6.301">(We included in the analysis incomplete relics of the Transib2–5_AA TPases represented by single DNA copies.)</sentence>
					<sentence id="S6.302">Prediction of <xcope id="X6.302.1"><cue type="speculation" ref="X6.302.1">putative</cue> exons and introns encoded by the Transib consensus sequences</xcope> was done with FGENESH [33] (at http://www.softberry.com).</sentence>
					<sentence id="S6.303">Multiple alignments of distantly related RAG1 and Transib TPase protein sequences were created by T-Coffee [40].</sentence>
					<sentence id="S6.304">Shading and minor manual refinements of the aligned sequences were done using Genedoc [43].</sentence>
					<sentence id="S6.305">Phylogenetic trees were produced by using MEGA3 [44].</sentence>
					<sentence id="S6.306">Some of the sea urchin sequences encoding the RAG1 N-terminal domain were assembled from traces based on the Baylor BAC-Fisher server at http://www.hgsc.bcm.tmc.edu/BAC-Fisher/ (the results of assembly were verified manually).</sentence>
					<sentence id="S6.307">All GenBank proteins were downloaded from ftp://ftp.ncbi.nih.gov/blast/DB/fasta/nr (February 2004) and were combined into a single set with the identified Transib TPases.</sentence>
					<sentence id="S6.308"><xcope id="X6.308.1"><cue type="negation" ref="X6.308.1">No</cue> Transib TPases had been deposited or annotated previously in GenBank</xcope>, except for two short hypothetical proteins predicted automatically during annotation of the D. melanogaster genome: 151-aa gi:30923617 and 123-aa gi:30923765.</sentence>
					<sentence id="S6.309">These proteins are apparent fragments of Transib TPases encoded by relics of Transib transposons, including Transib5_DM.</sentence>
					<sentence id="S6.310">A standalone 2001 version of PSI (Position-Specific Iterating)-BLAST [18,45] was used for detection of proteins that were significantly similar to TPases encoded by Transib and other superfamilies of DNA transposons.</sentence>
					<sentence id="S6.311">The PSI-BLAST program [18,45] is much more sensitive than a regular BLAST search due to the use of PSSM) PSI-BLAST first performs a standard BLASTP search of a protein query against a protein database and constructs a multiple alignment of matches exceeding a certain E-value threshold (called Ei value for the inclusion of sequences into PSI-BLAST iterations).</sentence>
					<sentence id="S6.312">From this alignment, a PSSM is constructed.</sentence>
					<sentence id="S6.313">The PSSM is a weight matrix indicating the relative occurrence of each of the 20 aa at each position in the alignment.</sentence>
					<sentence id="S6.314">This new PSSM is used as the score matrix for a new BLAST search in a second iteration.</sentence>
					<sentence id="S6.315">The process is repeated for a specific number of iterations or until convergence, when <xcope id="X6.315.1"><cue type="negation" ref="X6.315.1">no</cue> additional proteins are added on successive iterations</xcope>.</sentence>
					<sentence id="S6.316">The use of a PSSM in place of a fixed generic substitution matrix such as BLOSUM62 results in a much more sensitive BLAST search [18,45].</sentence>
					<sentence id="S6.317">Important practical aspects of using PSI-BLAST were recently described [46].</sentence>
					<sentence id="S6.318">To ensure that <xcope id="X6.318.1">a conservation profile for the Transib TPases and RAG1 proteins was <cue type="negation" ref="X6.318.1">not</cue> produced by a systematic error</xcope>, we employed a procedure of “step-wise” PSI-BLAST iterations.</sentence>
					<sentence id="S6.319">In this procedure we studied dependence of Ei values on the number of the Transib TPases combined with the GenBank proteins.</sentence>
					<sentence id="S6.320">The following protocol describes the procedure: (1) Use a GenBank set combined with N number of Transib TPases (in our studies, N was equal to 7, 13, and 18), (2) run PSI-BLAST against GenBank combined with TPases using each TPase as a query or seed, (3) select only Transib TPase sequences with E-values less than 10-5 to define the PSSM, (4) take the best E-value (Ei) obtained by PSI-BLAST for RAG1s when PSSM is constructed <xcope id="X6.320.1"><cue type="negation" ref="X6.320.1">without</cue> RAG1</xcope>, then (5) repeat these operations for different numbers (N) of TPases.</sentence>
					<sentence id="S6.321">Significant convergence of RAG1 and Transib TPases was observed to be independent of the particular type of substitution matrix (the same result was observed for both BLOSUM62 and PAM70 matrixes).</sentence>
					<sentence id="S6.322">To avoid detection of false similarities caused by simple repeats and coiled coils, the PSI-BLAST search was performed using stringent conditions with the SEG [47] and COILS [48] filters masking all low-complexity regions and coiled coils, respectively; composition-based statistics [45] were also employed.</sentence>
					<sentence id="S6.323">The probability P1 that the 5? terminus of a transposon from a particular Transib family would match by chance an RSS at its most conserved positions (positions 1-3 in the RSS heptamer, and positions 5 and 6 in the RSS nonamer) was estimated based on the following formula: P1 = fC × fA × fC × fA × fA, where fC (0.2) and fA (0.3) are frequencies of C and A in a set of 38-bp 5? termini of Transib transposons from 21 families (see Figure 4).</sentence>
					<sentence id="S6.324">The value of P1 is 0.001, indicating a significant similarity between Transib TIRs and RSS.</sentence>
					<sentence id="S6.325">Indeed, given that these five positions conserved in RSS are conserved in all TIRs from 21 families of Transib transposons, and the average identity between these 38-bp TIRs is only 49%, the chance of randomly matching these positions in TIRs from all 21 families is extremely small.</sentence>
					<sentence id="S6.326">TBLASTN searches against the Trace Archive were performed by using the BLAST client (blastcl3 or netblast at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/, which accesses the NCBI BLAST search engine.</sentence>
					<sentence id="S6.327">Names of all available Trace Databases were taken from a list of databases at http://www.ncbi.nlm.nih.gov/blast/mmtrace.shtml.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S6.328">Preferential Insertion of Transib transposons into GC-Rich Sites</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.329">Each of the 35-bp insertion sites corresponds to two 20-bp DNA fragments flanking a genomic Transib element at its 5? and 3? termini.</sentence>
					<sentence id="S6.330">One of the 5-bp TSDs flanking the 3? terminus of a Transib was excluded in each case.</sentence>
					<sentence id="S6.331">Analogously, the 15-bp insertion sites were composed of two 10-bp flanking fragm.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S6.332">Supporting Information</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.333">Similarity between C-Terminal Portions of the Transib2_AG TPase and RAG1</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.334">Two examples extracted from the NCBI BLASTP output illustrate similarity between the approximately 60-aa C-terminal portions of the Transib2_AG TPase (which we used as a query in a BLASTP search against all GenBank proteins) and the RAG1 core.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.335">Multiple Alignment of Transib TPases</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.336">The catalytic DDE triad is marked by black rectangles.</sentence>
					<sentence id="S6.337">Amino acids are shaded on the basis of their physiochemical properties according to the color scheme implemented in Genedoc [43]: Black shading marks hydrophobic residues, blue indicates charged (white font), positively charged (red font), and negatively charged (green font); red indicates proline (blue font) and glycine (green font); gray indicates aliphatic (red font) and aromatic (blue font); green indicates polar (black font) and amphoteric (red font); yellow indicates tiny (blue font) and small (green font).</sentence>
					<sentence id="S6.338">The species abbreviations are as follows: SP, sea urchin; DP, D. pseudoobscura fruit fly; AG, African malaria mosquito; AA, yellow fever mosquito.</sentence>
					<sentence id="S6.339">Transib1 through Transib5 are from the D. melanogaster fruit fly genome.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.340">Multiple Alignment of the RAG1 Core and Transib TPase Proteins</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.341">The shading scheme is the same as in Figure S2.</sentence>
					<sentence id="S6.342">The catalytic DDE triad is marked by black rectangles.</sentence>
					<sentence id="S6.343">RAG1 aa whose replacements resulted in previously detected defects of V(D)J recombination [31] are marked by color rectangles indicated below the alignment blocks; red indicates DNA binding defect; green indicates nicking defect; cyan indicates hairpin defect; blue indicates joining mutants; yellow indicates catalytic mutants; gray indicates joining/transposition.</sentence>
					<sentence id="S6.344">Presence and <xcope id="X6.344.1"><cue type="negation" ref="X6.344.1">absence</cue> of corresponding residues in the Transib TPases</xcope> are indicated by + and -, respectively.</sentence>
					<sentence id="S6.345">Conserved motifs are marked by lines numbered from 1 to 10.</sentence>
					<sentence id="S6.346">The species abbreviations are as follows: DP, D. pseudoobscura fruit fly; AG, African malaria mosquito; AA, yellow fever mosquito; GG, chicken; HS, human; XL, frog; CL, bull shark; FR, fugu fish.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.347">TSDs in Transposons from Different Transib Families</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.348">For each family, DNA copies of transposons are aligned to the corresponding consensus sequence.</sentence>
					<sentence id="S6.349">The consensus sequence is shown in the top line.</sentence>
					<sentence id="S6.350">Dots indicate nucleotide identity with the consensus sequence; hyphens represent alignment gaps.</sentence>
					<sentence id="S6.351"><xcope id="X6.351.1">Internal portions of transposons are <cue type="negation" ref="X6.351.1">not</cue> shown</xcope> and are marked by xxx.</sentence>
					<sentence id="S6.352">TSDs are highlighted.</sentence>
					<sentence id="S6.353">Coordinates of the reported elements are shown in the first two columns (sequence name, beginning to end).</sentence>
					<sentence id="S6.354">(A) TransibN1_AG family from mosquito.</sentence>
					<sentence id="S6.355">(B) TransibN2_AG family from mosquito.</sentence>
					<sentence id="S6.356">(C) TransibN3_AG family from mosquito.</sentence>
					<sentence id="S6.357">(D) TransibN1_DP family from fruit fly.</sentence>
					<sentence id="S6.358">(E) Hopper family from fruit fly.</sentence>
					<sentence id="S6.359">(F) TransibN1_DM family from fruit fly.</sentence>
					<sentence id="S6.360">(G) TransibN1_SP family from sea urchin.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.361">Multiple Alignment of the RAG1 Core and RAG1 Core–Like Proteins Encoded by the Sea Urchin and Lancelet Genomes</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.362">The shading scheme is the same as in Figure S2 and S3.</sentence>
					<sentence id="S6.363">The species abbreviations are as follows: SP, sea urchin; BF, lancelet; HS, human; CL, bull shark; GG, chicken; XL, frog; FR, fugu fish.</sentence>
					<sentence id="S6.364">The lancelet RAG1L_BF protein is encoded by several overlapping WGS trace sequences (for example, GenBank Trace Archive identification numbers 543943730, 538583629).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.365">RAG1-Like Protein SP_29068 in the Sea Urchin Contig 29068</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.366">(A) Exon/intron structure of the SP_29068 gene is reported based on the FGENESH prediction.</sentence>
					<sentence id="S6.367">(B) Alignment of the predicted protein and human RAG1 (29% identity, E = 10-43.</sentence>
					<sentence id="S6.368">The intron in SP_29068 is inserted between residues shaded in green and red.</sentence>
					<sentence id="S6.369">Gly460 that harbors the intron in the teleost RAG1 is shaded in black.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S6.370">Structure of hAT 5? Termini</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.371">Non-gapped alignment of consensus sequences of 5? termini of transposons from 22 different families is shown beneath the RSS23 consensus sequence, composed of the RSS heptamer and nonamer.</sentence>
					<sentence id="S6.372">The most conserved nucleotides in the heptamer and nonamer, which are necessary for efficient V(D)J recombination, are highlighted.</sentence>
					<sentence id="S6.373">Among the necessary RSS nucleotides, only one, marked by a + corresponds to a nucleotide that is 100% conserved in hAT transposons.</sentence>
					<sentence id="S6.374">The critical third nucleotide of the hAT 5? termini is always G, as opposed to C in the RSS heptamer.</sentence>
					<sentence id="S6.375">It is also clear from the alignment that the hAT termini do <xcope id="X6.375.2"><cue type="negation" ref="X6.375.2">not</cue> have any second conserved block</xcope>, <xcope id="X6.375.1">which is <cue type="speculation" ref="X6.375.1">expected</cue> to be preserved if RSSs have evolved from hAT termini</xcope>.</sentence>
					<sentence id="S6.376">Hobo (GenBank number X04705), Homer (AF110403), Hermes (L34807), Ac9 (K01904), Tam3_AM (X55078), TAG1 (L12220), Pegasus (U47019) are active hAT transposons from fruit fly, Queensland fruit fly, house fly, maize, snapdragon, thale-cress, and African malaria mosquito, respectively.</sentence>
					<sentence id="S6.377">HOPPER_BD is from oriental fruit fly (GenBank AF486809).</sentence>
					<sentence id="S6.378">The consensus sequences of hAT-1N_DP and hAT-1N_DP (nonautonomous transposons from fruit fly, D. pseudoobscura); HAT1N_DR, hAT-2n1_DR, and hAT-N19_DR (nonautonomous transposons from zebrafish); CHARLIE1A and CHESHIRE (human); hAT-N1_SP (sea urchin); ATHAT1, ATHAT7, and ATHAT10 (thale-cress); PegasusA, HATN4_AG, and hAT-2N_AG (African malaria mosquito) were reported in Repbase Update.</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S6.379">Transib TPase in Eukaryotes</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.380">Columns 1 and 2 list common and Latin names of species whose genomes contain Transib TPase sequences.</sentence>
					<sentence id="S6.381">Column 3 shows GenBank sections collecting corresponding sequences: "NR", "WGS", "EST", and "HTGS" are names of GenBank sections; "tr" stands for “Trace Archives.” Column 4 shows a range of E-values of matches between the sea urchin Transib TPase (Transib1_SP) and TPases encoded by the listed species that were detected in TBLASTN searches against corresponding sections of GenBank.</sentence>
					<sentence id="S6.382">Matches to the Transib TPase observed for Oryza sativa indica (seven sequences from Trace Archives, 10-48 &lt; E &lt; 10-13) were discarded as a <xcope id="X6.382.1"><cue type="speculation" ref="X6.382.1">likely</cue> sequencing contamination</xcope>, based on the fact that these sequences were over 80% identical to Hydra magnipapillata traces (the hydra Trace Archive dataset contains over 100 sequences matching the TPase, and hydra Transib TPase sequences are also present in the dbEST section of GenBank).</sentence>
					<sentence id="S6.383">Analogously, matches to the Transib TPase detected in the AC011430 HTGs and AADC01054609 WGS GenBank sequences, which were annotated as portions of the human genome, were discarded as products of contamination (these sequences contain 100% identical copies of the non-long terminal repeat (LTR) retrotransposon G2_DM [17] from D. melanogaster).</sentence>
				</DocumentPart>
				<DocumentPart type="TableLegend">
					<sentence id="S6.384">GC Content of Target Sites for hAT Transposons</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S6.385">The table shows that hAT transposons are inserted preferentially into GC-rich sites.</sentence>
					<sentence id="S6.386">Each of the 35-bp insertion sites corresponds to two 14-bp and 13-bp DNA fragments flanking a genomic hAT element at its 5? and 3? termini; one of the 8-bp TSDs (flanking the 3? terminus of a transposon) was excluded in each case.</sentence>
					<sentence id="S6.387">Analogously, the 15-bp insertion sites were composed of two 4-bp and 3-bp flanking fragments.</sentence>
					<sentence id="S6.388">(1) GenBank accession number U47019; (2) Repbase Update, the angrep.ref section; (3) GenBank X04705; (4) Repbase Update, the drorep.ref section; (5) Repbase Update, spurep.ref; (6)Repbase Updates, the zebrep.ref section.</sentence>
					<sentence id="S6.389">Copies of Pegasus, HATN4_AG, and HAT2N_AG were identified in the mosquito A. gambiae genome; Hobo and hAT-1N_DP in the D. melanogaster and D. pseudoobscura fruit fly genomes, respectively; HAT-1N_SP in the sea urchin genome; and HAT1N_DR, HAT-2N1_DR, and HAT-N19_DR in the zebrafish genome.</sentence>
				</DocumentPart>
		</Document>
		<Document type="Biological_full_article">
			<DocID type="PMCID">PMC1135298</DocID>
				<DocumentPart type="Title">
					<sentence id="S7.1">A Role for Adenosine Deaminase in Drosophila Larval Development</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S7.2">Abstract</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.3">Adenosine deaminase (ADA) is an enzyme present in all organisms that catalyzes the irreversible deamination of adenosine and deoxyadenosine to inosine and deoxyinosine.</sentence>
					<sentence id="S7.4">Both adenosine and deoxyadenosine are biologically active purines that can have a deep impact on cellular physiology; notably, ADA deficiency in humans causes severe combined immunodeficiency.</sentence>
					<sentence id="S7.5">We have established a Drosophila model to study the effects of altered adenosine levels in vivo by genetic elimination of adenosine deaminase-related growth factor-A (ADGF-A), which has ADA activity and is expressed in the gut and hematopoietic organ.</sentence>
					<sentence id="S7.6">Here we show that the hemocytes (blood cells) are the main regulator of adenosine in the Drosophila larva, as was speculated previously for mammals.</sentence>
					<sentence id="S7.7">The elevated level of adenosine in the hemolymph due to <xcope id="X7.7.2"><cue type="negation" ref="X7.7.2">lack</cue> of ADGF-A</xcope> leads to <xcope id="X7.7.1"><cue type="speculation" ref="X7.7.1">apparently</cue> inconsistent phenotypic effects</xcope>: precocious metamorphic changes including differentiation of macrophage-like cells and fat body disintegration on one hand, and delay of development with block of pupariation on the other.</sentence>
					<sentence id="S7.8"><xcope id="X7.8.2">The block of pupariation <cue type="speculation" ref="X7.8.2">appears</cue> to involve signaling through the adenosine receptor (AdoR)</xcope>, but <xcope id="X7.8.1">fat body disintegration, which is promoted by action of the hemocytes, <cue type="speculation" ref="X7.8.1">seems</cue> to be independent of the AdoR</xcope>.</sentence>
					<sentence id="S7.9"><xcope id="X7.9.1">The existence of such an independent mechanism has also been <cue type="speculation" ref="X7.9.1">suggested</cue> in mammals</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S7.10">Introduction</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.11">Adenosine deaminase (ADA) is an enzyme present in all organisms that catalyzes the irreversible deamination of adenosine and deoxyadenosine to inosine and deoxyinosine.</sentence>
					<sentence id="S7.12">It is a critically important enzyme for human survival because its congenital absence causes severe combined immunodeficiency disease (SCID).</sentence>
					<sentence id="S7.13">ADA deficiency accounts for about 20% of all types of SCID [1].</sentence>
					<sentence id="S7.14">It is one of the most severe human immunodeficiencies and is associated with depletion of all three major categories of lymphocytes: T cells, B cells, and natural killer cells, resulting in impaired cellular immunity and decreased production of immunoglobulins [2].</sentence>
					<sentence id="S7.15"><xcope id="X7.15.1"><cue type="negation" ref="X7.15.1">Without</cue> intervention</xcope>, the affected individuals die from opportunistic infections within the first few months of life.</sentence>
					<sentence id="S7.16">ADA occurs as a soluble monomer in all human cells, but also exists as “ecto-ADA,” bound to the membrane glycoprotein CD26/dipeptidyl peptidase IV, and it has been <xcope id="X7.16.1"><cue type="speculation" ref="X7.16.1">suggested</cue> that this form of ADA regulates extracellular adenosine levels</xcope> [3].</sentence>
					<sentence id="S7.17">ADA deficiency is accompanied by greatly elevated levels of the ADA substrates adenosine and deoxyadenosine, both of which are biologically active purines that can have a deep impact on cellular physiology.</sentence>
					<sentence id="S7.18">Adenosine is not just a metabolite; it is also a signaling molecule that regulates numerous cellular functions by binding to G protein-coupled adenosine receptors (A1, A2a, A2b, and A3 in mammals) that can regulate intracellular cyclic adenosine monophosphate [4].</sentence>
					<sentence id="S7.19">Deoxyadenosine is a cytotoxic metabolite released by various cell populations that undergo programmed cell death; it can kill cells through a mechanism that includes disturbances in deoxynucleotide metabolism [5].</sentence>
					<sentence id="S7.20">Extracellular adenosine is now considered an important stress hormone that is released in excessive amounts in the vicinity of immune cells during both systemic and cellular stress [6].</sentence>
					<sentence id="S7.21">The predominant source of extracellular adenosine during systemic activation of the stress system is the sympathetic nervous system [7].</sentence>
					<sentence id="S7.22">Specific inflammatory stimuli such as bacterial products are also capable of triggering adenosine release from immune cells [8].</sentence>
					<sentence id="S7.23">These data are in line with evidence demonstrating a dramatic increase in extracellular adenosine levels under conditions associated with multiple organ failure, which is the cause of 50%–80% of all deaths in surgical intensive care units [6].</sentence>
					<sentence id="S7.24">ADA is <xcope id="X7.24.1"><cue type="negation" ref="X7.24.1">not</cue> the only adenosine deaminase in mammalian cells</xcope>.</sentence>
					<sentence id="S7.25">Recently, the cat eye syndrome critical region protein 1 (CECR1) gene was identified and shown to encode a protein representing a subfamily of proteins related to but distinct from classical ADAs [9].</sentence>
					<sentence id="S7.26">The duplication of a small region of chromosome 22 containing this gene is associated with “cat eye syndrome,” a disorder characterized by hypoplastic kidneys, congenital heart malformation, and anomalous pulmonary venous connections.</sentence>
					<sentence id="S7.27">The founding member of this subfamily is encoded by insect-derived growth factor(IDGF) [10], and homologs have been described in various organisms [11–14].</sentence>
					<sentence id="S7.28">We have previously found six Drosophila genes with sequence similarity to the CECR1 subfamily [15].</sentence>
					<sentence id="S7.29">Their products are mitogenic on Drosophila cells, and at least two of them (ADGF-A and ADGF-D) exhibit strong ADA activity, which is necessary for their mitogenic function.</sentence>
					<sentence id="S7.30">We therefore named them adenosine deaminase-related growth factors (ADGFs).</sentence>
					<sentence id="S7.31">We also demonstrated that adenosine functions as a negative signal for cell proliferation and concluded that ADGFs stimulate cell growth in vitro by depletion of extracellular adenosine [16].</sentence>
					<sentence id="S7.32">Drosophila also contains a gene, termed Ada, with sequence similarity to human ADA, but as we have previously shown the product of this gene is most <xcope id="X7.32.2"><cue type="speculation" ref="X7.32.2">likely</cue> <xcope id="X7.32.1"><cue type="negation" ref="X7.32.1">not</cue> an active ADA</xcope></xcope> [16].</sentence>
					<sentence id="S7.33">In this report we show that a null mutation in Drosophila ADGF-A gene leads to dramatically increased levels of adenosine and deoxyadenosine in the larval hemolymph.</sentence>
					<sentence id="S7.34">This increase leads to larval death associated with the disintegration of fat body and the development of melanotic tumors.</sentence>
					<sentence id="S7.35">We present a detailed analysis of the hematopoietic defects associated with the adgf-a mutation, show a genetic interaction of this mutation with signaling through the Drosophila adenosine receptor (AdoR, encoded by the gene CG9753) and with regulation of premetamorphic changes by ecdysone, as well as a genetic interaction of ADGF-A with a major innate immunity regulator—the Toll signaling pathway.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S7.36">Results</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.37">Mutation in the ADGF-A Gene Causes Larval Death and Melanotic Tumors</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.38">We produced mutations in five of the six ADGF genes by homologous recombination mutagenesis [17] and showed that loss of the most abundantly expressed gene, ADGF-A, leads to death in the larval or pupal stage.</sentence>
					<sentence id="S7.39">Under optimal conditions (20–30 isolated homozygous larvae per vial), about 60% of larvae homozygous for the adgf-a mutation reach the third instar.</sentence>
					<sentence id="S7.40">Development during the third larval instar is significantly delayed, and wandering homozygous larvae usually appear 2 d after their heterozygous siblings, which start wandering at about 5 d of development.</sentence>
					<sentence id="S7.41">Some homozygous third-instar larvae can be found alive in the vial even after 10 d of development.</sentence>
					<sentence id="S7.42">Mutant third-instar larvae show fat body disintegration (Figure 1A and 1B) and multiple melanotic tumors (Figure 1C), predominantly in the caudal part of the body and accompanied by disintegration of the fat body.</sentence>
					<sentence id="S7.43"><xcope id="X7.43.1">Melanization of the lymph glands was <cue type="negation" ref="X7.43.1">never</cue> observed in these larvae</xcope>, and the imaginal discs and brain appear normal.</sentence>
					<sentence id="S7.44">Less than 30% of homozygotes eventually pupate.</sentence>
					<sentence id="S7.45">Homozygous pupae usually die soon after pupariation; in some cases they develop normal head and thorax imaginal structures; however, abdominal parts usually do <xcope id="X7.45.1"><cue type="negation" ref="X7.45.1">not</cue> develop</xcope>.</sentence>
					<sentence id="S7.46">There is also an abnormal curvature (to the right) of the pupal abdomen (Figure 1D).</sentence>
					<sentence id="S7.47">Less than 2% of mutant pupae develop normally and eventually emerge as adults <xcope id="X7.47.1"><cue type="negation" ref="X7.47.1">without</cue> any obvious abnormalities besides the abdominal curvature</xcope>; some of them are sterile.</sentence>
					<sentence id="S7.48"><xcope id="X7.48.1"><cue type="speculation" ref="X7.48.1">To confirm</cue> that the mutant phenotype is caused solely by a mutation in the ADGF-A gene</xcope>, we created transgenic flies carrying the ADGF-A gene under a heat-shock promoter (HS-ADGF-A).</sentence>
					<sentence id="S7.49">The adgf-a homozygous flies carrying the HS-ADGF-A construct showed survival rates significantly higher than adgf-a even <xcope id="X7.49.2"><cue type="negation" ref="X7.49.2">without</cue> heat shock</xcope>, <xcope id="X7.49.1"><cue type="speculation" ref="X7.49.1">probably</cue> due to leaky expression of the HS-ADGF-A construct</xcope> (Figure 2A).</sentence>
					<sentence id="S7.50">However, while non-heat shocked animals still produced many melanotic tumors, only 22% of animals that were heat shocked as late embryos/early first instar developed these tumors (Figure 2B).</sentence>
					<sentence id="S7.51">This result confirms that the mutant phenotype is caused by the mutation in the ADGF-A gene.</sentence>
					<sentence id="S7.52">This conclusion is further supported by the even more efficient rescue achieved by expression of transgenically provided ADGF-A in the lymph glands using the Gal4/UAS system (see below).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S7.53">adgf-a Mutant Phenotype</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.54">(A and B) Fat body disintegration visualized by GFP expression driven by Cg-Gal4 driver in the fat body.</sentence>
					<sentence id="S7.55">While adgf-a/+ heterozygous third instar larvae have normal flat layers of fat body (A),adgf-a mutant showed extensive fat body disintegration into small pieces of tissue (B).</sentence>
					<sentence id="S7.56">(C) Multiple melanotic tumors present in adgf-a mutant third-instar larva.</sentence>
					<sentence id="S7.57">(D) An adgf-a mutant pupa with typical abdominal curvature.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S7.58">Rescue of the adgf-a Mutant Phenotype by Expression of ADGF-A in Different Tissues</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.59">(A) Percentage of pupae (blue bars) and adult flies (purple bars) demonstrating the larval and pupal survival, respectively, of the adgf-a mutant flies rescued by expression of transgenic ADGF-A in different tissues.</sentence>
					<sentence id="S7.60">Along the x-axis (which is shared with [B]), the rescue experiments are shown (marked by the Gal4 driver used for expression of ADGF-A except for first three sets of bars—the first set presents only an adgf-a mutant, the second an adgf-a mutant carrying HS-ADGF-A construct <xcope id="X7.60.1"><cue type="negation" ref="X7.60.1">without</cue> heat shock</xcope>, and the third with heat shock) and the y-axis represents percentage of pupae and adult flies out of the total number of transferred first-instar larvae of particular genotype.</sentence>
					<sentence id="S7.61">Each experiment was repeated at least four times (with 20–30 animals in each vial) and the standard error is shown.</sentence>
					<sentence id="S7.62">(B) Percentage of late third-instar larvae with melanotic tumors.</sentence>
					<sentence id="S7.63">The x-axis is shared with (A) (described above).</sentence>
					<sentence id="S7.64">The y-axis shows the percentage of larvae with tumors out of all larvae of each genotype examined for (A).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.65">The adgf-a Mutant Phenotype Is Associated with Elevated Levels of Adenosine and/or Deoxyadenosine</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.66">Using liquid chromatography and mass spectrometry of deproteinated hemolymph samples, we measured adenosine concentrations in hemolymph of mutant and wild-type third-instar larvae.</sentence>
					<sentence id="S7.67">The adenosine concentration in the adgf-a mutant was 1.14 ± 0.26 ?M compared to less than 0.08 ?M in the wild type, and the deoxyadenosine concentration in mutants was 1.66 ± 0.99 ?M compared to an undetectable level in the wild type.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.68">The Catalytic Activity of ADGF-A Is Required for Its Function</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.69">To test <xcope id="X7.69.1"><cue type="speculation" ref="X7.69.1">whether</cue> the function of ADGF-A in vivo is also dependent on its catalytic activity</xcope>, we produced two versions of the UAS-ADGF-A construct [18]: one carrying wild-type cDNA of ADGF-A and one carrying an ADGF-A cDNA with a mutation causing a substitution of two amino acids (H386G and A387E) in the catalytic domain [16].</sentence>
					<sentence id="S7.70">Two different lines carrying the wild-type UAS-ADGF-A expression construct together with an Actin-Gal4 driver (providing ubiquitous expression) both completely rescued the mutant phenotype, whereas larvae with UAS-ADGF-A but <xcope id="X7.70.1"><cue type="negation" ref="X7.70.1">without</cue> the driver</xcope> showed the typical mutant phenotype.</sentence>
					<sentence id="S7.71">However, <xcope id="X7.71.1"><cue type="negation" ref="X7.71.1">neither</cue> of the two lines carrying the mutated version of the UAS-ADGF-A (producing full-length protein detected by anti-myc antibody; see Materials and Methods) showed any rescue of the mutant phenotype</xcope>.</sentence>
					<sentence id="S7.72">This result therefore demonstrates that the catalytic activity of ADGF-A is required for its function in vivo.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.73">Hemocyte Development Is Affected in the adgf-a Mutant</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.74">We investigated the number and morphology of hemocytes (blood cells) in the hemolymph of the adgf-a late third-instar larvae (Figures 3 and 4).</sentence>
					<sentence id="S7.75">These larvae contain an average of seven-fold more hemocytes in circulation than wild-type larvae (Figure 3).</sentence>
					<sentence id="S7.76">In contrast to normal larval plasmatocytes, which remain rounded after settling down on the substrate (Figure 4A), most of the cells in the adgf-a mutant (more than 75%) are strongly adhesive and, after they are deposited in a drop of hemolymph on a microscope slide, develop filamentous and membranous extensions (Figure 4B–4D).</sentence>
					<sentence id="S7.77">An average of 7% of hemocytes in the adgf-a mutant are lamellocytes (Figures 3 and 4E), large flat cells that are <xcope id="X7.77.1"><cue type="negation" ref="X7.77.1">not</cue> present in circulation of wild-type larvae under normal conditions</xcope> [19].</sentence>
					<sentence id="S7.78">Crystal cells were also detected in excess, with mutant larvae carrying several hundred while there are fewer than a hundred of these cells in the wild type (Figure 5).</sentence>
					<sentence id="S7.79">The lymph glands normally do <xcope id="X7.79.1"><cue type="negation" ref="X7.79.1">not</cue> release hemocytes into the hemolymph before metamorphosis</xcope> [20]; instead, they are released during metamorphosis when the lymph glands disperse [19].</sentence>
					<sentence id="S7.80">However, the lymph glands of adgf-a mutant larvae are already dispersed in the late third instar.</sentence>
					<sentence id="S7.81">This process is similar to normal metamorphic changes, in which the hemocytes are first released from the front lobes, and the posterior lobes disperse later.</sentence>
					<sentence id="S7.82">To analyze hemocytes in living larvae, we used the Hemolectin marker (Hml) [21].</sentence>
					<sentence id="S7.83">We compared the number and distribution of hemocytes stained by GFP in flies carrying hml-Gal4 UAS-GFP in wild-type and mutant backgrounds.</sentence>
					<sentence id="S7.84">While there are relatively few hemocytes, mostly free-floating in the hemolymph, in early third-instar wild-type larvae (see Figure 4I), a much higher number of hemocytes, which are mostly attached to the tissues under the integument (described as sessile hemocytes in [19]), was observed in mutant larvae (see Figure 4J).</sentence>
					<sentence id="S7.85">A similar behavior was detected later in wild-type larvae, toward the end of the third instar (see Figure 4H).</sentence>
					<sentence id="S7.86">At this stage, the Hml marker disappeared from the most of the hemocytes in mutants (see Figure 4F and 4G).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S7.87">Number of Circulating Hemocytes in Late Third-Instar Larvae</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.88">Genotypes are shown along the x-axis, and the number of hemocytes/larva along the y-axis.</sentence>
					<sentence id="S7.89">Each bar shows the number of all circulating hemocytes, and the gray part of the bars represent the lamellocyte population.</sentence>
					<sentence id="S7.90">Each count was repeated five to ten times and the standard error is shown.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S7.91">Hemocyte Abnormalities in adgf-a Mutant Larvae</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.92">(A–E) Differential interference contrast microscopy of living circulating hemocytes (magnification 200×; scale bar, 10 ?m).</sentence>
					<sentence id="S7.93">Round, nonadhesive plasmatocytes from wild-type larva (A).</sentence>
					<sentence id="S7.94">Hemocytes from the adgf-a mutant developing filamentous extensions (B and C) or membranous extension surrounding the cell (D).</sentence>
					<sentence id="S7.95">Large flat lamellocyte from the adgf-a mutant (E).</sentence>
					<sentence id="S7.96">(F and G) Differential interference contrast and fluorescent microscopy (merged image) of living circulating hemocytes stained by the Hml-GFP marker (magnification 100×; scale bar, 10 ?m).</sentence>
					<sentence id="S7.97">While most of the cells from wild-type larvae are GFP-positive (F), just few of the cells from late third instar adgf-a larvae are stained by GFP at this stage (G).</sentence>
					<sentence id="S7.98">(H–J) Fluorescence microscopy of living larvae with Hml-GFP stained hemocytes (magnification 40×; scale bar, 100 ?m).</sentence>
					<sentence id="S7.99">Posterior part of late third-instar wild-type larva (H).</sentence>
					<sentence id="S7.100">Middle sections of early third-instar larvae of wild type (I) and adgf-a mutant (J).</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S7.101">Crystal Cells in Late Third Instar Larvae</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.102">Crystal cells were visualized by heating larvae of different genotypes at 60 °C for 10 min [46].</sentence>
					<sentence id="S7.103">(A) Wild-type larva, (B) adgf-a single mutant, (C) adoR adgf-a double mutant (scale bar, 0.5 mm).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.104">The adgf-a Mutant Phenotype Is Rescued by Expression of ADGF-A in the Lymph Glands</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.105">To distinguish which tissues require ADGF-A expression for proper development, we <xcope id="X7.105.1"><cue type="speculation" ref="X7.105.1">tested</cue> for rescue of adgf-a lethality by expressing ADGF-A in specific subsets of larval tissues</xcope>.</sentence>
					<sentence id="S7.106">A transgenic line carrying the UAS-ADGF-A construct on Chromosome II was crossed to lines expressing the Gal4 driver [18] in different tissues (Table 1).</sentence>
					<sentence id="S7.107">Since ADGF-A is normally expressed in the larval lymph glands [16], and the mutant phenotype is characterized by abnormal hemocyte development, special consideration was given to lines expressing the Gal4 driver in the lymph glands and/or circulating hemocytes.</sentence>
					<sentence id="S7.108"><xcope id="X7.108.1"><cue type="negation" ref="X7.108.1">No</cue> line expressing the Gal4 driver exclusively in the lymph glands has been reported</xcope>, so we used a combination of lines sharing in common the feature of Gal4 driver expression in the lymph glands.</sentence>
					<sentence id="S7.109">The results (see Figure 2 and Table 1) clearly demonstrate that expression of ADGF-A in the lymph glands (driven by Cg-Gal4,e33C-Gal4, or c564-Gal4), but <xcope id="X7.109.1"><cue type="negation" ref="X7.109.1">not</cue> in any other tissue examined</xcope>, is necessary and sufficient to fully rescue the adgf-a lethality.</sentence>
					<sentence id="S7.110">In e33C-Gal4/UAS-ADGF-A, strong expression of ADGF-A in all lobes of developing lymph glands (but not in circulating hemocytes) reduces the number of hemocytes in the hemolymph to almost normal levels (see Figure 3).</sentence>
					<sentence id="S7.111">The number of hemocytes is also reduced, but to a lesser extent in larvae rescued by Cg-Gal4/UAS-ADGF-A.</sentence>
					<sentence id="S7.112">However, when assayed by survival rate and melanotic tumor formation, the rescue by Cg-Gal4 is full and similar to that of e33C-Gal4 (see Figure 2).</sentence>
					<sentence id="S7.113"><xcope id="X7.113.1">The difference in effectiveness <cue type="speculation" ref="X7.113.1">may</cue> be explained by the different expression patterns of the drivers</xcope>.</sentence>
					<sentence id="S7.114">Cg-Gal4 is expressed only in certain compartments of lymph gland lobes containing relatively mature hemocytes, and strongly in most circulating hemocytes [22, 23].</sentence>
					<sentence id="S7.115">The C564-Gal4 driver is <xcope id="X7.115.1"><cue type="negation" ref="X7.115.1">not</cue> expressed as strongly as e33C-Gal4</xcope>, but is still uniformly expressed in the lymph glands; it also fully rescued the mutant phenotype.</sentence>
					<sentence id="S7.116">We have tried two different insertions of the Dot-Gal4 construct.</sentence>
					<sentence id="S7.117">The Dot-Gal411C on Chromosome II, which shows weak expression [24], did <xcope id="X7.117.1"><cue type="negation" ref="X7.117.1">not</cue> rescue the phenotype</xcope>, but a Dot-Gal443A insertion on Chromosome X, which shows stronger expression, rescued approximately half of the mutant animals (Figure 2).</sentence>
					<sentence id="S7.118">Nearly all rescued individuals were males, <xcope id="X7.118.2"><cue type="speculation" ref="X7.118.2">suggesting</cue> that expression of the Gal4 driver was influenced by X-chromosome dosage compensation, and expression in females heterozygous</xcope> for <xcope id="X7.118.1">Dot-Gal4 was <cue type="negation" ref="X7.118.1">not</cue> strong enough for rescue</xcope>.</sentence>
					<sentence id="S7.119"><xcope id="X7.119.2">Expression of ADGF-A in salivary glands and fat body (as well as in other tissues) is <cue type="negation" ref="X7.119.2">not</cue> required for full rescue</xcope>, as demonstrated by use of the Cg-Gal4,Dot-Gal4, but especially by e33C-Gal4 driver, and is also <xcope id="X7.119.1"><cue type="negation" ref="X7.119.1">not</cue> sufficient to rescue the phenotype at all</xcope>, as demonstrated by T110-Gal4 and Lsp2-Gal4 (Table 1).</sentence>
					<sentence id="S7.120">Since ADGF-A is strongly expressed in embryonic mesoderm [16], we have tried to rescue the phenotype by the expression of ADGF-A in embryonic and larval muscle cells using the Dmef2-Gal4 driver [25].</sentence>
					<sentence id="S7.121"><xcope id="X7.121.1"><cue type="negation" ref="X7.121.1">No</cue> rescue of the phenotype, including body shape of escaping pupae, was observed</xcope>.</sentence>
					<sentence id="S7.122">The only line showing significant (but <xcope id="X7.122.2"><cue type="negation" ref="X7.122.2">not</cue> complete)</xcope> rescue of adgf-a survival <xcope id="X7.122.1"><cue type="negation" ref="X7.122.1">without</cue> expression in the lymph glands</xcope> was GawB5015 (see Figure 2), which expresses the Gal4 driver very strongly and specifically in the ring gland and salivary glands (as well as very weak and spotty expression in imaginal discs [unpublished data]).</sentence>
					<sentence id="S7.123">However, expression of ADGF-A driven by GawB5015 does <xcope id="X7.123.1"><cue type="negation" ref="X7.123.1">not</cue> prevent the formation of melanotic tumors</xcope> (see Figure 2B).</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.124">Ablation of Hemocytes in Mutant Larvae Reduces Fat Body Disintegration and Melanotic Tumor Formation</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.125">The l(3)hematopoiesis missing(l[3]hem) mutation reduces cell division in larval proliferating tissues and thus dramatically reduces the number of hemocytes in larvae.</sentence>
					<sentence id="S7.126">It also suppresses the hemocyte overproliferation and associated defects observed in the hopscotchTumorous-lethal mutant [26].</sentence>
					<sentence id="S7.127">We therefore used the l(3)hem1 mutation to test <xcope id="X7.127.1"><cue type="speculation" ref="X7.127.1">whether</cue> the reduction of hemocyte number in the adgf-a mutant affects the phenotype</xcope>.</sentence>
					<sentence id="S7.128">We recombined this mutation onto the chromosome containing the adgf-a mutation and found that in homozygous l(3)hem1,adgf-a double mutants the number of hemocytes is significantly reduced compared to the adgf-a single mutants (see Figure 3).</sentence>
					<sentence id="S7.129">Furthermore, while 90% of adgf-a mutant larvae showed disintegration of fat body, only 40% of l(3)hem1,adgf-a double mutants (total number of counted animals was 82) show the disintegration (Figure 6A).</sentence>
					<sentence id="S7.130">Similarly, melanotic tumor formation is significantly suppressed by l(3)hem1, with only 55% of double mutants showing melanotic tumors compared to more than 83% in adgf-a (Figure 6A).</sentence>
					<sentence id="S7.131">However, <xcope id="X7.131.1">the delay in development and block of pupariation (Figure 6B), as well as the pupal body shape, were <cue type="negation" ref="X7.131.1">not</cue> influenced by this mutation</xcope>.</sentence>
					<sentence id="S7.132">This shows that the effect on hemocyte development is related to only one other aspect of the adgf-a phenotype—namely, fat body disintegration—and the developmental arrest of adgf-a mutants is probably independent of this process.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S7.133">   Suppression of the adgf-a Mutant Phenotype by Mutations in Other Genes</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.134">(A) Percentage of late third-instar larvae with melanotic tumors (black bars) and fat body disintegration (green bars).</sentence>
					<sentence id="S7.135">The x-axis (which is shared with [B]), shows the genotype.</sentence>
					<sentence id="S7.136">The y-axis shows the percentage of larvae with tumors and fat body disintegration.</sentence>
					<sentence id="S7.137">(B) Survival rate of double mutants compared to single adgf-a mutant.</sentence>
					<sentence id="S7.138">The y-axis shows the percentage of the pupae (blue bars) and adult flies (purple bars) demonstrating the larval and pupal survival, respectively.</sentence>
					<sentence id="S7.139">Each experiment was repeated at least four times (with 20–30 animals in each vial) and the standard error is shown.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.140">Block in Activation of Macrophages Suppresses Disintegration of Fat Body</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.141">Previous results <xcope id="X7.141.2"><cue type="speculation" ref="X7.141.2">suggest</cue> that <xcope id="X7.141.1">fat body disintegration <cue type="speculation" ref="X7.141.1">might</cue> be caused by the action of hemocytes</xcope></xcope>.</sentence>
					<sentence id="S7.142">Embryonic macrophages express the scavenger receptor encoded by croquemort(crq), which allows them to bind and remove apoptotic corpses [27].</sentence>
					<sentence id="S7.143">We therefore tested <xcope id="X7.143.3"><cue type="speculation" ref="X7.143.3">whether</cue> <xcope id="X7.143.2">a mutation in the crq gene <cue type="speculation" ref="X7.143.2">would</cue> block the <xcope id="X7.143.1"><cue type="speculation" ref="X7.143.1">suggested</cue> interaction between hemocytes and fat body</xcope> in adgf-a mutant larvae</xcope></xcope>.</sentence>
					<sentence id="S7.144">We used the mutation crqKG01679, caused by a P-element insertion in the first untranslated exon of crq, which leads to pupal lethality.</sentence>
					<sentence id="S7.145"><xcope id="X7.145.2">The number of crystal cells was <cue type="negation" ref="X7.145.2">not</cue> increased</xcope> and <xcope id="X7.145.1">lamellocytes were <cue type="negation" ref="X7.145.1">not</cue> detected in crq, adgf-a double mutants</xcope> (see Figure 3).</sentence>
					<sentence id="S7.146">The double mutants showed a lower number of circulating hemocytes than the single mutant, but there was still a significant increase in this number compared to wild type (see Figure 3), and the cells showed increased clumping.</sentence>
					<sentence id="S7.147"><xcope id="X7.147.1"><cue type="negation" ref="X7.147.1">None</cue> of the double-mutant larvae showed either disintegration of fat body or melanotic tumor formation</xcope> (Figure 6A).</sentence>
					<sentence id="S7.148">Even the adgf-a mutant larvae heterozygous for the crq mutation (crq/CyO GFP; adgf-a/adgf-a) showed significant suppression of the fat body disintegration, with most of the tissue staying compact in bigger pieces and <xcope id="X7.148.1"><cue type="negation" ref="X7.148.1">never</cue> disintegrating to single adipose cells</xcope>; melanotic tumors were rarely observed.</sentence>
					<sentence id="S7.149">This shows that the block of the <xcope id="X7.149.2"><cue type="speculation" ref="X7.149.2">putative</cue> interaction between fat body and macrophage-like cells (which are still present in double mutants)</xcope> suppresses the fat body disintegration, further strengthening the <xcope id="X7.149.1"><cue type="speculation" ref="X7.149.1">hypothesis</cue> that the disintegration is caused by hemocytes</xcope>.</sentence>
					<sentence id="S7.150">In addition, the absence of lamellocytes and the normal number of crystal cells in the double mutant strongly <xcope id="X7.150.2"><cue type="speculation" ref="X7.150.2">suggest</cue> that the differentiation of these cells and thus melanotic tumor formation is a secondary reaction to fat body disintegration, <xcope id="X7.150.1"><cue type="negation" ref="X7.150.1">rather than</cue> a primary effect of the adgf-a mutation</xcope></xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.151">Mutation in a <xcope id="X7.151.1"><cue type="speculation" ref="X7.151.1">Putative</cue> Adenosine Receptor</xcope> Suppresses the Block of Pupariation in adgf-a</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.152">We have identified a putative homolog of the mammalian adenosine receptor family in the Drosophila genome, AdoR, and produced a null mutation in this gene using homologous recombination (adoR; ED, unpublished data).</sentence>
					<sentence id="S7.153">The adoR mutants are fully viable.</sentence>
					<sentence id="S7.154">We used this mutant to test the <xcope id="X7.154.1"><cue type="speculation" ref="X7.154.1">hypothesis</cue> that the increased level of adenosine in the adgf-a mutant contributes to the mutant phenotype by its effect on signaling through the adenosine receptor</xcope>.</sentence>
					<sentence id="S7.155">The results show that introducing the adoR mutation into the adgf-a background significantly increases pupariation, as well as adult emerging rate, compared to the adgf-a single mutant (Figure 6B).</sentence>
					<sentence id="S7.156">When the earlier lethality was avoided by picking up larvae after molt to the third instar, the pupariation rate of adoR, adgf-a double mutant was comparable to wild type as well as to the single adgf-a mutant treated with ecdysone (Figure 7A).</sentence>
					<sentence id="S7.157">Development during the third instar is much less delayed in the double mutant, with most of the larvae pupariating within 1 d after their heterozygous siblings (Figure 7A).</sentence>
					<sentence id="S7.158">The adoR mutation also significantly reduced melanotic tumor formation in the adgf-a mutant (see Figure 6A), but disintegration of the fat body appeared at the same rate as in the single mutant (see Figure 6A).</sentence>
					<sentence id="S7.159">While the number of macrophage-like cells in circulation is not significantly changed in the double mutant, the number of lamellocytes is decreased (see Figure 3), but the number of crystal cells is normal (see Figure 5A and 5C).</sentence>
					<sentence id="S7.160">These results demonstrate that adenosine signaling through the adenosine receptor is involved in the developmental arrest of adgf-a mutant, but that it does <xcope id="X7.160.1"><cue type="negation" ref="X7.160.1">not</cue> play a role in fat body disintegration and macrophage differentiation</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S7.161">Ecdysone Regulation of Development in adgf-a</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.162">(A) Larvae of different genotypes were collected after L2/L3 molt, and the number of puparia was counted at different time points (x-axis: hours after egg laying).</sentence>
					<sentence id="S7.163">The y-axis shows the percentage of puparia out of all collected third-instar larvae (three vials each with 30 animals; the standard error is shown).</sentence>
					<sentence id="S7.164">(B and C) Ring gland morphology in arrested adgf-a larvae.</sentence>
					<sentence id="S7.165">Approximately 8-d old mutant larva (i.e., 3 d after normal pupariation) with very extensive fat body disintegration (note the transparency of larva in the middle part with small white pieces of fat body) (B).</sentence>
					<sentence id="S7.166">The ring gland dissected from this larva (C) shows morphology of the normal ring gland before the degenerative changes of prothoracic gland starts (compare to schematic diagram to the left of [C], from [28]).</sentence>
					<sentence id="S7.167">(D–F) Expression of GFP-marked glue protein (Sgs?3-GFP) in salivary gland of the adgf-a mutant larvae and pupae.</sentence>
					<sentence id="S7.168">All late third-instar larvae express the glue protein as shown on dissected salivary gland (D).</sentence>
					<sentence id="S7.169">Some mutants show typical expulsion from the glands with GFP totally external to the puparial case (E), while others do <xcope id="X7.169.1"><cue type="negation" ref="X7.169.1">not</cue> expel glue proteins even after puparium formation (F)</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.170">Hormonal Regulation in the adgf-a Mutant</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.171"><xcope id="X7.171.1">The delayed development and low pupariation rate in the adgf-a mutant larvae (see Figures 2A and 7A) <cue type="speculation" ref="X7.171.1">could</cue> be caused by an effect on hormonal regulation of development</xcope>.</sentence>
					<sentence id="S7.172">The main source of developmental hormones in the Drosophila larva is the ring gland, composed of the prothoracic gland, corpus allatum, and corpus cardiacum [28].</sentence>
					<sentence id="S7.173">The prothoracic gland releases the steroid molting hormone ecdysone, which is converted to an active form, 20-hydroxyecdysone (20E), by the fat body as well as some of the target organs [29].</sentence>
					<sentence id="S7.174">The block of pupariation in the adgf-a mutant <xcope id="X7.174.3"><cue type="speculation" ref="X7.174.3">suggested</cue> that the level of ecdysone in these larvae <xcope id="X7.174.2"><cue type="speculation" ref="X7.174.2">might</cue> <xcope id="X7.174.1"><cue type="negation" ref="X7.174.1">not</cue> be sufficient to initiate pupariation</xcope></xcope></xcope>.</sentence>
					<sentence id="S7.175">To test this possibility, we tried to rescue the phenotype by feeding mutant larvae 20E, which can initiate pupariation in the ecd1 mutant, which has an extremely low level of ecdysone [30,31].</sentence>
					<sentence id="S7.176">The results (Figure 7A) clearly demonstrate that the adgf-a mutant larvae are responsive to ecdysone and that this treatment restores the pupariation frequency to almost wild-type level.</sentence>
					<sentence id="S7.177">The delay in development is also significantly reduced (Figure 7A).</sentence>
					<sentence id="S7.178">Since the adgf-a mutant shows certain precocious metamorphic changes (macrophage differentiation and fat body disintegration), we <xcope id="X7.178.2"><cue type="speculation" ref="X7.178.2">speculated</cue> that <xcope id="X7.178.1">a reduced ecdysteroid level <cue type="speculation" ref="X7.178.1">could</cue> be caused by precocious degeneration of the prothoracic part of the ring gland</xcope></xcope>.</sentence>
					<sentence id="S7.179">However, <xcope id="X7.179.1">the overall structure of the ring gland is <cue type="negation" ref="X7.179.1">not</cue> visibly affected even in the oldest larvae (10 d, i.e., 5 d after the heterozygous siblings pupariated) with a fully disintegrated fat body</xcope> (Figure 7B and 7C).</sentence>
					<sentence id="S7.180">We also used a transgenic line carrying the Sgs?3-GFP construct, which was previously used to monitor the effects of ecdysteroid levels on glue protein expression in salivary glands [32].</sentence>
					<sentence id="S7.181">All analyzed adgf-a mutant larvae carrying the Sgs?3-GFP construct showed normal expression of Sgs-GFP in salivary glands (Figure 7D).</sentence>
					<sentence id="S7.182">Mutants that pupariated usually showed typical GFP expectoration, <xcope id="X7.182.1"><cue type="speculation" ref="X7.182.1">indicating</cue> the presence of a high premetamorphic peak of ecdysteroids</xcope> (Figure 7E).</sentence>
					<sentence id="S7.183">In some cases, GFP was secreted into the lumen of salivary glands, but <xcope id="X7.183.1">was <cue type="negation" ref="X7.183.1">not</cue> expectorated</xcope> (Figure 7F), which is similar to the defect seen in animals expressing the dominant-negative form of ecdysone receptor driven by the Sgs3-Gal4 driver [33].</sentence>
					<sentence id="S7.184">These results demonstrate that the target tissues of adgf-a mutants are normally responsive to ecdysteroids and that they are <xcope id="X7.184.2"><cue type="speculation" ref="X7.184.2">probably</cue> capable of releasing ecdysteroids</xcope>, although the level of ecdysteroids <xcope id="X7.184.1"><cue type="speculation" ref="X7.184.1">might</cue> vary</xcope>.?&gt;</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.185">ADGF-A Genetically Interacts with Toll Signaling Pathway</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.186">The antimicrobial response of Drosophila includes at least two distinct signaling pathways [34]—the Toll signaling pathway, which leads to the activation of two nuclear factor kappa B (NF-?B) factors, Dorsal-related immunity factor (DIF) and dorsal (DL); and the immune deficiency protein pathway activating the third NF-?B factor, Relish (REL).</sentence>
					<sentence id="S7.187">A zygotic null mutation in cactus (cact; a Drosophila inhibitor of NF-?B) leads to hyperproliferation of hemocytes, melanotic tumor formation, disintegration of fat body, and slower larval development, with 60% larval lethality, as well as a thin body-shape phenotype [35].</sentence>
					<sentence id="S7.188">All of these phenotypes are strikingly similar to the abnormalities seen in adgf-a mutants, which was our first clue as to a <xcope id="X7.188.1"><cue type="speculation" ref="X7.188.1">possible</cue> interaction of ADGF-A with the Toll signaling pathway</xcope>.</sentence>
					<sentence id="S7.189">We <xcope id="X7.189.1"><cue type="speculation" ref="X7.189.1">hypothesized</cue> that the activity of ADGF-A is suppressed by Toll signaling, resulting in similar phenotypes of the adgf-a mutation and constitutive activation of Toll pathway</xcope>.</sentence>
					<sentence id="S7.190">To test this hypothesis, we crossed transgenic flies carrying ADGF-A gene under the control of a heat-shock promoter on Chromosome II (HS-ADGF-A) with cactE8 (a lethal allele of cact on Chromosome II, which, in combination with cactD13, results in a zygotic null combination, or, with cactIIIG, results in zygotic hypomorphic combination).</sentence>
					<sentence id="S7.191">Overexpression of ADGF-A in animals with a hypomorphic cact combination (cactE8/cactIIIG) increased the adult survival rate almost 4-fold (Figure 8A).</sentence>
					<sentence id="S7.192">The rescue could be increased by multiple heat shocks before pupariation to 7-fold (unpublished data).</sentence>
					<sentence id="S7.193">The suppression of melanotic tumor formation is also significant (from more than 80% down to 26%, Figure 8B).</sentence>
					<sentence id="S7.194">The most severe cact null mutation (cactE8/cactD13), leading to developmental arrest in larvae (less than 8% pupate), is partially rescued in animals with overexpression of ADGF-A when the pupariation rate is increased 3-fold (Figure 8A).</sentence>
					<sentence id="S7.195">These results demonstrate that ADGF-A overexpression can partially rescue the effects of constitutively active Toll signaling in larvae, mainly the developmental arrest, but also the melanotic tumor formation, in the case of hypomorphic cact mutants.</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S7.196">Genetic Interactions of Toll Signaling and ADGF-A</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.197">Survival rate and melanotic tumor formation were compared in mutants in the Toll signaling pathway and in similar mutants with overexpression of ADGF-A using the HS-ADGF-A construct.</sentence>
					<sentence id="S7.198">(A) The bar graph shows the percentage of the pupae (blue bars) and adult flies (purple bars) demonstrating the larval and pupal survival of each genotype.</sentence>
					<sentence id="S7.199">The x-axis shows the genotypes and is shared with (B).</sentence>
					<sentence id="S7.200">Flies heterozygous for the cact mutation were used as a control.</sentence>
					<sentence id="S7.201">(B) Percentage of late third instar larvae presenting melanotic tumor formation.</sentence>
				</DocumentPart>
				<DocumentPart type="SectionTitle">
					<sentence id="S7.202">Discussion</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.203">ADA Deficiency in Drosophila Causes Abnormal Hemocyte Development, Melanotic Tumor Formation, Fat Body Degeneration, and Delayed Development</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.204">We have established an ADA deficiency model in Drosophila in order to study the effects of altered adenosine levels in vivo.</sentence>
					<sentence id="S7.205">We produced a loss-of-function mutation in the ADGF-A gene, which produces a product (ADGF-A) with ADA activity.</sentence>
					<sentence id="S7.206">When homozygous, the mutation causes abnormal hemocyte development, leading to melanotic tumor formation [36], as well as fat-body disintegration associated with death during the larval stage or delayed transition to the pupal stage of development.</sentence>
					<sentence id="S7.207">In agreement with our previous study using cells cultured in vitro [16], here we have shown that ADA enzymatic activity is essential for ADGF-A function in vivo, when this function is assayed by testing for rescue of the mutant phenotype.</sentence>
					<sentence id="S7.208">Just as increased levels of both ADA substrates, adenosine and deoxyadenosine, are found in blood of SCID patients [5], adgf-a mutant larvae also have elevated levels of adenosine and deoxyadenosine, <xcope id="X7.208.1"><cue type="speculation" ref="X7.208.1">indicating that</cue> the mutant phenotype is caused by disturbance in the turnover of these nucleosides</xcope>.</sentence>
					<sentence id="S7.209">Expression of ADGF-A only in the lymph glands is sufficient to fully rescue the mutant phenotype, <xcope id="X7.209.1"><cue type="speculation" ref="X7.209.1">indicating that</cue> the hemocytes within the lymph glands play a major role in regulation of adenosine levels in the hemolymph</xcope>.</sentence>
					<sentence id="S7.210">A similar regulatory role has also been attributed to blood cells in humans [5].</sentence>
					<sentence id="S7.211">This <xcope id="X7.211.1"><cue type="speculation" ref="X7.211.1">suggests</cue> a function for ADGF-A within the lymph gland</xcope>.</sentence>
					<sentence id="S7.212">However, ADGF-A behaves as a soluble growth factor and <xcope id="X7.212.1"><cue type="speculation" ref="X7.212.1">could</cue> be released from the lymph gland to activate targets elsewhere in the larval body</xcope>.</sentence>
					<sentence id="S7.213">Our results show that ADGF-A functions by limiting the level of extracellular adenosine, and in this way the protein <xcope id="X7.213.1"><cue type="speculation" ref="X7.213.1">could</cue> have a systemic function even if it were restricted to its tissue of origin</xcope>.</sentence>
					<sentence id="S7.214">Although our tests did <xcope id="X7.214.2"><cue type="speculation" ref="X7.214.2">not exclude</cue> a role for ADGF-A in circulating hemocytes (which constitute a separate lineage from the lymph gland hemocytes</xcope> [20]), we showed that <xcope id="X7.214.1">expression of ADGF-A in circulating hemocytes is <cue type="negation" ref="X7.214.1">not</cue> required for rescue of the adgf-a mutant phenotype</xcope>, since e33C-Gal4/UAS-ADGF-A?which expresses ADGF-A in the lymph gland but not in circulating hemocytes?fully rescued the phenotype.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.215">ADGF-A Is Involved in Hemocyte Differentiation in the Lymph Glands</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.216">Late third-instar larvae homozygous for the adgf-a mutation contain, on average, seven times more hemocytes in circulation than wild-type larvae, and most of these cells show strong adhesive properties compared to normal larval plasmatocytes, which remain rounded after settling down on the substrate.</sentence>
					<sentence id="S7.217">Although these cells share other characteristics with plasmatocytes, <xcope id="X7.217.1">they are normally <cue type="negation" ref="X7.217.1">not</cue> seen in circulation</xcope> until they are released from the lymph glands at the onset of metamorphosis under the regulation of ecdysone to serve as phagocytes for histolysing tissues during metamorphosis—thus, they are referred to as pupal macrophages [19].</sentence>
					<sentence id="S7.218">In agreement with the presence of these cells in circulation, at least the first lobes of the lymph glands are usually completely dispersed in late third-instar mutant larvae.</sentence>
					<sentence id="S7.219">This indication of precocious metamorphic changes [36] in the mutant is further supported by the finding that hemocytes aggregate in a segmental pattern in early <xcope id="X7.219.2"><cue type="negation" ref="X7.219.2">rather than</cue> late</xcope> third instar (see Figure 4H–4J), and that the hemocytes lose expression of Hemolectin in late third-instar larvae <xcope id="X7.219.1"><cue type="negation" ref="X7.219.1">rather than</cue> at the onset of metamorphosis</xcope> (see Figure 4G) [21].</sentence>
					<sentence id="S7.220">Recent studies show that <xcope id="X7.220.1">the Toll signaling pathway, which is already known to be involved in the control of innate immunity of both Drosophila and mammals [34], <cue type="speculation" ref="X7.220.1">may</cue> also be involved in the control of hemocyte differentiation in the Drosophila larva</xcope>.</sentence>
					<sentence id="S7.221">Constitutive activation of Toll signaling leads to developmental arrest and hematopoietic defects associated with melanotic tumor formation [35], similar to the phenotype of the adgf-a mutant.</sentence>
					<sentence id="S7.222">Our work also shows that forced expression of the ADGF-A gene can rescue the effects of overactive Toll signaling, <xcope id="X7.222.2"><cue type="speculation" ref="X7.222.2">suggesting</cue> that ADGF-A <xcope id="X7.222.1"><cue type="speculation" ref="X7.222.1">might</cue> function downstream of Toll signaling to control its effects</xcope></xcope>.</sentence>
					<sentence id="S7.223">This conclusion is consistent with the existence of a putative binding site for Dorsal (one of two known effectors of Toll signaling) in the ADGF-A promoter (Figure 9).</sentence>
					<sentence id="S7.224">It will be important to explore this connection further, since recent studies <xcope id="X7.224.1"><cue type="speculation" ref="X7.224.1">suggest</cue> an interaction between adenosine signaling and the NF-?B signaling pathway, which is the mammalian counterpart of the Toll pathway</xcope> [37].</sentence>
				</DocumentPart>
				<DocumentPart type="FigureLegend">
					<sentence id="S7.225">Schematic Map of the ADGF-A Gene with Promoter Analysis</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.226">The ADGF-A gene contains four exons and two transcriptional starts [17,47].</sentence>
					<sentence id="S7.227">We analyzed sequences preceding both transcriptional starts for the presence of known transcriptional factor binding sites using the software program Gene2Promoter (Genomatix Software GmbH).</sentence>
					<sentence id="S7.228">Selected sites are represented by color bars in approximate positions of promoter regions.</sentence>
					<sentence id="S7.229">The legend under the sequence show the names of transcription factors binding to matching colored binding sites.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.230">Precocious Fat-Body Disintegration Caused by Mutant Hemocytes</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.231">One of the most remarkable features of the adgf-a mutant phenotype is the disintegration of the fat body in third-instar larvae, another <xcope id="X7.231.1"><cue type="speculation" ref="X7.231.1">indication</cue> of precocious metamorphic changes</xcope> since the disintegration normally occurs much later, during pupal life.</sentence>
					<sentence id="S7.232">Furthermore, our study of this mutant provides strong evidence that the fat body disintegration is promoted by the action of hemocytes.</sentence>
					<sentence id="S7.233">Fat body disintegration was significantly suppressed when the hemocyte number was reduced using the l(3)hem1 mutation [26], and fully blocked by the croquemort (crq) mutation [27] which affects a CD36-related receptor (Croquemort) expressed on macrophages and required in phagocytosis of apoptotic cells.</sentence>
					<sentence id="S7.234">Human CD36 is a scavenger receptor which, in combination with the macrophage vitronectin receptor and thrombospondin, binds apoptotic cells.</sentence>
					<sentence id="S7.235"><xcope id="X7.235.3">A similar role of Croquemort for removing histolysing tissues during Drosophila metamorphosis has <cue type="negation" ref="X7.235.3">not</cue> yet been tested</xcope>, but <xcope id="X7.235.2"><cue type="speculation" ref="X7.235.2">seems</cue> <xcope id="X7.235.1"><cue type="speculation" ref="X7.235.1">likely</cue></xcope></xcope> since the crq mutant used in this study (crqKG01679) is lethal in pupae.</sentence>
					<sentence id="S7.236">The <xcope id="X7.236.1"><cue type="speculation" ref="X7.236.1">idea</cue> that hemocytes are involved in fat body dissociation in Drosophila</xcope> is further supported by work on the flesh fly Sarcophaga.</sentence>
					<sentence id="S7.237">Natori's group showed that proteinase cathepsin B was released from pupal hemocytes when they interacted with the fat body, and that this enzyme digested the basement membrane of the fat body, causing the tissue to dissociate [38,39].</sentence>
					<sentence id="S7.238">They also showed that the interaction of hemocytes with the fat body is mediated by a 120-kDa membrane protein localized specifically on pupal hemocytes [40].</sentence>
					<sentence id="S7.239"><xcope id="X7.239.3">This protein was <cue type="speculation" ref="X7.239.3">suggested</cue> to be a scavenger receptor</xcope>, but <xcope id="X7.239.1"><xcope id="X7.239.2">it does <cue type="negation" ref="X7.239.1">not</cue> <cue type="speculation" ref="X7.239.2">seem</cue> to be homologous to Drosophila Croquemort</xcope></xcope> (unpublished data).</sentence>
					<sentence id="S7.240">Work by Franc et al. [27] is consistent with the idea that more than one scavenger receptor is involved in this process.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.241"><xcope id="X7.241.1"><cue type="speculation" ref="X7.241.1">Possible</cue> Signaling Role for Adenosine</xcope></sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.242">The precocious metamorphic changes <xcope id="X7.242.3">that <cue type="speculation" ref="X7.242.3">appear</cue> to occur in response to elevated adenosine in the adgf-a mutant larvae</xcope> lead to the <xcope id="X7.242.2"><cue type="speculation" ref="X7.242.2">suggestion</cue> that adenosine <xcope id="X7.242.1"><cue type="speculation" ref="X7.242.1">may</cue> act as a regulatory signal for these processes during normal development</xcope></xcope>.</sentence>
					<sentence id="S7.243">One <xcope id="X7.243.1"><cue type="speculation" ref="X7.243.1">possibility</cue> is that adenosine acts as a downstream effector of ecdysone-regulated prepupal changes, and that the increase in adenosine concentration is mediated by ecdysone-induced down-regulation of ADGF-A expression</xcope>.</sentence>
					<sentence id="S7.244">This is supported by the presence of multiple sites for ecdysone-inducible transcription regulators in the ADGF-A promoter (Figure 9).</sentence>
					<sentence id="S7.245">Adenosine <xcope id="X7.245.3"><cue type="speculation" ref="X7.245.3">could</cue> serve as a signal for macrophage differentiation</xcope>, and the <xcope id="X7.245.2"><cue type="negation" ref="X7.245.2">lack</cue> of adenosine deaminase activity</xcope> due to the adgf-a mutation <xcope id="X7.245.1"><cue type="speculation" ref="X7.245.1">could</cue> cause precocious differentiation of these cells in mutant larvae</xcope>.</sentence>
					<sentence id="S7.246">We are now carrying out direct tests of the <xcope id="X7.246.1"><cue type="speculation" ref="X7.246.1">idea</cue> that the differentiation of hemocytes in mutant larvae is caused by elevated adenosine</xcope>.</sentence>
					<sentence id="S7.247">If confirmed, this effect <xcope id="X7.247.2"><cue type="speculation" ref="X7.247.2">would</cue> have general significance</xcope>, since in ADA-deficient mice, inflammatory changes in the lungs include an accumulation of activated alveolar macrophages [41], and <xcope id="X7.247.1">this <cue type="speculation" ref="X7.247.1">could</cue> also be mediated by elevated adenosine</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.248">Elevated Adenosine Delays Development and Inhibits Pupariation</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.249">The elevated adenosine in the adgf-a mutant larvae leads to precocious changes (hemocyte differentiation and fat body disintegration) resembling those normally occurring at the time of metamorphosis, but it also is associated with an <xcope id="X7.249.1"><cue type="speculation" ref="X7.249.1">apparently</cue> opposite effect</xcope>, in that it causes a significant delay in progress through the third larval instar and a decrease in the frequency of successful pupariation (formation of the puparium from the larval cuticle), which is one of the earliest steps in metamorphosis.</sentence>
					<sentence id="S7.250">We conclude that the mutation has additional effects on the hormonal regulation of development.</sentence>
					<sentence id="S7.251">One <xcope id="X7.251.2"><cue type="speculation" ref="X7.251.2">possible</cue> explanation for the developmental delay and <xcope id="X7.251.1"><cue type="negation" ref="X7.251.1">failure</cue> to pupariate</xcope></xcope> is that the adgf-a mutation affects the production or release of ecdysteroid hormones from the major endocrine organ of the Drosophila larva—the ring gland.</sentence>
					<sentence id="S7.252">This is supported by the fact that <xcope id="X7.252.1">pupariation rate and survival of the adgf-a mutant <cue type="speculation" ref="X7.252.1">can</cue> be significantly improved by expression of transgenic ADGF-A in the ring gland and salivary glands</xcope>.</sentence>
					<sentence id="S7.253">We <xcope id="X7.253.1"><cue type="speculation" ref="X7.253.1">suggest</cue> that this somehow interferes with the regulation of hormone release</xcope>.</sentence>
					<sentence id="S7.254">Other mutants with hormonal dysregulation show delayed larval development and <xcope id="X7.254.1"><cue type="negation" ref="X7.254.1">failure</cue> to pupariate</xcope> [42,43].</sentence>
					<sentence id="S7.255"><xcope id="X7.255.1"><cue type="speculation" ref="X7.255.1">Presumably</cue> the elevated adenosine in the adgf-a mutant blocks the production or release of ecdysone from the ring gland by an unknown mechanism</xcope>.</sentence>
					<sentence id="S7.256">This idea is supported by our finding that <xcope id="X7.256.1">both pupariation rate and survival of the adgf-a mutant <cue type="speculation" ref="X7.256.1">can</cue> also be improved by feeding the mutant larvae with 20E in the diet</xcope> (see Figure 7A).</sentence>
					<sentence id="S7.257">Thus it is clear that the adgf-a mutant is arrested in development due to an effect of the mutation on hormone production from the ring gland.</sentence>
					<sentence id="S7.258">The arrest of development in the adgf-a mutants was significantly suppressed by loss of the adenosine receptor caused by the adoR mutation: larvae simply homozygous for adgf-a pupated after two or more days, whereas larvae also homozygous for adoR pupated within 1 d after their heterozygous siblings (see Figure 7A).</sentence>
					<sentence id="S7.259">Therefore, adenosine signaling through the AdoR <xcope id="X7.259.2"><cue type="speculation" ref="X7.259.2">must</cue> play a role in the developmental arrest of the adgf-a mutant</xcope>, and <xcope id="X7.259.1">this is most <cue type="speculation" ref="X7.259.1">likely</cue> mediated by signaling to the ring gland, where AdoR is expressed (ED</xcope>, unpublished data).</sentence>
					<sentence id="S7.260">The mutation in AdoR does <xcope id="X7.260.2"><cue type="negation" ref="X7.260.2">not</cue> block macrophage differentiation and fat-body disintegration</xcope>, so this effect <xcope id="X7.260.1"><cue type="speculation" ref="X7.260.1">must</cue> involve another, as yet uncharacterized mechanism independent of AdoR signaling</xcope>.</sentence>
					<sentence id="S7.261">Work using adenosine-receptor deficient mammalian cells also <xcope id="X7.261.1"><cue type="speculation" ref="X7.261.1">suggested</cue> the existence of a novel, undefined adenosine signaling mechanism</xcope> [44].</sentence>
					<sentence id="S7.262">However, we <xcope id="X7.262.1"><cue type="speculation" ref="X7.262.1">cannot exclude</cue> the role of elevated deoxyadenosine in these effects</xcope>.</sentence>
					<sentence id="S7.263">Drosophila, now with the advantage of the well-characterized adgf-a mutant, <xcope id="X7.263.1"><cue type="speculation" ref="X7.263.1">could</cue> serve as an ideal model system in which to investigate this mechanism</xcope>.</sentence>
				</DocumentPart>
				<DocumentPart type="SubSectionTitle">
					<sentence id="S7.264">Concluding Remarks</sentence>
				</DocumentPart>
				<DocumentPart type="Text">
					<sentence id="S7.265">In our previous work using cells cultured in vitro, we showed that, as in mammals, adenosine can block proliferation and/or survival of some Drosophila cell types [16].</sentence>
					<sentence id="S7.266">In the present work, we have established a Drosophila model to study altered levels of adenosine and deoxyadenosine in vivo, and we have shown that loss of ADGF-A function causes an increase of these nucleosides in larval hemolymph.</sentence>
					<sentence id="S7.267">Although the adgf-a mutation leads to larval or pupal death, we have shown that this is <xcope id="X7.267.3"><cue type="negation" ref="X7.267.3">not</cue> due to the adenosine or deoxyadenosine simply blocking cellular proliferation or survival</xcope>, as the experiments in vitro <xcope id="X7.267.2"><cue type="speculation" ref="X7.267.2">would</cue> <xcope id="X7.267.1"><cue type="speculation" ref="X7.267.1">suggest</cue></xcope></xcope>.</sentence>
					<sentence id="S7.268">Rather, this mutation leads to an increase in number of hemocytes at the end of larval development due to the differentiation and release of hemocytes from the lymph glands.</sentence>
					<sentence id="S7.269">Hemocytes also differentiate and are released from the lymph glands during systemic infection [19].</sentence>
					<sentence id="S7.270">Together with our result <xcope id="X7.270.2"><cue type="speculation" ref="X7.270.2">suggesting</cue> an interaction between Toll signaling and ADGF-A</xcope>, this leads to the <xcope id="X7.270.1"><cue type="speculation" ref="X7.270.1">hypothesis</cue> that adenosine controls hemocyte differentiation in response to infection, and that it signals through the adenosine receptor to postpone the next developmental step, metamorphosis</xcope>.</sentence>
					<sentence id="S7.271">This <xcope id="X7.271.1"><cue 
