Heather L. Merk, The Ohio State University
Allen Van Deynze, University of California Davis
I'd like to welcome everybody to our second seminar as part of our potato breeding and genetics solCAP workshop. Our next speaker today is Dr. Allen Van Deynze from UC Davis and he's on our solCAP project that is leading the SNP marker development and analysis at UC Davis.
We're very fortunate that Illumina was able to give us a license for this class and to let us download the software on all these computers. This is proprietary software that you do have to download to be able to use. If everybody can click on the Genome Studio icon (it should be on the top left of your computer), we'll get started.
When you open up, what opens up in Genome Studio is a series of windows and the first thing we're going to do is actually load a project. To do that, we're going to hit File like you typically would when starting a project. We're going to hit New Project and it's a genotyping project. Genome Studio also has sequencing modules and things which we're not going to cover today. You should have Genotyping. Project should come up and then the wizard basically opens up and starts asking you a couple questions. We're going to... load the data to tell the software what data we're going to work on.
You click on the wizard and the first thing it asks you is where you're going to save the data in the project repository. We're just going to hit Browse and in this case if you go under scf Sol under My Documents you should see 1Potato testdata. We're just going to click that. We're going to save on the folder called 1Potato testdata. So that's where we're going to save our project to. So has everybody got to that? That has the software. Click Ok and that fills in the window. The next thing is you have to name the project and you can call it whatever you'd like. I'm going to call it Project1. Put in the name of your project and you click Next. Every time you get a new set of SNPs you need to do this for a new project that you might separate out.
Is there a question? Oh, ok. We just had a new person. Once you hit that in the wizard it's going to ask you, do you want to load from a central directory of files where your genome center might load project data or just have a local sample sheet that you have? We're going to leave the Use sample sheet to load sample intensities. We're going to use that, which is the default.
We're going to click Next. And then it asks basically three things that it needs. And so you're telling the software where your data's going to be. First thing you want to know is where your sample sheet is. And you should see this screen. You should go to the 1Potato testdata. Did you hit Browse there? I hit Browse. Sorry. We hit Browse on the first screen. It's asking the question, where's your sample sheet? Which sample sheet do you have? Every time you have a new set of samples there's an associated sample sheet that your lab is going to generate to get the data.... This one is called Potato Sample GS. They're just Excel sheets with information and we're going to open that. These are comma separated files as they're csv files that are automatically generated by Illumina when it collects the data off the scanner. Your lab is going to give you these. So that's the first line telling it where's the sample sheet. Then it asks the question, where's the data? You click Browse. Let me double check. You can click Browse. If it goes back down. No, it doesn't remember. You click Browse. We're going to have to find this Potato testdata. So scf sol, My Documents. In this case, instead of pressing 1Potato testdata, we go right to Data. We're going to just leave it at Data. No, I'm sorry, we're going to hit Potato A1 in this case. That'll be our first set of data, Potato_A1. So did everybody get to that? It's actually the address to where it should be on your computer is there as well. Documents, Potato testdata, Data, Potato_A1. So what that is, is the actual data that Illumina collected or your lab collected. Finally there's a manifest. Where's that going to go? We're going to go back to Potato testdata. Documents, Potato testdata. And then we're going to hit Next.
Then you get this file and it asks a couple of questions on what do I want to do when I load the data? We want to cluster the SNPs automatically. You can do this later or right now. I just do it right now because it becomes automatic. The program can auto-cluster SNPs and I'll explain what that means once we have it done. (Inaudible question from audience) The repository was just Potato testdata, the prepared folder. It's the same as the sample datasheet folder. We're going to count the sample statistics around the SNPs. Right now we don't have heritability data associated, but if you have say a population, it could be a diversity panel where you already know the pedigree of different accessions in your data and you have that data in a separate file you would calculate that. We're not going to do that because we don't have it.
Gen call threshold. Gen is basically Genome Studio, they have a threshold that they give when they look at SNPs and it's something that's calculated in the background. For Infinium data this should be 0.15 and that's the default. For GoldenGate data, so these are usually samples with less than 3000 SNPs, that should be set to 0.25. This is just an internal call that they use. Right now let's just leave it at the default, 0.15 for Infinium data. Then we'll just hit Finish.
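As a rough illustration of what a call threshold does, the sketch below sets any genotype whose score falls below the cutoff to a no-call. The function name, the data layout, and the choice of treating a score exactly at the threshold as a valid call are assumptions for illustration; Genome Studio applies its own threshold internally.

```python
# Hypothetical helper, not part of Genome Studio: shows the effect of a call threshold.
NO_CALL = "NC"

def apply_call_threshold(calls, threshold=0.15):
    """Return genotypes, replacing any call scored below `threshold` with a no-call.

    `calls` is a list of (genotype, score) pairs, for example ("AB", 0.42).
    """
    return [genotype if score >= threshold else NO_CALL for genotype, score in calls]

example = [("AA", 0.80), ("AB", 0.12), ("BB", 0.20)]
print(apply_call_threshold(example))        # 0.15, the Infinium default mentioned above
print(apply_call_threshold(example, 0.25))  # 0.25, the GoldenGate setting mentioned above
```

With the stricter 0.25 cutoff the third sample also becomes a no-call, which is why the threshold should match the assay type.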
Now it's loading all your data. There's a number of windows that open if everything works out. There's actually five, six windows that open and there's a number of them that are a waste of time that are basically taking up space. This is just a log on the bottom window. If I close the bottom window, so you notice it just increases the size of your window to full screen. The errors table we're not going to use. Let's just drop that. The window that we like to see a lot is actually this one up here in the top left. We can make these windows as big or as small as we like. Something like this is a good way to work. If you want to see what windows are available you can go down to the windows and you can click and unclick different windows as you want. You notice there's three main windows but there's more than three windows clicked here. Notice there's some tabs in these other windows, so these represent these tabs in the window, like the paired sample and the SNP table, which we're going to go to.
One of the first things that you want to do is you want to define the clusters. A cluster is, let's click on number one, the top window over here. That's a bad example. Let's click on number two. What the software does, it color-codes the clusters where it thinks the genotype classes are. The clusters that are in the red are what's called the AA genotype. The A genotype is the first genotype that we gave it. When we defined a SNP we give them a sequence and we say this is a C/G SNP, the A is a C the B is a G, but you don't have to remember that because it's all in these tables which I'll show.
The question is what kind of data are we looking at? This particular data that we're talking about is a set of 96 SNPs that we developed with the solCAP project to validate how good our SNP detection algorithms are. You'll get this. What's unique about all datasets is how the SNPs are defined. That's the first thing you need to know. And that's called your OPA file, which was one of the files that you're going to find that was already loaded on the computer.
The OPA file defines what the SNP is and at what position. It basically has the criteria, what the sequence is, what the SNP is, what the allele is. You know, is it a C/G SNP? An A/T SNP? And so on. Then the names of those SNPs. The names of the SNPs in this case are in the table over here. If I stretch this out, the name of the first SNP in the table is solCAPC213199. That data's captured in the OPA file. These are predefined SNPs.
The Illumina genotyping is a true genotyping platform; it's not a sequencing platform. These are SNPs that are pre-defined that you've designed an Illumina assay on. We're not going to talk about how that's done. SolCAP, in this case we worked with Robin who designed SNPs based on the alignments that she was talking about in some other data that we generated. We send the sequences to Illumina and they design an assay around it; basically a set of primers that they send to... they design the physical oligos for those primers they put on an Illumina chip. The Illumina chip actually looks like an array, like a glass slide and in one well so to speak, or one pin, in this case there's 96 SNPs. You add one sample of DNA and it will interrogate 96 SNPs.
What you're going to see in the Infinium array that solCAP is generating is that there are actually 10,000 data points on this array. It will represent about 8,300 different loci that we're interrogating. It's still the same thing. One sample, 8,300 loci.
In this case there's 96 lines if you scroll around to the bottom. There's 96 different samples or SNPs that are being interrogated here. What we're seeing is the data generated from that. In this case they're SNPs within genes, but they don't need to be. It's whatever you've submitted to Illumina to generate. It could be any SNP that you want. In solCAP they're all SNPs that are from expressed sequences. They're expressed sequences; they may not be in genes. They could be three prime end and so on. They are in genes that are necessarily expressed.
(Inaudible background question)
I'm not positive on our nomenclature on that. I'm going to have to ask Robin.
(Inaudible background talking)
The question was, what do the numbers mean in the name. The answer is not very much as far as you're concerned right here. We have a cross-reference table at solCAP which we'll put online shortly which will show what the sequence is, what the SNP is, and so on.
I'm going to show you. In the software it will tell you what the SNP is in these tables. You can capture that. The actual sequence that it's from and so on that'll be something we'll cross-reference as part of the solCAP project in the panel.
(Inaudible background question)
This software is not fully integrated into the genome browser or any of the other solCAP tools. This is independent software. What we've loaded is a set of SNPs. We've told it what the SNPs are. We told it what the samples are and the software's capturing, now putting together what the combination of those two are when we put them in the lab. We'll put them through the lab. What you see graphically in the top left corner is the results of putting the SNPs and the samples. What you're seeing is all the samples at once. It will automatically try to cluster the data based on what it sees. But you don't have to accept their algorithm. You can refine anything you like and make these clusters tighter. Say you know that the heterozygous classes should be out here somewhere. I clicked on a marker here, number five in this case. To change the clusters and what you keep, you can hit the Shift key. You see this cross-hair when you go over the middle of it? You can say I really think the heterozygous cluster should be a little bit over to this side. So what did that do? All I did is I shifted to the right. Notice when I do that, the purple cluster, the point down here in the red that is now red and that's off the axis is now going to be called an AA allele instead of an AB allele, a heterozygote because I've moved that cluster.
You can do the same thing here. Say I'm not very confident with these guys on the edge. I don't want to even make that call. Hitting the shift key, you can make this a little tighter. You notice now I'm excluding this data point. I'm actually going to make the red a little tighter. I see the size window. You can make this really tight. Notice this guy? Now I'm not sure. It turned black. That's going to be a no call in your data now. You can exclude data points and say I'm not sure, I really don't want to make those calls.
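To make the effect of tightening a cluster concrete, here is a minimal sketch in which each genotype cluster is reduced to a window along the theta axis. This is a simplification: the clusters drawn in Genome Studio are two-dimensional regions, and the centers, widths, and function name below are made-up values for illustration only.

```python
def call_from_theta(theta, clusters):
    """Return the genotype whose theta window contains `theta`, else a no-call.

    `clusters` maps a genotype label to (center, half_width) on the theta axis.
    """
    for genotype, (center, half_width) in clusters.items():
        if abs(theta - center) <= half_width:
            return genotype
    return "NC"  # shown as a black point in the plot

# Hypothetical cluster definitions: automatic, then with the AA cluster tightened.
auto_clusters = {"AA": (0.05, 0.10), "AB": (0.50, 0.15), "BB": (0.95, 0.10)}
tight_clusters = {"AA": (0.05, 0.05), "AB": (0.50, 0.15), "BB": (0.95, 0.10)}

edge_sample = 0.12  # a point sitting on the edge of the AA cluster
print(call_from_theta(edge_sample, auto_clusters))   # falls inside the wide AA window
print(call_from_theta(edge_sample, tight_clusters))  # falls outside every window
```

The same data point is called AA under the wide cluster and becomes a no-call once the cluster is shrunk, which mirrors the point turning black on screen.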
We're going to go there. The question was, since these are tetraploids, should there be more genotypic classes? The answer is yes. This software right now is only set up for diploids, so it's only going to call three classes: homozygotes and one heterozygous class. We're working with Illumina to add that feature but we've found a way right now to do it and I'll show you how to do that independently using some Excel macros that we've developed. Right now, just assume we're working with a diploid because that's all we can call at this point using the current Illumina software. Within a few months they hope to have a polyploid module for this. We've provided them with this very data set and they're going to work with that to make those calls.
Oh, what are the graphs? The graphs are a combination of intensity, let me make sure I get this right. The graph on this scale is the intensities of red and green. Theta is a measure of that. That's what you're looking for on a scale from one to zero. Where the a's would be all red and they actually captured green and these look more yellow. This is kind of the fraction of green and red mix is what you're capturing. We're defining clusters while we're doing this but you have to hit the shift key and so you notice that the circle went out kind of into this homozygous class over here. You can shrink that and make it as tight as you like. Notice that I've made the circle a little tighter.
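For readers who want to see how a zero-to-one theta scale can be derived from two channel intensities, the sketch below uses a commonly cited polar transformation: theta is the angle of the point scaled to the range 0 to 1, and R is the summed intensity. Whether Genome Studio computes its plotted values exactly this way is an assumption here, and the function and variable names are illustrative.

```python
import math

def to_polar(x, y):
    """Convert normalized intensities to (theta, r).

    x is the signal for the A allele, y the signal for the B allele.
    theta runs from 0 (all A signal) to 1 (all B signal); r is total intensity.
    """
    theta = (2.0 / math.pi) * math.atan2(y, x)
    r = x + y
    return theta, r

print(to_polar(1.0, 0.0))  # only A signal: theta at the low end of the scale
print(to_polar(1.0, 1.0))  # equal signal from both alleles: theta near the middle
print(to_polar(0.0, 1.0))  # only B signal: theta at the high end of the scale
```

Under this definition a sample's position on the theta axis reflects the fraction of signal from each allele, which is why intermediate doses land between the two homozygous clusters.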
I've also excluded these data points. Now these data points are going to be excluded and are going to get a missing data point. Anything black is a missing data point. It's not going to make a call on that. The other thing I want to show is I like you can only change things in what's called the polar, this is called the polar view.
If you click here on the second little graph it's called the Cartesian view. This is looking at the relative intensity using a little different algorithm. I like to look at it. You really see the clusters a lot better from this view. I'll try to pick a SNP. For example, if you look at SNP number four, you can start seeing that there's clearly four clusters there in this tetraploid data. There's five clusters if you click on number six for example. It's not perfect, but when you look at this data you can actually start counting: there's one, two, three. Sorry, one, two, three, four, five clusters there. We're seeing, just to give a little background. We have 96 SNPs here that we're interrogating and we're able to call five clusters using our software, which I'll show you, in 56% of the data points. So in 55 data points we actually saw five clusters that we can make sense of. We saw four clusters in about another 25% of those data points. Just to give you an idea of the sensitivity of the Illumina assays at least for this set of data points. So for over half of the data points we're able to see basically the five genotypes that we expect. Does everybody capture the five genotypes? Basically, the first is four A's all the way to four B's with different doses of A in it and you get five possible genotypes in a tetraploid.
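The five tetraploid genotypes can be listed mechanically by counting doses of the B allele from zero up to the ploidy. The short sketch below does that for a bi-allelic SNP; the function name is just for illustration.

```python
def dosage_genotypes(ploidy):
    """List every genotype of a bi-allelic SNP, from all A alleles to all B alleles."""
    return ["A" * (ploidy - b_dose) + "B" * b_dose for b_dose in range(ploidy + 1)]

print(dosage_genotypes(2))  # ['AA', 'AB', 'BB']
print(dosage_genotypes(4))  # ['AAAA', 'AAAB', 'AABB', 'ABBB', 'BBBB']
```

A diploid gives three classes and a tetraploid gives five, which is why diploid-only calling collapses the three heterozygous doses into a single AB class.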
For now let's just work with the diploid features and then we'll show how the tetraploid data can be analyzed. You scroll through different markers and you start seeing, so here's a nice one, number twenty where again you see the five clusters pretty nicely. Unfortunately you can't change any of the clusters here. To change, you have to go back to the polar view and you have to start playing with using the shift key and changing where the clusters are. What you really like to see in the perfect data point is all the A's. Basically you want to see a cluster on each side tight to the axes. You've got homozygotes on the y axis and you've got homozygotes on the x axis. That's a nice clustering. That's a marker you're really confident in. You've got those homozygous classes that are really tight. In this case you could really see, I think, the five classes quite nicely.
Does Illumina make a mistake? Absolutely.
Illumina has the same, is limited to the same problems we all are. It's garbage in, garbage out. You have good markers, good primers, you're going to get good data. If you have good DNA, if your DNA samples are not great, if they're not normalized to a nice uniform concentration and so on, that's going to play into effect how good your results are going to be.
Actually, number nine is not that bad. I'll give you an example. The best way to figure out good from bad, and this is kind of a next thing. You click on Gen Train Score under the full table view. You now have sort tools. Why don't we sort from top to bottom. The Gen Train Score is the confidence it has in calling all the data for that particular SNP. Click this guy. The one at the top has a really low Gen Train Score. Notice there's no homozygotes there on the x axis. Everything's kind of bunched up really tightly so it's not very confident in those calls. That's a good way of working, actually. Instead of going through, you know, 96 samples, you can say, well, I'm just going to look at the ones with a bad Gen Train Score. What I've found though is you really have to look at both the top and the bottom. If you click anywhere in the middle in the Gen Train these are nice. These are beautiful clusters. Those are kind of 0.5 Gen Train Scores. Those are really good. But the ones with really high Gen Train Scores are really bad too because it's really confident that that marker is not polymorphic. It has a high Gen Train Score but it's a bad marker because there's no polymorphism in it. It's all one cluster. So you've got to look at both sides. You've got to look at both ends of the data. This is perfect timing on the question. It's a great way of looking at which markers are good, which markers are bad.
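Once scores are exported, the "look at both ends" advice could be applied along the lines of the sketch below. The cutoffs of 0.3 and 0.9 and the marker names are hypothetical, chosen only so the example has something to flag; they are not recommended values, and the flagged markers still need to be inspected by eye.

```python
def markers_to_review(snps, low=0.3, high=0.9):
    """Return (name, score) pairs from both ends of the score range, sorted by score.

    `snps` is a list of (name, score) pairs. `low` and `high` are
    hypothetical cutoffs for this illustration only.
    """
    ranked = sorted(snps, key=lambda item: item[1])
    return [(name, score) for name, score in ranked if score < low or score > high]

# Made-up marker names and scores.
scores = [("snp_a", 0.52), ("snp_b", 0.11), ("snp_c", 0.97), ("snp_d", 0.48)]
print(markers_to_review(scores))
```

Here the low-scoring marker is flagged as a poorly clustered candidate and the very high-scoring one as a possibly monomorphic candidate, while the two mid-range markers pass through.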
Can this technology handle more than two alleles? The answer is yes. Look at the Gen Train Scores in the middle and you'll see there are five clusters. Can the software handle it? In the current version, the answer is no.
The question is, can this technology handle a SNP that has more than two alleles. The answer is no. But is that something you want to do? Probably not. It makes it pretty difficult and for the most part we haven't seen any of those in our data. For the most part we've seen bi-allelic SNPs as opposed to tri-allelic. You see tri-allelic but they're not that common.
The question is, if you're testing a wild species for a particular SNP are you likely to detect this? The answer is, we do actually have quite a few species here. We did actually do this data with a bunch of diploid species and so on like that. We did get good data from that. It's kind of the same question that we had in the previous seminar. If I design a set of primers in the DM, can I get it in the others and so on. The answer was yes. What we've seen the answer is yes. You may get some that fall out if they're too broad. But that will show up as missing data points. From what we've seen—we're fairly new on this—we've tested about a dozen diploid species and about 500 lines in total with this data and we've had a good success rate on the calls. We had very few missing data points.
The question is, is the number of SNPs quite limited across species? I don't know about the number, but the position is conserved. The SNPs are conserved. As long as the sequence aligns and there's not an insertion/deletion there should be an allele at that place, right? Is it a new allele or not? If it's actually a C where we've only interrogated an A and a T in the SNP, the assay won't work. The assay was designed for an A and a T. It wasn't trying to interrogate a C. So if you actually have a C in your wild species in that space it'll be a missing data point. It's limited to the assay basically. If you have a C/G SNP that you're assaying, that's all it's going to assay. If that sequence doesn't exist in your wild species, it's not going to show up on either the C or the G. From what we've seen from the SNPs that we've checked it's been pretty useful across the diploid species we've tested for the vast majority of the markers.
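The A/B labelling and the limit described here can be sketched as a lookup: the first allele in the SNP definition is labelled A, the second B, and any base outside the designed pair yields a missing data point. This is an idealized picture that follows the explanation above, not a model of the assay chemistry, and the function name and string formats are assumptions.

```python
def ab_genotype(sample_bases, snp):
    """Translate a sample's bases at a SNP into A/B labels, or a no-call.

    `snp` is written like "C/G": the first allele is A, the second is B.
    `sample_bases` is a string of the bases the sample carries, e.g. "CG".
    """
    allele_a, allele_b = snp.split("/")
    labels = {allele_a: "A", allele_b: "B"}
    if any(base not in labels for base in sample_bases):
        return "NC"  # a base the assay was not designed to interrogate
    return "".join(sorted(labels[base] for base in sample_bases))

print(ab_genotype("CG", "C/G"))  # AB
print(ab_genotype("GG", "C/G"))  # BB
print(ab_genotype("CC", "A/T"))  # NC: the assay only interrogates A and T
```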
I'm going to go through the different options we have here. What's nice about this software. You can click... when you right click, you can get context specific menus and one of the things you can do is... let's see... what did I want to show here? You can choose different colors for different... notice here I have one sample highlighted and the sample that's highlighted is in yellow in this case. I can actually right click and say "Configure Mark" and I can say I'm going to call this data point, point one. You can call it a set. You can choose what color those points are going to be. I'm going to show that data point as white from now on. I hit Ok. Now that data point, no matter what SNP I have, that SNP is going to show up white.
You can highlight some samples. They can be your control samples and so on, things like that. You can do this. You can highlight a number of data points. It's called lasso mode. You can click this little lasso at the top. You click on the lasso and it highlights a whole bunch of data points in this area. Now those are highlighted as yellow and you can define those as separate things. Think about this as, let's say, in the context of clusters. You see different clusters that are not highlighted here, you can highlight the clusters differently. Right now I'm highlighting samples, not SNPs. Those clusters will change with the different doses of those SNPs of course. It also highlights in the sample which samples I've highlighted.
Is there a reset key? Just click out of it. Just click out of it and then the highlight's different. The yellow guy to reset I think, you'd have to re-click. I forget if there's a reset key. If you right click, there's a clear marks on. That's it. There's a clear marks on point one. You can decide which cluster you want to clear in there and you can just hit all. One of the things that you can look at, if you don't like, so they've chosen red, purple and blue, which, if you're color-blind, you might not see the red so well. You can choose what colors you want. So if you go to Options for the project, you can actually choose what colors you want for the different backgrounds that you like. For the AA in this case is always red, purple for heterozygote, blue for the other homozygote. You can exclude samples. You can actually highlight samples and say I don't want them anymore. Those can be highlighted in different ways.
How do you exclude a sample? If you highlight a couple, say this is a bad sample, I don't want to see it. You click on the sample and you can click exclude selected sample. By doing that, the data will not be collected or presented for that data point. You can do the same for SNPs by clicking on a SNP. I guess not in this set. I'll show you in a report. You can export only certain SNPs if you want. If you forgot to click. If you want to re-cluster. If you don't like to see the shades, you can click this little shade thing and you can take the shading out. You can also move this up. You can also move this up and down. You can zoom in to certain clusters if you want to be able to fine-tune these a little more. And so on. To get back you can just hit these guys. This optimizes your screen horizontally. This optimizes it vertically. If you want bigger, smaller dots you can do bigger dots, smaller dots. Let's see, that's the first part. If you want to autocluster all the SNPs, this little guy here, when we started our project we forgot to click autocluster the SNPs, we can just hit that.
Right now, if I autocluster all the SNPs, all the movements I made before now won't be saved. I hit yes. It goes through and re-calculates the statistics again. You can do that at any time. We've moved around, we've defined clusters.
Now we want to be able to export data. If you want to export data. Let's look at the SNP table. In the SNP table we have a whole bunch of different things that you can see at the SNP table. You've got your names. We never defined the chromosome because we don't know it. You see chi-square tests for a 1:2:1 in this case. There's a number of statistics here. You've got frequency of As and Bs, the call frequency. This is something you might want to sort by for your SNPs. Let's sort by call frequency. You've got 100%. Let's see some guys in this data set. Basically this data set is a very high call frequency. There's only one SNP that really failed, that had poor calls.
Not the references, that's all your data. It's looking at all the data that you've loaded here. In this case there's 96 samples. It's called only half of those samples in this SNP that I did. 56% of those SNPs. The call frequency's very high so you notice there's one at 56%. The rest are at 90 and above, so it's able to make a call. You can change that by playing with those clusters like we did. You know, you can exclude or include samples.
That's not the allele frequency. The call frequency and the chi-square are not directly related because you might have called 50% of the data but that data might still be in equilibrium 1:2:1 but you only have 50 samples now instead of 96, but there's 12, 25, and 12 to give you a high chi-square in this case. They're not related at all actually. Minor allele frequency, that's basically the B allele. These 10% GC, 50% GC. GC is a Gentrain call. That's just a percentile. It's the percentile of SNPs. You get a Gentrain call score and this is the 50th percentile, the 10th percentile. This one has a high Gentrain score. The 50th percentile is 0.28. Those are relevant. What you do want to see is in this column the SNPs so it tells you it's an A/G SNP in my case that I'm showing. 95 of the samples out of 96 were actually called and there's only one missing. I'm not interested in all these columns when I want to export things.
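The independence of call frequency and the 1:2:1 test can be checked with the numbers from this example. The sketch below computes the textbook chi-square statistic against a 1:2:1 expectation, where a small value means a close fit; the chi-square column in Genome Studio may be computed or scaled differently, so treat this as an illustration of the idea rather than a reproduction of the software's column.

```python
def call_frequency(n_aa, n_ab, n_bb, n_samples):
    """Fraction of samples that received a genotype call."""
    return (n_aa + n_ab + n_bb) / n_samples

def chi_square_1_2_1(n_aa, n_ab, n_bb):
    """Textbook chi-square statistic of called genotypes against a 1:2:1 ratio."""
    n_called = n_aa + n_ab + n_bb
    expected = (0.25 * n_called, 0.50 * n_called, 0.25 * n_called)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# About half of 96 samples called, with counts of 12, 25 and 12.
print(round(call_frequency(12, 25, 12, 96), 3))
print(round(chi_square_1_2_1(12, 25, 12), 4))

# Every sample called, in an exact 1:2:1 ratio.
print(call_frequency(24, 48, 24, 96))
print(chi_square_1_2_1(24, 48, 24))
```

Both cases fit 1:2:1 closely even though one has roughly half the call frequency of the other, so a SNP has to be judged on the two statistics separately.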
I can hit what's called the column chooser and that's this little button over here on the top right. Does everybody see that? The column chooser. The column chooser is a little button over here and you can decide which columns I really want to see. Some of the things are hidden. In the display column I want to see the name, the chromosome, the chi-square. Let's see I'm really not interested in this auxiliary column and everything in between. All I have done is I have hit the shift key, hit everything in between or I can hit the control key and highlight things or un-highlight things and I hide those. This is the name of my OPA file. I don't need to see that and so on. I hit Ok. It reconfigures it and now it's only showing the columns that I've highlighted. That's important. Actually I'm going to hit, choose, there's nothing in the chromosome position or the position, I'm going to hide those guys. Now I've got the index, the name of the marker, the chi-square and basically the calls, the allele frequency and the SNPs. If I go back to full table that's really your data. You can do the same thing. See choose. I don't want the chromosome, I don't want the position. I really only want to see the genotypes. Notice you're starting to see your samples. I'm going to highlight everything in between here, hide the rest. This will be the data for your samples. These are the names of the samples that are in there. I want to see the data for all these guys. I'm going to hit ok.
Then I can export that data to a file. I can just export. It's just a little icon over here. Export displayed data. Where do I want to put that? It automatically will go to your project file. Give it a name. Call it full data table. I'm going to call it 1. I hit save. Can I select just certain SNPs or do I want to export everything? I hit yes. Do I want to view the file? I'm going to say no. I've exported that file right now. You can upload that into Excel and then upload that into something, into your favorite software. You can do exactly the same thing on the sample data sheet. You can choose what you want. These p are really percent. So the five percent green, the 50% green, these are like medians or percentiles on the green. Green-red scale. They're not that informative to me on the sample sheet. If you want a sample sheet, you might just highlight everything and it just shows you that. To undo that, if you want to see everything, you can just highlight and then show the column you like after that and so on.
What I'm going to show you now is something solCAP is going to provide for our chip arrays: what's called a cluster file. We're going to have to go through all our SNPs and we're going to define the clusters. Instead of everybody having different cluster files and not necessarily having the same calls, you can export that data and you can import that cluster file from solCAP and say, ok, this is the way solCAP would have called it based on the data set they have.
The question is, how many SNPs do we have. Right now we've only collected data on 96 SNPs times about 500 samples.
Yes, we don't know where they are in the genome. The SNPs were chosen to represent as many scaffolds as possible based on the genome sequence that Robin talked about. They're also chosen on candidate genes that people are interested in around the carbohydrate pathway and then some others. That's the general selection of SNPs. The criteria for that is on the solCAP website on how we selected those SNPs. We haven't mapped these. We don't know where they are exactly, but we know what sequences they are and which scaffolds they're associated with on the genome. That's what we know for now.
No, these are SNPs. Most of the SNPs were between the three varieties that we sequenced: Atlantic, Snowden, and Premiere. We did the transcriptome sequence only with Illumina. They represent SNPs around genes. Would that be accurate Robin?
(Inaudible background talking)
Ok. The answer was, how do we know where the SNPs are from? Right now the public can't know that. In two months Robin, Robin knows where they are, in the context of which scaffold they're on, what the base of that SNP is, from what variety. We used the three varieties, Atlantic, Snowden, and Premiere, but we also used the EST sequence and the genomic sequence that was available in the public as well. There was old variety EST databases from Kennebec and also the Rh and the Dm. Stay tuned for about two months and you'll have really the whole history and be able to follow where that SNP is really from. We'll have all that data on the solCAP website linked.
The question is, can you link phenotype data in this software? This is not QTL analysis software in any way. This is just a way to manage your SNPs and export genotypes for the most part.
As far as annotation you mean? This software you can annotate SNPs within this software to do that but I don't think this would be the best spot to do it to be honest. You're better off to get your genotypes and probably use something more integrated as the tools develop on that. They do have a phenotype column that you can add phenotypic data in CSV format but all it really does is display data. They'll show your phenotype red and your SNP red so you can kind of make pattern type associations in a heat map, but there's no statistics around it whatsoever. It might be pleasing, but I don't think it's going to hold water and I wouldn't be making any decisions on it to be honest. But the answer is you probably shouldn't use this as a phenotype-genotype tool.
I'm running a little short on time. What I'm going to show you, a couple more things. I'm going to pick a SNP that's fairly clean. I sorted back by index. I just hit the sort key and went over to index. Hit index, sorted, and I picked SNP number four. What I'm going to do is, I'm going to shift a couple things and it doesn't matter what you do. I'm going to shrink this and I'm going to move this. Now I've changed from the automatic clustering file. What I'm going to do is, I'm going to export this cluster position. This is what, say, solCAP will do. We'll define all the clusters and we're going to export cluster positions. A cluster is basically these different shades. This is a cluster. Purple's a cluster. A cluster's a SNP. This is a heterozygous cluster, a homozygous cluster and another homozygous cluster. It's really clustering of genotypes is what it is. You can export all the clustering, basically all the defined clusters for the data. You click File, export the cluster positions, and we're going to say for the selected SNPs or all SNPs. I'm going to hit For all SNPs. So file, export cluster positions for all SNPs. Where do I want to put these files? It goes right back to that directory we defined initially. Let's call it Cluster one and we hit Save. That changes nothing. All we've done is export the current data. I'm going to now. I'm going to hit re-cluster my files back to automatic. So I'm going to click this cluster all SNPs again so hit the little three marks over here. Hit Yes. Calculate the data? Yup. Notice it went back to the original thing. Ok. We're back to auto. That's not what we want to see. We want to something. I'll give you an example. I want to do what solCAP did. I go back to file. On our website we'll have say a cluster file for these SNPs. We're going to import that cluster file. Then we see. We're going to import the EGT file. That's a cluster file. In this case it's the one we just did. We hit import, choose our file. 
We're going to open it. Now it re-clustered the file so that it's the way we had it re-clustered. It's going to put the data in the context of the redefined clusters that somebody else did in that cluster file. That's actually a very important thing to know, so that you don't have to redo the clustering yourself. For example, we'll have 8,300 markers in our Infinium array. You don't want to go through 8,300 markers. Let's let one person do it and let everyone use that dataset. That's something really important to know.
Well, it won't be in the help file but it'll be in the solCAP file, exactly. Of course you have the option of saying I didn't really like what solCAP did and you can move those around yourself. I mean, your data may look different from ours as well. We've played around with the clusters, we've selected loci and SNPs, we've sorted samples, we've exported clusters. What we're going to do now is we're going to add new data. This is 96 samples with 96 SNPs. I've run a new plate, so let's combine two datasets, these 96 samples and 96 new samples. You can go File, Load additional samples. You see this screen, which you've seen before. Use sample sheet to load intensities. We'll click the default. Next. It needs to know what those samples are. So you need a new sample datasheet. I'm going to go Browse and that second sample sheet in this case is called Potato Control Samples, so we're going to click that. We hit Open. Where's the data? We're going to go back to My Documents, Data, and we'll capture the data for that plate as well. Click solCAP potato control as well. Ok. That's the two things it needs. What are the samples? Where's the data?
It already knows what the SNPs are because it's the same set of SNPs that we analyzed. Click that. If you look on the bottom left, now it's displaying 192 rows times 96 SNPs. So, there's 96 SNPs but 192 samples on that so you now have new samples. Some of these samples, let's see. This control plate, what we've done with the control plate is we've actually spiked some controls in to try to define the clusters. We used a double monoploid and we spiked other diploids in there in ratios of four to zero and zero to four, so three to one, two to two, one to three. We should be able to see where those samples are if they're polymorphic. It follows the dose. I'm still on marker number four.
Let's look in the Cartesian view, it's easier to see. If you click these, you can see these samples three to one, one to one, and you notice the dose is moving. Actually, just highlight the samples. Number four SNP, let's highlight samples 102, scroll down 102 to 105 and this is what we get. We see a nice dose response in the clusters. This is a control plate that solCAP put together. We capture about 40% of the loci, with about three, four mixtures of diploids. This is what we're going to use to help define those clusters. We have spiked samples that we see the dose. We're seeing this works pretty well. So you can click the markers, this isn't going to work for all markers. Here in this case, dm times x08 is not polymorphic when you click on marker number 13. It doesn't work for all the markers. You'll see the doses here. You click the different markers. If you highlight 109 to 113, you'll get different doses. This one doesn't seem to separate out as well as the other ones. You can walk through and start defining doses and so on.
So the question is, I expected to see the points in the center of the cluster. That's a comment. It's not perfect. As you can tell, if it was all perfect, you'd only see five data points. Five points there with a bunch of them on top of each other, right. There's movement due to DNA samples, and how well the assay worked, and so on.
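The five ideal positions mentioned here follow from simple mixing arithmetic. As a rough sketch only: assume the double monoploid carries only the A allele at a SNP, the other sample carries only the B allele, and each "part" of the mix contributes the same amount of DNA. The function name and these assumptions are illustrative and are not part of the solCAP control plate protocol.

```python
from fractions import Fraction

def expected_b_fraction(parts_dm, parts_other, b_frac_dm, b_frac_other):
    """Expected B-allele fraction in a DNA mix of two samples.

    parts_dm / parts_other: mixing ratio (e.g. 3 and 1 for "three to one").
    b_frac_dm / b_frac_other: B-allele fraction within each sample
    (assumed values, 0 = all A, 1 = all B).
    """
    total = parts_dm + parts_other
    mixed = Fraction(parts_dm) * b_frac_dm + Fraction(parts_other) * b_frac_other
    return mixed / total

# The five spiking ratios described in the talk: 4:0, 3:1, 2:2, 1:3, 0:4
ratios = [(4, 0), (3, 1), (2, 2), (1, 3), (0, 4)]

# Polymorphic case (assumed): double monoploid is all A, other sample is all B
polymorphic = [expected_b_fraction(a, b, Fraction(0), Fraction(1)) for a, b in ratios]

# Non-polymorphic case (assumed): both samples are all A, so no dose series
monomorphic = [expected_b_fraction(a, b, Fraction(0), Fraction(0)) for a, b in ratios]

print(polymorphic)
print(monomorphic)
```

Under these assumptions the polymorphic mixes step through five evenly spaced allele fractions, which is the ideal dose series; real points scatter around those positions because of DNA quality and assay performance, as noted above.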
That's already been done. There's no normalization right now. We're going with what the software's doing alright. So you can't play with that normalization that I know of. I really like these sample controls. This is a perfect time to take those samples and say now I want to see those controls and I'm going to configure those markers and I'm going to highlight these. I right click on those samples. I'm going to add this. I'm going to call this CMM63 controls. Here, I'll type better. I'm going to highlight those as actually gray is not bad. You pick your favorite color and hit Ok. So those are always going to be gray now. No matter what marker you put them in. They don't always cluster perfectly. You have to find a marker that they're actually separate in. This is a good use of the configure mark with the different colors. I want to sample these controls as I go through.
I'm running out of time. There's two things that we need to do to get the tetraploid data, which is what you guys are here for. Make sure you're on the Full Data Table tab. We're going to choose columns and we're going to keep the index, the name, the address. Delete everything else. What I've done is clicked on Chromosome position, I'm going to slide down, hit the Shift key, highlight everything, hide the rest. No, sorry, I made a mistake. I'll undo that. I'm going to take away Chromosome up to where the samples start, Atlantic rep one. Chromosome up to Frac T, and I'm going to hide that and I'm going to keep the samples, which is the rest. I'm only going to capture the theta score, the data point. I want to highlight GType, Score, and, holding the Control key, R, and I'm going to hide those. Does everybody have that? Ok, so the only thing I'm keeping is theta. I highlight all the others and I hit Hide. Then I hit Ok.
Now I basically have the SNPs, the address, and the theta for each of the samples. I'm going to export that. I'm going to click Export displayed data. I'm going to call this SNP table, or whatever you like. It's a text format. I'm going to save it in my project directory. Hit Save. SNP table text already exists. Do you want to replace it? I'm going to hit No. I'm going to call it SNP1. It asks you, only the selected columns, or would you prefer to export the entire table? Say Yes. Do you want to view the file? You hit Yes. It comes up. It's not that interesting. It's in Notepad. That's all I need to export right now to capture the four genotypes. Sorry, the five genotypes. I'm going to minimize the program window. I'm going to find that file. If you go under My Documents, highlight the folder, go under My Documents. You're going to find the Potato testdata. I'm going to open the two files. I'm going to find my SNP table file. I'm going to right click and I'm going to open that with Excel. Let's do it a different way.
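The exported table can also be read with a script rather than a spreadsheet. The sketch below is only an illustration: it assumes a tab-delimited file with Index, Name and Address columns followed by one `<sample>.Theta` column per sample, and the sample and SNP names are made up. Check the header of your own export before relying on this layout.

```python
import csv
import io

# Stand-in for the exported text file; the column layout and all names
# here are assumptions for illustration, not taken from a real export.
exported = (
    "Index\tName\tAddress\tsample1.Theta\tsample2.Theta\n"
    "1\tsnp_a\t1001\t0.02\t0.51\n"
    "2\tsnp_b\t1002\t\t0.97\n"
)

def read_theta_table(handle):
    """Map each SNP name to {sample: theta or None} from a tab-delimited export."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        thetas = {}
        for column, value in row.items():
            if column.endswith(".Theta"):
                sample = column[: -len(".Theta")]
                # An empty cell is kept as None (no usable data point)
                thetas[sample] = float(value) if value else None
        table[row["Name"]] = thetas
    return table

theta_table = read_theta_table(io.StringIO(exported))
print(theta_table)
```

To read a real export, the same function could be passed an open file handle instead of the in-memory string.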
First I'm going to open the polyploid data converter. This is just a file that is actually specific to the 96 SNPs we have. So if everybody can open the polyploid data converter. It takes a little time to open, but it will get there. This is a file that my technician, a graduate student in my lab put together with a series of macros that use the theta values to make the calls. Everybody should come up with this file over here. What Kevin in my lab has done, he's looked at the theta values and he's said cluster one is going to be at 0.2 for this SNP, 0.55 will be cluster two and so on. He's captured the theta values that are in between each of the clusters for the file. Then he's got a series of macros which you don't need to know about. We're going to open now one more file, so we're going to go File Open and we're going to find our SNP table. We have to double click, look at all files. Where did it go? Under Potato testdata. We're going to find SNP table one. We're going to open that data. Delimited should be fine. Just delimited is fine. What it does it just opens your SNP table and it should be 192 samples times 96 SNPs. These are the theta values for all those. What we're going to do is, we're going to capture the marker name and all the data. You can highlight B1 to the rest of the data. Capture all that data. I just hit the Shift key and I highlighted everything in between. Went to the bottom right corner, top left. Ok? We're going to copy that, so we're going to go Copy, you can hit this little icon over here or hit Control-C. That's in the clipboard. Then I can go back to this file over here, the polyploid converter file. Has everybody got that to capture the data? We're going to click on Marker Name and we're going to paste. So that's the data and if everything worked we're going to look at the genotypes. That's your genotypes. So there's a macro that captured the genotypes and now of course you can zoom in to the genotypes. 
What I've done, I've clicked the genotype table and there's a macro that captured that. So your samples in this case are only numbered in the same order that they are over here. But those are your genotypes with the four alleles that it was able to capture for this set of SNPs. Some of these guys the SNPs didn't work so you can see some blanks. The clusters are no good so those SNPs didn't have good data, so you get blanks. This polyploid file converter is specific to the 96 SNPs only. Right now we generated our own cluster file based on the theta values. It seems to work pretty well. I know that Dave Douches has looked at this data relative to others and it seems to make sense.
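The idea behind the converter, calling a dosage genotype from where theta falls relative to per-SNP boundaries, can be sketched in a few lines. This is not the macro from the workshop file: the boundary values, SNP names and genotype labels below are hypothetical, and the sketch assumes theta runs from 0 (all A) to 1 (all B) with four boundaries separating five tetraploid clusters.

```python
from bisect import bisect_right

# Five tetraploid dosage classes, ordered from low theta to high theta
GENOTYPES = ["AAAA", "AAAB", "AABB", "ABBB", "BBBB"]

# Hypothetical boundaries between adjacent clusters, one list per SNP.
# An empty list marks a SNP whose clusters could not be defined.
boundaries = {
    "snp_example": [0.12, 0.38, 0.62, 0.88],
    "snp_failed": [],
}

def call_genotype(theta, snp_boundaries):
    """Return the dosage genotype for one theta value, or "" (blank) if no call."""
    if theta is None or len(snp_boundaries) != len(GENOTYPES) - 1:
        return ""
    # bisect_right counts how many boundaries lie at or below theta,
    # which is the index of the cluster the point falls into
    return GENOTYPES[bisect_right(snp_boundaries, theta)]

thetas = [0.05, 0.20, 0.50, 0.70, 0.95]
calls = [call_genotype(t, boundaries["snp_example"]) for t in thetas]
print(calls)
print(repr(call_genotype(0.50, boundaries["snp_failed"])))
```

Because the boundaries are set SNP by SNP, a table like this only applies to the SNP set it was built for, which is why the workshop file is specific to these 96 SNPs.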
The question is, what kind of task is this to do for 8,000? The task that I don't want to hear is to have my technician come up and do this one by one. We actually had a summer student do most of this and we verified it. We had an intern do this. For 96 it's very doable. For 8,300 loci it won't be. Like I said at the beginning, we're working with Illumina. We've provided Illumina with this data and the clustering, and they're actually developing the polyploid data software so they can actually capture this data and you'll be able to export it right from the Genome Studio software. They hope to really have this in a couple months. About the same time you should be getting your array data from your chip. We should be capturing that.
The question is, can you make this haplotypes instead of just genotypes. That's kind of the next step because we don't know how close these SNPs are. Some of these SNPs, we know the haplotype already within genes, and so that's really the data that Robin is putting together because she knows where these are within the sequence. Some of them really are haplotypes because there's more than one SNP per contig. So that'll give you haplotype, but right now we're not there. We've chosen the SNPs to try to represent as many genes as possible, except the candidate genes that people are interested in, where we may have chosen more than one SNP within those genes in general. The haplotype part is actually defined on how you pick your SNPs, right? Based on your sequence data. So that's, in a sense, before this. When it comes to mapping data, if this was actually mapping data, then you could start defining haplotypes in the context of mapping because haplotypes are bins in a sense, genetic bins, which Chris might talk about as well. If you don't have polymorphism between markers, even though they're unique SNPs in the genome, your mapping population might not have recombination between those two SNPs and they become basically bins, which is equivalent to a haplotype in a sense. Where all genotypes within that bin have the same haplotype. So, we're not at haplotypes yet. That's really the goal, yes, to look at haplotypes.
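The bin idea can be illustrated with a toy example: markers that show no recombination between them in a mapping population have identical calls in every individual, so they collapse into one bin. The sketch below is a simplification under stated assumptions. It groups markers purely by identical call patterns, ignores map order and missing data, and uses made-up marker names and calls.

```python
from collections import defaultdict

def bin_markers(genotypes):
    """Group markers whose calls are identical across all individuals.

    genotypes: {marker name: list of calls, one per individual, same order}.
    Returns a list of bins, each a list of marker names.
    """
    bins = defaultdict(list)
    for marker, marker_calls in genotypes.items():
        bins[tuple(marker_calls)].append(marker)
    return list(bins.values())

# Made-up calls for four individuals of a hypothetical mapping population
population_calls = {
    "snp1": ["A", "H", "B", "H"],
    "snp2": ["A", "H", "B", "H"],  # never separates from snp1: same bin
    "snp3": ["A", "B", "B", "H"],  # differs in one individual: its own bin
}

marker_bins = bin_markers(population_calls)
print(marker_bins)
```

A real binning step would work along an ordered map and tolerate missing calls; this only shows why unique SNPs can behave as a single unit in a given population.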
The question is, can you embark on QTL analysis with theta values? We haven't tried that. I mean, yeah, is that a relative thing? Chris is going to answer that next time. Chris thinks yes.
The way we're set up right now, you have to be able to access an Illumina lab. It doesn't have to be our lab. It could be your own lab or a third party and so on. They're going to export the files you need for Genome Studio, or export genotypes, depending how close you are with that lab. So they can export genotypes. Right now the only genotypes they're going to export are the diploid ones, so you'll be limited to that. You have to buy Genome Studio, unfortunately, to be able to play like we just did. If you have access to a lab that has Genome Studio you can do that. There will be different levels of flexibility with your genotyping data, where the lab can say this is your genotyping data, we've called it based on the solCAP clusters. That's something you would ask your lab to do if you don't have Genome Studio. It's fairly quick and easy. You saw how fast that happened. That's something you'd have to work out with your service provider. We would certainly provide the files to do so. Is there a question, maybe from in Cyberspace? Nope, not right now. Ok.
I'm sorry, can you repeat that?
Does Genome Studio compute LD? No. It really doesn't. Genome Studio is really a software package to export genotypes, and from there you import them into mapping and QTL and LD software and so on. So for LD, we've used, for example, TASSEL and Structure, which are nice programs to upload this data into. We've actually done that already with this very data set and it works quite well. Thank you.
Development of this page was supported in part by the National Institute of Food and Agriculture (NIFA) Solanaceae Coordinated Agricultural Project, agreement 2009-85606-05673, administered by Michigan State University. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the view of the United States Department of Agriculture.
Mention of specific companies is not intended for promotional purposes.