Monday, March 12, 2012

How can we improve the cover rate of the model?

Hi all, when I train my data mining models, the model cover rate is very low: the training data set has 82 rows, but only 25 cases appear in the trained models. How can I improve the cover rate, and with it the quality of the models? I am using SQL Server 2005.

Cheers.

Please explain what you mean by "cover rate" and what you are trying to accomplish. 82 rows is a pretty small data set in general, especially if you have many attributes.|||

Hi, yes. My training data set is quite small; it is some university marks data for analysis.

The inputs in the data set are mark1, mark2, marker1 and marker2. The output I want to predict is the agreed mark, based on mark1 and mark2 as given by marker1 and marker2 respectively.

To make the classification task easier, I have discretized the continuous marks into categorical values.

By "cover rate" I mean the number of cases covered by the mining model. The training data set has 82 rows, but the mining model covered only 25 of them.
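For readers unfamiliar with discretization, here is a minimal Python sketch of the idea. The thread does not say which bins were actually used, so the equal-width bands of 10 marks and the function name `discretize_mark` are assumptions made purely for illustration.

```python
def discretize_mark(mark, width=10):
    """Map a numeric mark to a band label such as '60-69'.

    `width` is a hypothetical bin width; missing marks (None) stay None.
    """
    if mark is None:
        return None
    low = int(mark // width) * width  # lower edge of the band
    return f"{low}-{low + width - 1}"


# A few marks taken from the training data posted later in the thread.
for mark in (38, 55, 60.5, 68, None):
    print(mark, "->", discretize_mark(mark))
```

Coarser bands give each category more cases, which can matter with a data set this small.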

I hope this explanation is clear.

Thanks a lot.

|||Do you mean it accurately classified 25 cases?|||

Hi, Jamie, that is the number of cases all the models covered in training. I have 82 rows in the training set, but the clustering model, for example, grouped only 25 cases (rows) into clusters.

Thanks a lot.

|||I'm not sure what's going on here - I think what you are saying is that the cluster support of all clusters when added together is only 25 when it should be 82. They should all be accounted for. You can always go to the prediction query tab and do a prediction query for Cluster() against your training data. You may also want to check the cluster probability as well. If the cluster probability is something very close to 1/# clusters, then it's likely the clusters aren't very strong. This isn't unexpected for such a small data set, but I'm not sure why you're only getting 25 altogether.|||
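The "close to 1/# clusters" check described above can be sketched in Python as follows. This assumes you have already exported the per-cluster probabilities for each case from a prediction query; the function name `is_weak_assignment` and the 0.05 tolerance are hypothetical choices, not part of SQL Server.

```python
def is_weak_assignment(probabilities, tolerance=0.05):
    """Return True if the best cluster probability is within `tolerance` of 1/k.

    `probabilities` holds one case's membership probability for each of the
    k clusters. A maximum near 1/k means the case fits every cluster about
    equally well, so the clustering says little about it.
    """
    k = len(probabilities)
    return max(probabilities) - 1.0 / k <= tolerance


# Illustrative values only, not output from a real model.
print(is_weak_assignment([0.35, 0.33, 0.32]))  # nearly uniform across 3 clusters
print(is_weak_assignment([0.90, 0.05, 0.05]))  # clearly belongs to one cluster
```

If most cases come back as weak, the clusters are probably not very distinct, which would not be surprising with 82 rows.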

Hi, Jamie, is it possible that the model won't cover all the records during training?

Thanks a lot

|||No - it covers all of the data. If you can post the 82 rows we can take a look at it here.|||

Hi, Jamie, below is the training data I used to build the mining models.

.........................................................................................................................................

mark1,mark2,Agreed,marker1,marker2,difference
55,55,55,TC,GR,0
66,69,68,CM,JQ,3
38,37,38,JT,JQ,1
52,57,57,ST,GR,5
64,60,63,JWh,GR,4
54,54,54,BW,JW,0
65,69,67,CM,JQ,4
68,68,68,JWh,MW,0
76,68,72,CM,JWh,8
48,47,48,CL,GR,1
68,62,64,AL,JWh,6
43,40,43,BW,Rdu,3
64,,64,MW,ST,
65,65,65,DG,MB,0
,,,GR,ST,
64,,61,CM,DG,
55,56,56,CM,TC,1
67,61,65,DG,CM,6
54,54,54,JM,CL,0
46,46,46,GR,BS,0
68,72,68,JWh,CM,4
60,60,60,BW,BS,0
65,65,65,DG,GR,0
57,,57,CH,DG,
58,55,58,CS,JT,3
51,51,51,BS,EC,0
,64,65,MW,TC,
66,70,67,DG,RC,4
,,,DG,CM,
66,57,64,JWh,AL,9
77,,74,CM,DG,
54,56,55,MW,TC,2
61,,57,JQ,JT,
61,61,61,DG,CH,0
68,68,68,CH,JM,0
61,66,61,CH,RC,5
68,68,68,BS,CS,0
62,63,63,DG,CH,1
62,63,63,TC,DG,1
63,58,60,JQ,CM,5
76,70,74,JWh,TC,6
68,68,68,CS,BS,0
60,56,58,CM,JWh,4
66,64,65,JW,BW,2
63,,64,TC,DG,
58,,58,RD,DG,
64,72,68,TC,MH,8
72,,72,CH,DG,
66,60,66,GR,RC,6
58,58,58,BSm,GM,0
51,49,50,GR,CM,2
62,,62,BW,CH,
53,50,50,TC,GR,3
,74,74,JWh,MW,
61,60.5,61,JT,CS,0.5
62,63,63,DG,CH,1
58,58,58,BS,JM,0
64,66,66,JT,BS,2
63,60,62,JWh,TC,3
55,,55,MB,DG,
72,,72,BSm,SS,
62,67,64,RD,GM,5
71,67,70,GR,CM,4
62,66,62,CS,CM,4
,68,52,DF,MH,
65,65,65,GM,TC,0
55,56,56,EC,TC,1
66,,66,CH,DG,
70,72,70,JM,CL,2
,,,BW,JT,
57,57,57,DG,CM,0
58,58,58,TC,GM,0
60,60,60,DG,RD,0
54,55,55,TC,CS,1
,64,67,MW,JWh,
60,60,60,DG,GR,0
78,68,72,JWh,MW,10
60,60,60,GR,BS,0
43,48,46,ST,SS,5
57,61,60,SS,ST,4
60,55,60,MW,DF,5
62,62,62,ST,MW,0

...............................................................................................................................

The first row contains the attribute names.

Thanks a lot for that.

|||

Does your data have a key column? The algorithms need a unique key column to determine that each row is a case. The "Agreed" column has 25 distinct values, so it looks like that is what you are using for your key.

I think you need to add an additional "ID" column to uniquely identify each row.
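One quick way to confirm this kind of problem outside SQL Server is to compare the row count with the number of distinct values in the candidate key. The Python sketch below does that on the first eight rows of the data posted above (a subset, for brevity); the helper names `distinct_count` and `add_row_id` are made up for this example.

```python
import csv
import io

# First rows of the training data posted in this thread (subset only).
SAMPLE = """mark1,mark2,Agreed,marker1,marker2,difference
55,55,55,TC,GR,0
66,69,68,CM,JQ,3
38,37,38,JT,JQ,1
52,57,57,ST,GR,5
64,60,63,JWh,GR,4
54,54,54,BW,JW,0
65,69,67,CM,JQ,4
68,68,68,JWh,MW,0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def distinct_count(rows, column):
    """Number of distinct values in a column (an empty value counts once)."""
    return len({row[column] for row in rows})


def add_row_id(rows, id_column="ID"):
    """Return copies of the rows with a sequential, unique ID column added."""
    return [{id_column: i, **row} for i, row in enumerate(rows, start=1)]


rows = load_rows(SAMPLE)
# If these two numbers differ, the column cannot identify each row as a case.
print(len(rows), distinct_count(rows, "Agreed"))

keyed = add_row_id(rows)
print(len(keyed), distinct_count(keyed, "ID"))
```

Even in this small subset the "Agreed" column repeats a value (68 appears twice), so it has fewer distinct values than rows, while the generated ID has exactly one value per row.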

|||

Hi, Jamie, thanks a lot. I got it working by following your suggestion.

Yes, the problem was that the column I used as the key has only 25 distinct values, so only 25 cases were covered by the trained model.

Thanks a lot.
