0%

Coursera公开课-Algorithm-WordNet

问题

Part-Two week1的作业是wordnet。wordnet是一个单词网络。用有向边来表示单词的关系。比如miracle —>event,说明miracle是event的一种,他们是isa关系。

第一个问题是对于给定的一个有向图,找出两者之间的最短距离和公共祖先。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
public class SAP {
// constructor takes a digraph (not necessarily a DAG)
public SAP(Digraph G)
// length of shortest ancestral path between v and w; -1 if no such path
public int length(int v, int w)
// a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
public int ancestor(int v, int w)
// length of shortest ancestral path between any vertex in v and any vertex in w; -1 if no such path
public int length(Iterable<Integer> v, Iterable<Integer> w)
// a common ancestor that participates in shortest ancestral path; -1 if no such path
public int ancestor(Iterable<Integer> v, Iterable<Integer> w)
// do unit testing of this class
public static void main(String[] args)
}

第二个问题是对于给定的一个wordnet,找出wordnet中的所有单词,并计算任意两个单词之间的距离。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
public class WordNet {
// constructor takes the name of the two input files
public WordNet(String synsets, String hypernyms)
// returns all WordNet nouns
public Iterable<String> nouns()
// is the word a WordNet noun?
public boolean isNoun(String word)
// distance between nounA and nounB (defined below)
public int distance(String nounA, String nounB)
//distance(A, B) = distance is the minimum length of any ancestral path between any synset v of A and any synset w of B.
// a synset (second field of synsets.txt) that is the common ancestor of nounA and nounB
// in a shortest ancestral path (defined below)
public String sap(String nounA, String nounB)
// do unit testing of this class
public static void main(String[] args)
}

最后一个问题是找出wordnet当中和其他单词关联性最小的一个单词,换句话说,我们要找出一个单词Ai使得

1
di   =   dist(Ai, A1)   +   dist(Ai, A2)   +   ...   +   dist(Ai, An)

最大。

1
2
3
4
5
public class Outcast {
public Outcast(WordNet wordnet) // constructor takes a WordNet object
public String outcast(String[] nouns) // given an array of WordNet nouns, return an outcast
public static void main(String[] args) // see test client below
}

更具体的题目描述,请看:http://coursera.cs.princeton.edu/algs4/assignments/wordnet.html

输入数据格式
wordnet输入有两个文件,一个是synsets,存储了wordnet当中所有单词,主要有三部分,第一部分是一个整数,是sysnets中的id,第二部分是一组单词,第三部分是这组单词的解释。如

1
2
3
> 36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire 
>
>

表示单词集合{AND_circuit, AND_gate}在sysnet中的id是36,他们有个共同的解释就是a circuit in a computer that fires only when all of its inputs fire 。

wordnet第二个输入文件是hypernyms,存储了synsets之间的关系,如

1
2
> 164,21012,56099
>

表示sysnet 164 (“Actifed”)有两个子类,21012(“antihistamine”)和56099 (“nasal_decongestant”),这和sysnets文件的输入是对应的

164,Actifed,trade name for a drug containing an antihistamine and a decongestant…
21012,antihistamine,a medicine used to treat allergies…
56099,nasal_decongestant,a decongestant that provides temporary relief of nasal…

思路分析

从哪里入手呢?官方很贴心地给出了步骤(Possible progress steps):http://coursera.cs.princeton.edu/algs4/checklists/wordnet.html

  • Create the data type SAP
  • Implement the remaining WordNet methods.
  • Implement Outcast.
  • 如果没有思路,可以先看下官方给的checklists,非常有用。

我们看SAP,求最短路径和公共祖先。因为wordnet是一个有向图,并且不会有环(这点题目中已说明),因此求最短路径只需要走一遍bfs即可,同时记录下可达点。有了SAP后,求wordnet中的distance其实就是直接套用SAP的代码。wordnet中最关键的还是数据如何存储。如果是以id来作为key的话,遇到一个id对应多个单词的情况处理起来比较麻烦,因此我们以单词为key,存储单词对应的id(可能有多个),求wordnet中的distance(String nounA, String nounB),其实就是求单词对应id的最短路径,直接用SAP来求即可。Outcast应该是好实现的了。直接遍历wordnet当中所有单词两两之间的距离,求一个最小的即可。时间复杂度是O(N^2)。

代码实现

SAP

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
import edu.princeton.cs.algs4.Digraph;
import edu.princeton.cs.algs4.In;
import edu.princeton.cs.algs4.StdIn;
import edu.princeton.cs.algs4.StdOut;
public class SAP {
private static final int INFINITY = Integer.MAX_VALUE;
private Digraph G;
public SAP(Digraph G) {
checkCondition(G == null);
this.G = new Digraph(G);
}
// length of shortest ancestral path between v and w; -1 if no such path
public int length(int v, int w) {
checkCondition(v < 0 || v >= G.V());
checkCondition(w < 0 || w >= G.V());
return getLengthOf(v, w);
}
// a common ancestor of v and w that participates in a shortest ancestral path; -1 if no such path
public int ancestor(int v, int w) {
checkCondition(v < 0 || v >= G.V());
checkCondition(w < 0 || w >= G.V());
return getAncestorOf(v, w);
}
// length of shortest ancestral path between any vertex in v and any vertex in w; -1 if no such path
public int length(Iterable<Integer> v, Iterable<Integer> w) {
checkCondition(v == null);
checkCondition(w == null);
for (int v1 : v) checkCondition(v1 < 0 || v1 >= G.V());
for (int w1 : w) checkCondition(w1 < 0 || w1 >= G.V());
return getLengthOf(v, w);
}
// a common ancestor that participates in shortest ancestral path; -1 if no such path
public int ancestor(Iterable<Integer> v, Iterable<Integer> w) {
checkCondition(v == null);
checkCondition(w == null);
for (int v1 : v) checkCondition(v1 < 0 || v1 >= G.V());
for (int w1 : w) checkCondition(w1 < 0 || w1 >= G.V());
return getAncestorOf(v, w);
}
private int getAncestorOf(Object v, Object w) {
int ancestor = -1;
DeluxeBFS uv = new DeluxeBFS(G, v);
DeluxeBFS uw = new DeluxeBFS(G, w);
int result = INFINITY;
for(int i = 0; i < G.V(); i++) {
if(uv.hasPathTo(i) && uw.hasPathTo(i)) {
int distance = uv.distTo(i) + uw.distTo(i);
if(distance < result) {
result = distance;
ancestor = i;
}
}
}
return ancestor;
}
private int getLengthOf(Object v, Object w) {
DeluxeBFS uv = new DeluxeBFS(G, v);
DeluxeBFS uw = new DeluxeBFS(G, w);
int result = INFINITY;
for(int i = 0; i < G.V(); i++) {
if(uv.hasPathTo(i) && uw.hasPathTo(i)) {
int distance = uv.distTo(i) + uw.distTo(i);
if(distance < result) {
result = distance;
}
}
}
return result == INFINITY ? -1 : result;
}
private void checkCondition(boolean inValid) {
if (inValid)
throw new IllegalArgumentException("param must be not null");
}
public static void main(String[] args) {
In in = new In(args[0]);
Digraph G = new Digraph(in);
SAP sap = new SAP(G);
while (!StdIn.isEmpty()) {
int v = StdIn.readInt();
int w = StdIn.readInt();
int length = sap.length(v, w);
int ancestor = sap.ancestor(v, w);
StdOut.printf("length = %d, ancestor = %d\n", length, ancestor);
}
}
}

WordNet

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
import edu.princeton.cs.algs4.DirectedCycle;
import edu.princeton.cs.algs4.Bag;
import edu.princeton.cs.algs4.Digraph;
import edu.princeton.cs.algs4.In;
import edu.princeton.cs.algs4.ST;
import java.util.ArrayList;
public class WordNet {
private ST<String, Bag<Integer>> mDataSet = new ST<>();
private ArrayList<String> mAllNouns = new ArrayList<>();
private Digraph mDigraph;
private SAP mSAP;
// constructor takes the name of the two input files
public WordNet(String synsets, String hypernyms) {
validateParam(hypernyms);
validateParam(synsets);
In synsetsIn = new In(synsets);
int V = 0;
while (synsetsIn.hasNextLine()) {
String[] data = synsetsIn.readLine().split(",");
int id = Integer.parseInt(data[0]);
String[] nouns = data[1].split(" ");
for (String noun : nouns) {
if (mDataSet.contains(noun)) {
mDataSet.get(noun).add(id);
} else {
Bag<Integer> bag = new Bag<Integer>();
bag.add(id);
mDataSet.put(noun, bag);
}
}
V++;
mAllNouns.add(data[1]);
}
mDigraph = new Digraph(V);
boolean[] isNotRoots = new boolean[V];
In hypernymsIn = new In(hypernyms);
while (hypernymsIn.hasNextLine()) {
String[] data = hypernymsIn.readLine().split(",");
int childId = Integer.parseInt(data[0]);
isNotRoots[childId] = true;
for (int i=1; i< data.length; i++) {
int parentId = Integer.parseInt(data[i]);
mDigraph.addEdge(childId, parentId);
}
}
if (mSAP == null) {
mSAP = new SAP(mDigraph);
}
int rootCount = 0;
for (boolean notRoot : isNotRoots)
if (!notRoot) rootCount++;
DirectedCycle cycle = new DirectedCycle(mDigraph);
cycle.hasCycle();
if (rootCount > 1 || cycle.hasCycle())
throw new IllegalArgumentException("not a valid WordNet! " +
"please check your input!");
}
// returns all WordNet nouns
public Iterable<String> nouns() {
return mDataSet.keys();
}
// is the word a WordNet noun?
public boolean isNoun(String word) {
validateParam(word);
return mDataSet.contains(word);
}
// distance between nounA and nounB (defined below)
public int distance(String nounA, String nounB) {
validateParam(nounA);
validateParam(nounB);
Bag<Integer> bagOfNounA = mDataSet.get(nounA);
Bag<Integer> bagOfNounB = mDataSet.get(nounB);
return mSAP.length(bagOfNounA, bagOfNounB);
}
// a synset (second field of synsets.txt) that is the common ancestor of nounA and nounB
// in a shortest ancestral path (defined below)
public String sap(String nounA, String nounB) {
validateParam(nounA);
validateParam(nounB);
Bag<Integer> bagOfNounA = mDataSet.get(nounA);
Bag<Integer> bagOfNounB = mDataSet.get(nounB);
int ancestor = mSAP.ancestor(bagOfNounA, bagOfNounB);
return mAllNouns.get(ancestor);
}
private void validateParam(Object param) {
if (param == null)
throw new IllegalArgumentException("param must be not null");
}
// do unit testing of this class
public static void main(String[] args) {
}
}

Outcast

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
import edu.princeton.cs.algs4.In;
import edu.princeton.cs.algs4.StdOut;
public class Outcast {
private WordNet mWordNet;
// constructor takes a WordNet object
public Outcast(WordNet wordnet) {
validateParam(wordnet);
mWordNet = wordnet;
}
// given an array of WordNet nouns, return an outcast
public String outcast(String[] nouns) {
validateParam(nouns);
int[] distance = new int[nouns.length];
for (int i=0; i<nouns.length-1; i++){
for (int j=i+1; j<nouns.length; j++){
int dist = mWordNet.distance(nouns[i], nouns[j]);
if (i != j) distance[j] += dist;
distance[i] += dist;
}
}
int maxDistance = 0;
int maxIndex = 0;
for (int i=0; i < distance.length; i++){
if (distance[i] > maxDistance){
maxDistance = distance[i];
maxIndex = i;
}
}
return nouns[maxIndex];
}
private void validateParam(Object param) {
if (param == null)
throw new IllegalArgumentException("param must be not null");
}
// see test client below
public static void main(String[] args) {
WordNet wordnet = new WordNet(args[0], args[1]);
Outcast outcast = new Outcast(wordnet);
for (int t = 2; t < args.length; t++) {
In in = new In(args[t]);
String[] nouns = in.readAllStrings();
StdOut.println(args[t] + ": " + outcast.outcast(nouns));
}
}
}

总结

不得不说官方的API设计得非常好,甚至连test case都帮你搞定,让学生可以专注于算法的设计而不用去考虑其他琐碎的事情,你需要做的仅仅是想如何数据结构,使用什么算法去实现。checklists也很详细,做题目之前把checklists看一遍可以省掉你不少时间。更令人称赞的是他们设计得这套评分系统,不仅仅考察输出结果的正确性,包括对于不合法的输入检查,时间复杂度和内存的使用都在考察范围之内,要想一次拿100分还真不容易。