问题 Part-Two week1的作业是wordnet。wordnet是一个单词网络。用有向边来表示单词的关系。比如miracle —>event,说明miracle是event的一种,他们是isa关系。
第一个问题是对于给定的一个有向图,找出两者之间的最短距离和公共祖先。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 public class SAP { public SAP (Digraph G) public int length (int v, int w) public int ancestor (int v, int w) public int length (Iterable<Integer> v, Iterable<Integer> w) public int ancestor (Iterable<Integer> v, Iterable<Integer> w) public static void main (String[] args) }
第二个问题是对于给定的一个wordnet,找出wordnet中的所有单词,并计算任意两个单词之间的距离。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 public class WordNet { public WordNet (String synsets, String hypernyms) public Iterable<String> nouns () public boolean isNoun (String word) public int distance (String nounA, String nounB) public String sap (String nounA, String nounB) public static void main (String[] args) }
最后一个问题是找出wordnet当中和其他单词关联性最小的一个单词,换句话说,我们要找出一个单词Ai使得
1 di = dist(Ai, A1) + dist(Ai, A2) + ... + dist(Ai, An)
最大。
1 2 3 4 5 public class Outcast { public Outcast (WordNet wordnet) public String outcast (String[] nouns) public static void main (String[] args) }
更具体的题目描述,请看:http://coursera.cs.princeton.edu/algs4/assignments/wordnet.html
输入数据格式 wordnet输入有两个文件,一个是synsets,存储了wordnet当中所有单词,主要有三部分,第一部分是一个整数,是sysnets中的id,第二部分是一组单词,第三部分是这组单词的解释。如
1 2 3 > 36,AND_circuit AND_gate,a circuit in a computer that fires only when all of its inputs fire > >
表示单词集合{AND_circuit, AND_gate}在sysnet中的id是36,他们有个共同的解释就是a circuit in a computer that fires only when all of its inputs fire 。
wordnet第二个输入文件是hypernyms,存储了synsets之间的关系,如
表示sysnet 164 (“Actifed”)有两个子类,21012(“antihistamine”)和56099 (“nasal_decongestant”),这和sysnets文件的输入是对应的
164,Actifed,trade name for a drug containing an antihistamine and a decongestant… 21012,antihistamine,a medicine used to treat allergies… 56099,nasal_decongestant,a decongestant that provides temporary relief of nasal…
思路分析 从哪里入手呢?官方很贴心地给出了步骤(Possible progress steps):http://coursera.cs.princeton.edu/algs4/checklists/wordnet.html
Create the data type SAP
Implement the remaining WordNet methods.
Implement Outcast.
如果没有思路,可以先看下官方给的checklists,非常有用。
我们看SAP,求最短路径和公共祖先。因为wordnet是一个有向图,并且不会有环(这点题目中已说明),因此求最短路径只需要走一遍bfs即可,同时记录下可达点。有了SAP后,求wordnet中的distance其实就是直接套用SAP的代码。wordnet中最关键的还是数据如何存储。如果是以id来作为key的话,遇到一个id对应多个单词的情况处理起来比较麻烦,因此我们以单词为key,存储单词对应的id(可能有多个),求wordnet中的distance(String nounA, String nounB),其实就是求单词对应id的最短路径,直接用SAP来求即可。Outcast应该是好实现的了。直接遍历wordnet当中所有单词两两之间的距离,求一个最小的即可。时间复杂度是O(N^2)。
代码实现 SAP 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 import edu.princeton.cs.algs4.Digraph;import edu.princeton.cs.algs4.In;import edu.princeton.cs.algs4.StdIn;import edu.princeton.cs.algs4.StdOut;public class SAP { private static final int INFINITY = Integer.MAX_VALUE; private Digraph G; public SAP (Digraph G) { checkCondition(G == null ); this .G = new Digraph (G); } public int length (int v, int w) { checkCondition(v < 0 || v >= G.V()); checkCondition(w < 0 || w >= G.V()); return getLengthOf(v, w); } public int ancestor (int v, int w) { checkCondition(v < 0 || v >= G.V()); checkCondition(w < 0 || w >= G.V()); return getAncestorOf(v, w); } public int length (Iterable<Integer> v, Iterable<Integer> w) { checkCondition(v == null ); checkCondition(w == null ); for (int v1 : v) checkCondition(v1 < 0 || v1 >= G.V()); for (int w1 : w) checkCondition(w1 < 0 || w1 >= G.V()); return getLengthOf(v, w); } public int ancestor (Iterable<Integer> v, Iterable<Integer> w) { checkCondition(v == null ); checkCondition(w == null ); for (int v1 : v) checkCondition(v1 < 0 || v1 >= G.V()); for (int w1 : w) checkCondition(w1 < 0 || w1 >= G.V()); return getAncestorOf(v, w); } private int getAncestorOf (Object v, Object w) { int ancestor = -1 ; DeluxeBFS uv = new DeluxeBFS (G, v); DeluxeBFS uw = new DeluxeBFS (G, w); int result = INFINITY; for (int i = 0 ; i < G.V(); i++) { if (uv.hasPathTo(i) && uw.hasPathTo(i)) { int distance = uv.distTo(i) + uw.distTo(i); if (distance < result) { result = distance; ancestor = i; } } } return ancestor; } private int getLengthOf (Object v, Object w) { DeluxeBFS uv = new DeluxeBFS (G, v); DeluxeBFS uw = new DeluxeBFS (G, w); int result = INFINITY; for (int i = 0 ; i < G.V(); i++) { if (uv.hasPathTo(i) && uw.hasPathTo(i)) { int distance = uv.distTo(i) + uw.distTo(i); if (distance < result) { result = distance; } } } return result == INFINITY ? -1 : result; } private void checkCondition (boolean inValid) { if (inValid) throw new IllegalArgumentException ("param must be not null" ); } public static void main (String[] args) { In in = new In (args[0 ]); Digraph G = new Digraph (in); SAP sap = new SAP (G); while (!StdIn.isEmpty()) { int v = StdIn.readInt(); int w = StdIn.readInt(); int length = sap.length(v, w); int ancestor = sap.ancestor(v, w); StdOut.printf("length = %d, ancestor = %d\n" , length, ancestor); } } }
WordNet 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 import edu.princeton.cs.algs4.DirectedCycle;import edu.princeton.cs.algs4.Bag;import edu.princeton.cs.algs4.Digraph;import edu.princeton.cs.algs4.In;import edu.princeton.cs.algs4.ST;import java.util.ArrayList;public class WordNet { private ST<String, Bag<Integer>> mDataSet = new ST <>(); private ArrayList<String> mAllNouns = new ArrayList <>(); private Digraph mDigraph; private SAP mSAP; public WordNet (String synsets, String hypernyms) { validateParam(hypernyms); validateParam(synsets); In synsetsIn = new In (synsets); int V = 0 ; while (synsetsIn.hasNextLine()) { String[] data = synsetsIn.readLine().split("," ); int id = Integer.parseInt(data[0 ]); String[] nouns = data[1 ].split(" " ); for (String noun : nouns) { if (mDataSet.contains(noun)) { mDataSet.get(noun).add(id); } else { Bag<Integer> bag = new Bag <Integer>(); bag.add(id); mDataSet.put(noun, bag); } } V++; mAllNouns.add(data[1 ]); } mDigraph = new Digraph (V); boolean [] isNotRoots = new boolean [V]; In hypernymsIn = new In (hypernyms); while (hypernymsIn.hasNextLine()) { String[] data = hypernymsIn.readLine().split("," ); int childId = Integer.parseInt(data[0 ]); isNotRoots[childId] = true ; for (int i=1 ; i< data.length; i++) { int parentId = Integer.parseInt(data[i]); mDigraph.addEdge(childId, parentId); } } if (mSAP == null ) { mSAP = new SAP (mDigraph); } int rootCount = 0 ; for (boolean notRoot : isNotRoots) if (!notRoot) rootCount++; DirectedCycle cycle = new DirectedCycle (mDigraph); cycle.hasCycle(); if (rootCount > 1 || cycle.hasCycle()) throw new IllegalArgumentException ("not a valid WordNet! " + "please check your input!" ); } public Iterable<String> nouns () { return mDataSet.keys(); } public boolean isNoun (String word) { validateParam(word); return mDataSet.contains(word); } public int distance (String nounA, String nounB) { validateParam(nounA); validateParam(nounB); Bag<Integer> bagOfNounA = mDataSet.get(nounA); Bag<Integer> bagOfNounB = mDataSet.get(nounB); return mSAP.length(bagOfNounA, bagOfNounB); } public String sap (String nounA, String nounB) { validateParam(nounA); validateParam(nounB); Bag<Integer> bagOfNounA = mDataSet.get(nounA); Bag<Integer> bagOfNounB = mDataSet.get(nounB); int ancestor = mSAP.ancestor(bagOfNounA, bagOfNounB); return mAllNouns.get(ancestor); } private void validateParam (Object param) { if (param == null ) throw new IllegalArgumentException ("param must be not null" ); } public static void main (String[] args) { } }
Outcast 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 import edu.princeton.cs.algs4.In;import edu.princeton.cs.algs4.StdOut;public class Outcast { private WordNet mWordNet; public Outcast (WordNet wordnet) { validateParam(wordnet); mWordNet = wordnet; } public String outcast (String[] nouns) { validateParam(nouns); int [] distance = new int [nouns.length]; for (int i=0 ; i<nouns.length-1 ; i++){ for (int j=i+1 ; j<nouns.length; j++){ int dist = mWordNet.distance(nouns[i], nouns[j]); if (i != j) distance[j] += dist; distance[i] += dist; } } int maxDistance = 0 ; int maxIndex = 0 ; for (int i=0 ; i < distance.length; i++){ if (distance[i] > maxDistance){ maxDistance = distance[i]; maxIndex = i; } } return nouns[maxIndex]; } private void validateParam (Object param) { if (param == null ) throw new IllegalArgumentException ("param must be not null" ); } public static void main (String[] args) { WordNet wordnet = new WordNet (args[0 ], args[1 ]); Outcast outcast = new Outcast (wordnet); for (int t = 2 ; t < args.length; t++) { In in = new In (args[t]); String[] nouns = in.readAllStrings(); StdOut.println(args[t] + ": " + outcast.outcast(nouns)); } } }
总结 不得不说官方的API设计得非常好,甚至连test case都帮你搞定,让学生可以专注于算法的设计而不用去考虑其他琐碎的事情,你需要做的仅仅是想如何数据结构,使用什么算法去实现。checklists也很详细,做题目之前把checklists看一遍可以省掉你不少时间。更令人称赞的是他们设计得这套评分系统,不仅仅考察输出结果的正确性,包括对于不合法的输入检查,时间复杂度和内存的使用都在考察范围之内,要想一次拿100分还真不容易。