pith. machine review for the scientific record.

arxiv: 1706.05098 · v1 · submitted 2017-06-15 · 💻 cs.LG · cs.AI · stat.ML

Recognition: 3 theorem links


An Overview of Multi-Task Learning in Deep Neural Networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 09:46 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords multi-task learning · deep neural networks · parameter sharing · auxiliary tasks · generalization · inductive bias
0 comments

The pith

Multi-task learning improves deep network performance by sharing parameters across related tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys multi-task learning in deep neural networks and shows how it has succeeded in applications from natural language processing to drug discovery. It introduces the two main methods: hard parameter sharing, where hidden layers are shared across tasks, and soft parameter sharing, where each task has separate parameters that are regularized to stay similar. The overview reviews existing work and supplies guidelines for selecting auxiliary tasks that provide useful inductive bias to the main task. Practitioners can apply these ideas to reduce overfitting and improve generalization when data for the primary task is limited.
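
As a rough illustration of the hard parameter sharing the overview describes, the sketch below shares a small stack of hidden layers across two tasks and gives each task its own output head. It assumes PyTorch is available; the layer sizes, number of tasks, and equal loss weighting are illustrative choices, not prescriptions from the paper.

```python
# Minimal hard parameter sharing sketch (assumes PyTorch; sizes are illustrative).
import torch
import torch.nn as nn

class HardSharingNet(nn.Module):
    def __init__(self, in_dim=128, hidden=64, n_classes_a=5, n_classes_b=3):
        super().__init__()
        # Hidden layers shared by every task.
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Task-specific output heads.
        self.head_a = nn.Linear(hidden, n_classes_a)
        self.head_b = nn.Linear(hidden, n_classes_b)

    def forward(self, x):
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

model = HardSharingNet()
x = torch.randn(8, 128)
logits_a, logits_b = model(x)
# Joint objective: sum (or weighted sum) of the per-task losses; gradients
# from both tasks update the shared trunk.
loss = nn.functional.cross_entropy(logits_a, torch.randint(0, 5, (8,))) + \
       nn.functional.cross_entropy(logits_b, torch.randint(0, 3, (8,)))
loss.backward()
```

The shared trunk is updated by gradients from every task, which is the mechanism the paper credits with reducing overfitting on the main task.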

Core claim

Multi-task learning (MTL) has led to successes in many applications of machine learning by sharing information between tasks. In deep neural networks this is achieved primarily through hard parameter sharing, in which the same hidden layers serve all tasks, and soft parameter sharing, in which each task maintains its own parameters but they are encouraged to be similar through regularization. The paper reviews the literature on these approaches and provides concrete guidelines for choosing auxiliary tasks that benefit the main task.

What carries the argument

Hard parameter sharing and soft parameter sharing, the two dominant ways to transfer knowledge between tasks inside a single deep network.
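
For contrast, here is a minimal sketch of soft parameter sharing, again assuming PyTorch: each task keeps its own model, and an explicit penalty on the distance between corresponding parameters nudges them toward each other. The squared-L2 penalty and its weight are assumptions made for illustration; the surveyed literature also uses other norms (e.g., the trace norm) for the same purpose.

```python
# Minimal soft parameter sharing sketch (assumes PyTorch; penalty weight is assumed).
import torch
import torch.nn as nn

def make_task_model(in_dim=128, hidden=64, out_dim=5):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

model_a = make_task_model(out_dim=5)   # task A keeps its own parameters
model_b = make_task_model(out_dim=3)   # task B keeps its own parameters

def sharing_penalty(m_a, m_b):
    # Squared L2 distance between corresponding parameters; only layers with
    # matching shapes are compared in this toy setup.
    penalty = 0.0
    for p_a, p_b in zip(m_a.parameters(), m_b.parameters()):
        if p_a.shape == p_b.shape:
            penalty = penalty + (p_a - p_b).pow(2).sum()
    return penalty

x = torch.randn(8, 128)
loss_a = nn.functional.cross_entropy(model_a(x), torch.randint(0, 5, (8,)))
loss_b = nn.functional.cross_entropy(model_b(x), torch.randint(0, 3, (8,)))
lam = 1e-3  # regularization strength (assumed; tuned in practice)
total = loss_a + loss_b + lam * sharing_penalty(model_a, model_b)
total.backward()
```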

Load-bearing premise

The two parameter-sharing approaches, together with the auxiliary-task guidelines and the reviewed literature, cover the practical choices most practitioners need to make.

What would settle it

A working MTL system that achieves strong gains without using either hard or soft parameter sharing or without following the auxiliary-task guidelines would challenge the paper's framing.

read the original abstract

Multi-task learning (MTL) has led to successes in many applications of machine learning, from natural language processing and speech recognition to computer vision and drug discovery. This article aims to give a general overview of MTL, particularly in deep neural networks. It introduces the two most common methods for MTL in Deep Learning, gives an overview of the literature, and discusses recent advances. In particular, it seeks to help ML practitioners apply MTL by shedding light on how MTL works and providing guidelines for choosing appropriate auxiliary tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper provides a general overview of multi-task learning (MTL) in deep neural networks. It introduces the two most common MTL methods (hard and soft parameter sharing), surveys the existing literature on MTL applications and techniques, discusses recent advances, and offers guidelines for selecting auxiliary tasks to assist ML practitioners in applying MTL effectively.

Significance. If the literature coverage is representative, the survey is significant for organizing MTL methods into a clear framework, highlighting empirical successes across NLP, speech recognition, computer vision, and drug discovery, and supplying practical heuristics for auxiliary-task design. As an expository review with no new theorems or experiments, its value lies in consolidation and guidance rather than novel claims.

minor comments (2)
  1. [Introduction] The statement that MTL 'has led to successes in many applications' would be strengthened by citing one or two quantitative performance gains from the referenced works rather than remaining a high-level assertion.
  2. [Auxiliary-task guidelines] The heuristics are presented at a conceptual level; brief concrete examples drawn from the cited papers (e.g., specific task pairs in vision or NLP) would make the advice more immediately usable for practitioners. A minimal sketch of the weighted-loss setup follows below.
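
As a hedged illustration of the auxiliary-task recipe the guidelines describe (assuming PyTorch), the sketch below adds a second head to a shared encoder and down-weights its loss relative to the main task. The particular task pair and the 0.1 weight are assumptions for the sketch, not recommendations from the paper.

```python
# Auxiliary-task sketch: main loss plus a down-weighted auxiliary loss
# (assumes PyTorch; the weight 0.1 and the head sizes are illustrative).
import torch
import torch.nn as nn

shared = nn.Linear(32, 16)    # shared encoder
main_head = nn.Linear(16, 4)  # main task head
aux_head = nn.Linear(16, 10)  # auxiliary task head

x = torch.randn(8, 32)
h = torch.relu(shared(x))
main_loss = nn.functional.cross_entropy(main_head(h), torch.randint(0, 4, (8,)))
aux_loss = nn.functional.cross_entropy(aux_head(h), torch.randint(0, 10, (8,)))

aux_weight = 0.1  # assumed; chosen per task pair in practice
total_loss = main_loss + aux_weight * aux_loss
total_loss.backward()
```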

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the positive recommendation to accept. The referee's summary accurately reflects the scope of our overview, which focuses on introducing hard and soft parameter sharing, surveying the literature across application domains, and providing practical guidelines for auxiliary task selection.

Circularity Check

0 steps flagged

No significant circularity: expository review of external literature

full rationale

The paper is a survey that organizes existing MTL methods from the cited literature into categories (hard/soft parameter sharing) and provides practitioner guidelines drawn from prior work. No new derivations, equations, fitted parameters, or theorems are advanced. All claims are summaries or heuristics referencing external sources, with no self-referential reduction of any result to the paper's own inputs. The central contribution is classification and exposition, which is self-contained against the cited benchmarks and does not rely on any load-bearing self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a review paper, the central contribution rests on accurate summarization of prior work rather than new postulates or parameters.

pith-pipeline@v0.9.0 · 5372 in / 928 out tokens · 32018 ms · 2026-05-13T09:46:56.621684+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    Multi-task learning (MTL) has led to successes in many applications of machine learning, from natural language processing and speech recognition to computer vision and drug discovery. This article aims to give a general overview of MTL, particularly in deep neural networks. It introduces the two most common methods for MTL in Deep Learning, gives an overview of the literature, and discusses recent advances.

  • IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

Hard parameter sharing is the most commonly used approach to MTL in neural networks... In soft parameter sharing on the other hand, each task has its own model with its own parameters. The distance between the parameters of the model is then regularized in order to encourage the parameters to be similar.

  • IndisputableMonolith.Foundation.DimensionForcing dimension_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    MTL acts as a regularizer by introducing an inductive bias. As such, it reduces the risk of overfitting as well as the Rademacher complexity of the model.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

    cs.CL 2022-02 accept novelty 8.0

    Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.

  2. Finetuned Language Models Are Zero-Shot Learners

    cs.CL 2021-09 accept novelty 8.0

    Instruction tuning a 137B language model on over 60 NLP tasks described by instructions substantially boosts zero-shot performance on unseen tasks, outperforming larger GPT-3 models.

  3. Constrained Contextual Bandits with Adversarial Contexts

    cs.LG 2026-05 unverdicted novelty 7.0

    A modular reduction from budget-constrained contextual bandits with adversarial contexts to unconstrained bandits via surrogate rewards, yielding improved guarantees and an efficient algorithm based on SquareCB.

  4. DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

    cs.SE 2026-04 unverdicted novelty 7.0

    DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...

  5. Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Supervised fine-tuning of LLMs often fails to fully internalize all training instances due to five recurring causes including missing prerequisites and data conflicts, as diagnosed via a new framework across multiple models.

  6. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  7. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  8. PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

    cs.CL 2026-05 unverdicted novelty 6.0

    PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

  9. Bayesian Model Merging

    cs.LG 2026-05 unverdicted novelty 6.0

    Bayesian Model Merging introduces a bi-level optimization framework that merges task-specific models via closed-form Bayesian regression with an anchor prior and global hyperparameter search, outperforming baselines a...

  10. Learning Large-Scale Modular Addition with an Auxiliary Modulus

    cs.LG 2026-05 unverdicted novelty 6.0

    An auxiliary modulus during training reduces wrap-around issues and preserves train-test input distributions, enabling better accuracy and sample efficiency for large N and q in modular addition learning.

  11. Query-efficient model evaluation using cached responses

    cs.LG 2026-05 unverdicted novelty 6.0

    DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

  12. FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment

    cs.CV 2026-04 unverdicted novelty 6.0

    FryNet combines RGB and thermal imaging with adversarial regularization to segment oil areas, classify usability, and predict oxidation levels like PV and Totox with high accuracy on video data.

  13. From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation

    cs.CV 2026-04 unverdicted novelty 6.0

    Petro-SAM adapts SAM via a Merge Block for polarized views plus multi-scale fusion and color-entropy priors to jointly achieve grain-edge and lithology segmentation in petrographic images.

  14. Parameter-efficient Quantum Multi-task Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    QMTL uses shared VQC encoding plus task-specific quantum ansatz heads to achieve linear parameter scaling with the number of tasks while matching or exceeding classical multi-task baselines on three benchmarks.

  15. A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

    cs.LG 2026-03 unverdicted novelty 6.0

    iAmTime is a time-series foundation model that uses instruction-conditioned in-context learning from demonstrations to perform zero-shot adaptation on forecasting, imputation, classification, and related tasks.

  16. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  17. FLAME: Adaptive Mixture-of-Experts for Continual Multimodal Multi-Task Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    FLAME is an MoE architecture using modality-specific routers and low-rank compression of expert knowledge to support efficient continual multimodal multi-task learning while reducing catastrophic forgetting.

  18. Lost in the Tower of Babel: The Adverse Effects of Incidental Multilingualism in LLMs

    cs.CL 2026-05 unverdicted novelty 5.0

    Incidental multilingualism from uneven web training makes LLMs unequal, brittle, and opaque across languages.

  19. DynoSys: A Dynamic Systems Framework for Multimodal Integration of Genetic, Environmental, and Neurobiological Signals

    q-bio.OT 2026-05 unverdicted novelty 5.0

    DynoSys offers a unified dynamic systems model integrating genetic, environmental, and neurobiological signals to analyze longitudinal behavioral phenotypes in adolescents via harmonized representations and survival o...

  20. Learning the Weather-Grid Nexus via Weather-to-Voltage (W2V) Predictive Modeling

    eess.SY 2026-04 unverdicted novelty 5.0

    A compact neural network surrogate maps weather features to grid voltages on a 6717-bus Texas system, enabling grid-aware weather forecasting that prioritizes operationally critical conditions like wind drops.

  21. Exploring climate change effects on concurrent floods and concurrent droughts via statistical deep learning

    stat.AP 2026-04 unverdicted novelty 5.0

    The deep SPAR model shows concurrent floods and droughts becoming more likely in the Upper Danube by 2100 under high emissions, with changes in the dependence between catchments contributing substantially to the increase.

  22. Harmonizing MR Images Across 100+ Scanners: Multi-site Validation with Traveling Subjects and Real-world Protocols

    eess.IV 2026-04 conditional novelty 5.0

    HACA3^+ improves upon HACA3 with better artifact encoding, attention mechanisms, and training on 100+ scanners, validated via traveling subjects for better downstream performance.

  23. A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

    cs.LG 2026-03 unverdicted novelty 5.0

    iAmTime is a hierarchical transformer-based time series foundation model that uses semantic tokens and instruction-conditioned prompts to infer tasks from demonstrations, achieving improved zero-shot performance on fo...

  24. Multi-Dataset Cross-Domain Knowledge Distillation for Unified Medical Image Segmentation, Classification, and Detection

    cs.CV 2026-05 unverdicted novelty 4.0

    A multi-dataset cross-domain knowledge distillation approach improves unified performance on medical image segmentation, classification, and detection by transferring domain-invariant features from a joint teacher mod...

  25. Learning Coarse-to-Fine Osteoarthritis Representations under Noisy Hierarchical Labels

    cs.CV 2026-05 unverdicted novelty 4.0

    Dual-head training on hierarchical OA labels yields backbone-dependent gains in KL metrics, more ordered latent severity axes, and better saliency alignment with cartilage for some 3D backbones.

  26. Opportunistic Bone-Loss Screening from Routine Knee Radiographs Using a Multi-Task Deep Learning Framework with Sensitivity-Constrained Threshold Optimization

    cs.CV 2026-04 unverdicted novelty 4.0

    STR-Net achieves AUROC of 0.933 for binary bone-loss screening and 0.801 correlation for T-score estimation from knee X-rays on a held-out test set.

  27. SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection

    cs.CL 2026-04 unverdicted novelty 3.0

    A multi-head RoBERTa model with overlapping chunking and max-pooling achieves Macro-F1 of 0.80 on 3-way clarity classification and 0.51 on 9-way evasion strategy detection, ranking 11th in both subtasks of SemEval-202...

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 26 Pith papers

  1. [1]

    Abu-Mostafa, Y. S. (1990). Learning from hints in neural networks . Journal of Complexity , 6(2):192--198

  2. [2]

    Alonso, H. M. and Plank, B. (2017). When is multitask learning effective? Multitask learning for semantic sequence prediction under varying data conditions . In EACL

  3. [3]

    Ando, R. K. and Tong, Z. (2005). A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data . Journal of Machine Learning Research , 6:1817--1853

  4. [4]

    Argyriou, A. and Pontil, M. (2007). Multi-Task Feature Learning . In Advances in Neural Information Processing Systems

  5. [5]

    \" O ., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Raiman, J., Sengupta, S., and Shoeybi, M

    Arık, S. \" O ., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Raiman, J., Sengupta, S., and Shoeybi, M. (2017). Deep Voice: Real-time Neural Text-to-Speech . In ICML 2017

  6. [6]

    Bakker, B. and Heskes, T. (2003). Task Clustering and Gating for Bayesian Multitask Learning . Journal of Machine Learning Research , 1(1):83--99

  7. [7]

    Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling . Machine Learning , 28:7--39

  8. [8]

    Baxter, J. (2000). A Model of Inductive Bias Learning . Journal of Artificial Intelligence Research , 12:149--198

  9. [9]

    Ben-David, S. and Schuller, R. (2003). Exploiting task relatedness for multiple task learning . Learning Theory and Kernel Machines , pages 567--580

  10. [10]

    Bingel, J. and Søgaard, A. (2017). Identifying beneficial task relations for multi-task learning in deep neural networks . In EACL

  11. [11]

    Caruana, R. (1993). Multitask learning: A knowledge-based source of inductive bias . In Proceedings of the Tenth International Conference on Machine Learning

  12. [12]

    Caruana, R. (1998). Multitask Learning . Autonomous Agents and Multi-Agent Systems , 27(1):95--133

  13. [13]

    Caruana, R. and de Sa, V. R. (1997). Promoting poor features to supervisors: Some inputs work better as outputs . Advances in Neural Information Processing Systems 9: Proceedings of The 1996 Conference , 9:389

  14. [14]

    Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. (2010). Linear Algorithms for Online Multitask Classification . Journal of Machine Learning Research , 11:2901--2934

  15. [15]

    Chen, X., Kim, S., Lin, Q., Carbonell, J. G., and Xing, E. P. (2010). Graph-Structured Multi-task Regression and an Efficient Optimization Method for General Fused Lasso . pages 1--21

  16. [16]

    Cheng, H., Fang, H., and Ostendorf, M. (2015). Open-Domain Name Error Detection using a Multi-Task RNN . In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages 737--746

  17. [17]

    Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing . Proceedings of the 25th international conference on Machine learning - ICML '08 , 20(1):160--167

  18. [18]

    Crammer, K. and Mansour, Y. (2012). Learning Multiple Tasks Using Shared Hypotheses . Neural Information Processing Systems (NIPS) , pages 1484--1492

  19. [19]

    Daumé III, H. (2009). Bayesian multitask learning with latent hierarchies . pages 135--142

  20. [20]

    Deng, L., Hinton, G. E., and Kingsbury, B. (2013). New types of deep neural network learning for speech recognition and related applications: An overview . 2013 IEEE International Conference on Acoustics, Speech and Signal Processing , pages 8599--8603

  21. [21]

    Duong, L., Cohn, T., Bird, S., and Cook, P. (2015). Low Resource Dependency Parsing: Cross-lingual Parameter Sharing in a Neural Network Parser . Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers) , pages 845--850

  22. [22]

    Evgeniou, T., Micchelli, C. A., and Pontil, M. (2005). Learning multiple tasks with kernel methods . Journal of Machine Learning Research , 6:615--637

  23. [23]

    Evgeniou, T. and Pontil, M. (2004). Regularized multi-task learning . International Conference on Knowledge Discovery and Data Mining , page 109

  24. [24]

    Ganin, Y. and Lempitsky, V. (2015). Unsupervised Domain Adaptation by Backpropagation . In Proceedings of the 32nd International Conference on Machine Learning. , volume 37

  25. [25]

    Girshick, R. (2015). Fast R-CNN . In Proceedings of the IEEE International Conference on Computer Vision , pages 1440--1448

  26. [26]

    Hashimoto, K., Xiong, C., Tsuruoka, Y., and Socher, R. (2016). A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks

  27. [27]

    Heskes, T. (2000). Empirical Bayes for Learning to Learn . Proceedings of the Seventeenth International Conference on Machine Learning , pages 367--374

  28. [28]

    Jacob, L., Bach, F. R., and Vert, J.-P. (2009). Clustered Multi-Task Learning: A Convex Formulation . Advances in Neural Information Processing Systems 21 , pages 745--752

  29. [29]

    Jalali, A., Ravikumar, P., Sanghavi, S., and Ruan, C. (2010). A Dirty Model for Multi-task Learning . Advances in Neural Information Processing Systems

  30. [30]

    Kang, Z., Grauman, K., and Sha, F. (2011). Learning with whom to share in multi-task feature learning . Proceedings of the 28th International Conference on Machine Learning , (4):4--5

  31. [31]

    Kendall, A., Gal, Y., and Cipolla, R. (2017). Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics

  32. [32]

    Kim, S. and Xing, E. P. (2010). Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity . 27th International Conference on Machine Learning , pages 1--14

  33. [33]

    Kumar, A. and Daumé III, H. (2012). Learning Task Grouping and Overlap in Multi-task Learning . Proceedings of the 29th International Conference on Machine Learning , pages 1383--1390

  34. [34]

    Lawrence, N. D. and Platt, J. C. (2004). Learning to learn with the informative vector machine . Twenty-first international conference on Machine learning - ICML '04 , page 65

  35. [35]

    Liu, S., Pan, S. J., and Ho, Q. (2016). Distributed Multi-task Relationship Learning . In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) , pages 751--760

  36. [36]

    Liu, X., Gao, J., He, X., Deng, L., Duh, K., and Wang, Y.-Y. (2015). Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval . NAACL-2015 , pages 912--921

  37. [37]

    Long, M. and Wang, J. (2015). Learning Multiple Tasks with Deep Relationship Networks . arXiv preprint arXiv:1506.02117

  38. [38]

    Lounici, K., Pontil, M., Tsybakov, A. B., and van de Geer, S. (2009). Taking Advantage of Sparsity in Multi-Task Learning . Stat , (1)

  39. [39]

    Lu, Y., Kumar, A., Zhai, S., Cheng, Y., Javidi, T., and Feris, R. (2016). Fully-adaptive Feature Sharing in Multi-Task Networks with Applications in Person Attribute Classification

  40. [40]

    Misra, I., Shrivastava, A., Gupta, A., and Hebert, M. (2016). Cross-stitch Networks for Multi-task Learning . In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

  41. [41]

    Negahban, S. and Wainwright, M. J. (2008). Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization . Advances in Neural Information Processing Systems , pages 1161--1168

  42. [42]

    Ramsundar, B., Kearnes, S., Riley, P., Webster, D., Konerding, D., and Pande, V. (2015). Massively Multitask Networks for Drug Discovery

  43. [43]

    Rei, M. (2017). Semi-supervised Multitask Learning for Sequence Labeling . In Proceedings of ACL 2017

  44. [44]

    Ruder, S., Bingel, J., Augenstein, I., and Søgaard, A. (2017). Sluice networks: Learning what to share between loosely related tasks

  45. [45]

    Saha, A., Rai, P., Daumé, H., and Venkatasubramanian, S. (2011). Online learning of multiple tasks and their relationships . Journal of Machine Learning Research , 15:643--651

  46. [46]

    Søgaard, A. and Goldberg, Y. (2016). Deep multi-task learning with low level tasks supervised at lower layers . Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics , pages 231--235

  47. [47]

    Thrun, S. and O'Sullivan, J. (1996). Discovering Structure in Multiple Learning Tasks: The TC Algorithm . Proceedings of the Thirteenth International Conference on Machine Learning , 28(1):5--5

  48. [48]

    Xue, Y., Liao, X., Carin, L., and Krishnapuram, B. (2007). Multi-Task Learning for Classification with Dirichlet Process Priors . Journal of Machine Learning Research , 8:35--63

  49. [49]

    Yang, Y. and Hospedales, T. (2017a). Deep Multi-task Representation Learning: A Tensor Factorisation Approach . In Proceedings of ICLR 2017

  50. [50]

    Yang, Y. and Hospedales, T. M. (2017b). Trace Norm Regularised Deep Multi-Task Learning . In Workshop track - ICLR 2017

  51. [51]

    Yu, J. and Jiang, J. (2016). Learning Sentence Embeddings with Auxiliary Tasks for Cross-Domain Sentiment Classification . Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP2016) , pages 236--246

  52. [52]

    Yu, K., Tresp, V., and Schwaighofer, A. (2005). Learning Gaussian processes from multiple tasks . Proceedings of the International Conference on Machine Learning (ICML) , 22:1012--1019

  53. [53]

    Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables . Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 68(1):49--67

  54. [54]

    Zhang, C. H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression . Annals of Statistics , 36(4):1567--1594

  55. [55]

    Zhang, Y. and Yeung, D.-y. (2010). A Convex Formulation for Learning Task Relationships in Multi-Task Learning . UAI , pages 733--742

  56. [56]

    Zhang, Z., Luo, P., Loy, C. C., and Tang, X. (2014). Facial Landmark Detection by Deep Multi-task Learning . In European Conference on Computer Vision , pages 94--108